Most of my time these past few months has been taken up almost completely with preparations for the GNU Radio 3.7 release. It's a huge task and has needed a lot of work and careful scrutiny.
But I managed to find some time today to look into a question that's been bugging me for a bit now. A few months ago, we introduced Performance Counters into GNU Radio. These are a way of internally probing statistics about the running radio system. We currently define five PCs for the GNU Radio blocks, and each time a block's work function is called, the instantaneous value, average, and variance of each of the five PCs are calculated and stored. The PCs are really useful for performance analysis of the radio, possibly even leading to an understanding of how to dynamically adjust the flowgraph to alleviate performance issues. But the question that keeps coming up is, "what is the performance cost of calculating the performance counters?"
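To make that cost concrete, here is a rough sketch of the kind of per-call bookkeeping involved. The class, the single-pole smoothing, and the constant `alpha` are my assumptions for illustration, not GNU Radio's actual C++ internals; the five counter names mirror the per-block statistics the PCs track.

```python
# Illustrative sketch of performance-counter bookkeeping (not GNU
# Radio's real implementation, which lives in the C++ scheduler).

class PerfCounter:
    """Tracks the instantaneous value, a running average, and a
    running variance of one statistic."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha   # smoothing factor for the running stats
        self.instant = 0.0
        self.avg = 0.0
        self.var = 0.0

    def update(self, x):
        # The instantaneous value is just the latest sample.
        self.instant = x
        # Single-pole IIR estimates of the average and variance.
        delta = x - self.avg
        self.avg += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)

# Each block carries five such counters, all updated on every work() call.
counters = {name: PerfCounter() for name in
            ("noutput_items", "nproduced", "input_buffers_full",
             "output_buffers_full", "work_time")}
```

Every call to a block's work function pays for five of these `update()` calls, which is exactly the overhead in question.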
So I sat down to figure that out, along with a little side project I was also interested in. My methodology was to create a simple, finite flowgraph that would take some time to process. I would then compare the run time of the flowgraph with the performance counters disabled at compile time, disabled at runtime, and enabled. Disabling at compile time uses #ifdef statements to remove the calls from the scheduler completely, so it's as if the PCs aren't there at all (in fact, they aren't). Disabling at runtime means that the PCs are compiled into GNU Radio but can be turned off with a configuration file or environment variable; a simple runtime if() check on this configuration value determines whether we calculate the counters or not.
The hypothesis here is that disabling at compile time is the baseline control, where we add no extra computation. Compiling the PCs in but turning them off at runtime through the config file will take a small hit because of the extra time to perform the if check. Finally, actually computing the PCs will take the longest, because of all of the calculations required for the three statistics of each of the five PCs for every block.
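The three configurations can be mimicked in miniature. This toy benchmark (all names invented) only illustrates the shape of the experiment, not GNU Radio's actual numbers:

```python
import timeit

# Toy analog of the three configurations: PCs compiled out, PCs
# compiled in but gated off by a flag, and PCs actually computed.
avg = 0.0
ENABLED = False

def work_no_pc(x):
    return x * 2.0                  # "compiled out": no extra code at all

def work_pc_off(x):
    global avg
    if ENABLED:                     # the runtime if() check, branch not taken
        avg += 0.01 * (x - avg)
    return x * 2.0

def work_pc_on(x):
    global avg
    avg += 0.01 * (x - avg)         # the stats are actually computed
    return x * 2.0

for fn in (work_no_pc, work_pc_off, work_pc_on):
    t = timeit.timeit(lambda: fn(1.0), number=100_000)
    print("%s: %.4f s" % (fn.__name__, t))
```

On a modern branch-predicting CPU the gap between the first two cases can be vanishingly small, which is worth keeping in mind when reading the results below.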
The flowgraph used is shown here and is a very simple script:
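A sketch consistent with that script, using the 3.7-era Python API; the specific taps, item type, and names here are my assumptions, not the exact script used for the measurements.

```python
#!/usr/bin/env python
# Sketch of the test flowgraph: null source -> head -> FIR filter ->
# null sink. Requires GNU Radio (3.7-era API); taps are arbitrary.
from gnuradio import gr, blocks, filter

class pc_test(gr.top_block):
    def __init__(self, N=int(1e9)):
        gr.top_block.__init__(self)
        taps = filter.firdes.low_pass(1, 1, 0.4, 0.1)   # arbitrary taps
        src  = blocks.null_source(gr.sizeof_gr_complex)
        head = blocks.head(gr.sizeof_gr_complex, N)     # stop after N items
        fir  = filter.fir_filter_ccf(1, taps)
        snk  = blocks.null_sink(gr.sizeof_gr_complex)
        self.connect(src, head, fir, snk)

if __name__ == '__main__':
    tb = pc_test()
    tb.run()
```

Each configuration was then timed from the shell, along the lines of `time ./pc_test.py`.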
It sets up a null source and sink and a filter with some arbitrary taps of some length. The flowgraph is run for N items (1 billion here). I then used the Linux time command to run and time the flowgraph, keeping the real time for each of 10 runs.
| | PCs Disabled | PCs On | PCs Off | Affinity Off |
|---|---|---|---|---|
| % Diff. Avg | 0.000 | 0.180 | 0.220 | 13.482 |
| % Diff. Min | 0.000 | 0.068 | 0.004 | 12.758 |
| % Diff. Max | 0.000 | 0.480 | 0.748 | 13.792 |
The results were just about as predicted, but also somewhat surprising, in a good way. As predicted, having the PCs compiled out of the scheduler was in fact the fastest. If we look only at the minimum run times out of the 10, then turning the PCs off at runtime was the next fastest, and doing the computations was the slowest. But what's nice to see here is that we're talking much less than 1% difference in speed. Somewhat surprisingly, though, on average and at the max values, disabling at runtime performed worse than enabling the PCs. I can only gather that there was some poor branch prediction going on here.
Whatever the reasons for all of this, the takeaway is clear: the Performance Counters barely impact the performance of a flowgraph, at least for small graphs. But if we're calculating the PCs on all blocks all the time, what happens when we increase the number of blocks in the flowgraph? I'll address that in a second.
First, I wanted to look at what we can do with the thread affinity concept that was also recently introduced. I have an 8-core machine, and the source, sink, and head block take very little processing power. So I pushed all of those onto one core and gave the FIR filter its own core to use. All of the PC tests were done with this thread affinity set. I then re-ran the compile-time-disabled experiment with the thread affinity turned off. The results are fairly shocking: we see a decrease in the speed of this flowgraph of around 13%, just from no longer forcing the OS to keep the threads locked to specific cores. We still need to study this more, such as what happens when we have more blocks than cores and how best to map the blocks to the cores, but the evidence of its benefits here is pretty exciting.
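In GNU Radio the pinning is done per block with `set_processor_affinity()`. The underlying OS mechanism can be seen in isolation with Python's standard library on Linux; this is just a sketch of the OS call, not GNU Radio code:

```python
import os

# Linux-only sketch: pin this process to a single core, then restore.
# GNU Radio does the per-block-thread equivalent via
# block.set_processor_affinity([core]).
original = os.sched_getaffinity(0)   # cores we may currently run on
core = min(original)
os.sched_setaffinity(0, {core})      # lock to one core
print("pinned to core", core)
os.sched_setaffinity(0, original)    # undo the pinning
```

Pinning keeps a block's working set hot in one core's cache instead of letting the scheduler migrate the thread, which is a plausible source of the 13% gap.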
Now, back to the many-block problem. For this, I generated 20 FIR filters, each with different taps (so caching effects can't hide the cost). Because we now have more blocks than I have cores, I'm not going to study the thread affinity concept here; I'll save that for later. Also, I reduced N to 100 million to save a bit of time, since the results have been pretty consistent.
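A sketch of that 20-filter chain, again assuming the 3.7-era Python API; the tap-generation scheme (a different cutoff per filter) is my invention to guarantee every block gets distinct coefficients.

```python
# Sketch of the many-block variant: 20 FIR filters in series, each
# with its own taps. Requires GNU Radio (3.7-era API).
from gnuradio import gr, blocks, filter

class pc_test_many(gr.top_block):
    def __init__(self, N=int(100e6), nfilters=20):
        gr.top_block.__init__(self)
        src  = blocks.null_source(gr.sizeof_gr_complex)
        head = blocks.head(gr.sizeof_gr_complex, N)
        snk  = blocks.null_sink(gr.sizeof_gr_complex)
        # A different cutoff per filter gives every block distinct taps.
        firs = [filter.fir_filter_ccf(
                    1, filter.firdes.low_pass(1, 1, 0.05 + 0.01 * i, 0.1))
                for i in range(nfilters)]
        self.connect(src, head, firs[0])
        for a, b in zip(firs, firs[1:]):
            self.connect(a, b)
        self.connect(firs[-1], snk)
```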
| | PCs Disabled | PCs Off | PCs On |
|---|---|---|---|
| % Diff. Avg | 0.000 | 0.655 | 0.567 |
| % Diff. Min | 0.000 | 0.460 | 0.545 |
| % Diff. Max | 0.000 | 0.736 | 0.904 |
These results show us that even for a fairly complex flowgraph with over 20 blocks, the performance penalty is around half a percent, to close to one percent in the worst case.
My takeaway from this is that we are probably being a bit over-zealous by having both the compile-time and run-time configuration of the performance counters. Toggling the PCs off at runtime also seems to hurt almost as often as it helps. This might make me change the defaults so that the PCs are enabled at compile time instead of being disabled. Users who want to get that half a percent more efficiency out of their graph can go through the trouble of disabling them manually.