Most of my time these past few months has been taken up almost completely with preparations for the GNU Radio 3.7 release. It's a huge task and has needed a lot of work and careful scrutiny.
But I managed to find some time today to look into a question that's been bugging me for a bit now. A few months ago, we introduced Performance Counters into GNU Radio. These are a way of internally probing statistics about the running radio system. We currently define five PCs for the GNU Radio blocks, and each time a block's work function is called, the instantaneous value, average, and variance of each of the five PCs are calculated and stored. The PCs are really useful for performance analysis of the radio, possibly even leading to an understanding of how to dynamically adjust the flowgraph to alleviate performance issues. But the question that keeps coming up is, "what is the performance cost of calculating the performance counters?"
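To make that cost concrete, here is a rough sketch of the kind of per-call bookkeeping involved. The class, the single-pole smoothing, and the constant `alpha` are my assumptions for illustration, not GNU Radio's actual C++ internals; the five counter names mirror the per-block statistics the PCs track.

```python
# Illustrative sketch of performance-counter bookkeeping (not GNU
# Radio's real implementation, which lives in the C++ scheduler).

class PerfCounter:
    """Tracks the instantaneous value, a running average, and a
    running variance of one statistic."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha   # smoothing factor for the running stats
        self.instant = 0.0
        self.avg = 0.0
        self.var = 0.0

    def update(self, x):
        # The instantaneous value is just the latest sample.
        self.instant = x
        # Single-pole IIR estimates of the average and variance.
        delta = x - self.avg
        self.avg += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)

# Each block carries five such counters, all updated on every work() call.
counters = {name: PerfCounter() for name in
            ("noutput_items", "nproduced", "input_buffers_full",
             "output_buffers_full", "work_time")}
```

Every call to a block's work function pays for five of these `update()` calls, which is exactly the overhead in question.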
So I sat down to figure that out, along with a little side project I was also interested in. My methodology was to create a simple, finite flowgraph that would take some time to process. I would then compare the run time of the flowgraph with the performance counters disabled at compile time, disabled at runtime, and enabled. Disabling at compile time uses #ifdef statements to remove the calls from the scheduler completely, so it's as if the PCs aren't there at all (in fact, they aren't). Disabling at runtime means that the PCs are compiled into GNU Radio but can be turned off with a configuration file or environment variable; a simple runtime if() check on this configuration value determines whether we calculate the counters or not.
The hypothesis here is that disabling at compile time is the baseline control, where we add no extra computation. Compiling the PCs in but turning them off at runtime through the config file will take a small hit because of the extra time to perform the if check. Finally, actually computing the PCs will take the longest, because of all of the calculations required for the three statistics of each of the five PCs for every block.
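The three configurations can be mimicked in miniature. This toy benchmark (all names invented) only illustrates the shape of the experiment, not GNU Radio's actual numbers:

```python
import timeit

# Toy analog of the three configurations: PCs compiled out, PCs
# compiled in but gated off by a flag, and PCs actually computed.
avg = 0.0
ENABLED = False

def work_no_pc(x):
    return x * 2.0                  # "compiled out": no extra code at all

def work_pc_off(x):
    global avg
    if ENABLED:                     # the runtime if() check, branch not taken
        avg += 0.01 * (x - avg)
    return x * 2.0

def work_pc_on(x):
    global avg
    avg += 0.01 * (x - avg)         # the stats are actually computed
    return x * 2.0

for fn in (work_no_pc, work_pc_off, work_pc_on):
    t = timeit.timeit(lambda: fn(1.0), number=100_000)
    print("%s: %.4f s" % (fn.__name__, t))
```

On a modern branch-predicting CPU the gap between the first two cases can be vanishingly small, which is worth keeping in mind when reading the results below.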
The flowgraph used is shown here and is a very simple script:
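A sketch consistent with that script, using the 3.7-era Python API; the specific taps, item type, and names here are my assumptions, not the exact script used for the measurements.

```python
#!/usr/bin/env python
# Sketch of the test flowgraph: null source -> head -> FIR filter ->
# null sink. Requires GNU Radio (3.7-era API); taps are arbitrary.
from gnuradio import gr, blocks, filter

class pc_test(gr.top_block):
    def __init__(self, N=int(1e9)):
        gr.top_block.__init__(self)
        taps = filter.firdes.low_pass(1, 1, 0.4, 0.1)   # arbitrary taps
        src  = blocks.null_source(gr.sizeof_gr_complex)
        head = blocks.head(gr.sizeof_gr_complex, N)     # stop after N items
        fir  = filter.fir_filter_ccf(1, taps)
        snk  = blocks.null_sink(gr.sizeof_gr_complex)
        self.connect(src, head, fir, snk)

if __name__ == '__main__':
    tb = pc_test()
    tb.run()
```

Each configuration was then timed from the shell, along the lines of `time ./pc_test.py`.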
It sets up a null source and sink and a filter with some arbitrary taps of some length. The flowgraph is run for N items (1 billion here). I then used the Linux time command to run and time the flowgraph, keeping the real time for each of 10 runs.
| | PCs Disabled | PCs On | PCs Off | Affinity Off |
|---|---|---|---|---|
| % Diff. Avg | 0.000 | 0.180 | 0.220 | 13.482 |
| % Diff. Min | 0.000 | 0.068 | 0.004 | 12.758 |
| % Diff. Max | 0.000 | 0.480 | 0.748 | 13.792 |
The results were just about as predicted, but also somewhat surprising, in a good way. As predicted, having the PCs compiled out of the scheduler was in fact the fastest. If we look only at the minimum run times out of the 10, then turning the PCs off at runtime was the next fastest, and doing the computations was the slowest. But what's nice to see here is that we're talking much less than 1% difference in speed. Somewhat surprisingly, though, on average and at the max values, disabling at runtime performed worse than enabling the PCs. I can only gather that there was some poor branch prediction going on here.
Whatever the reasons for all of this, the takeaway is clear: the Performance Counters barely impact the performance of a flowgraph, at least for small graphs. But if we're calculating the PCs on all blocks all the time, what happens when we increase the number of blocks in the flowgraph? I'll address that in a second.
First, I wanted to look at what we can do with the thread affinity concept that was also recently introduced. I have an 8-core machine, and the source, sink, and head block take very little processing power. So I pushed all of those onto one core and gave the FIR filter its own core to use. All of the PC tests were done with this thread affinity set. I then re-ran the compile-time-disabled experiment with the thread affinity turned off. The results are fairly shocking: we see a decrease in the speed of this flowgraph of around 13%, just from no longer forcing the OS to keep the threads locked to specific cores. We still need to study this more, such as what happens when we have more blocks than cores and how best to map the blocks to the cores, but the evidence of its benefits here is pretty exciting.
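In GNU Radio the pinning is done per block with `set_processor_affinity()`. The underlying OS mechanism can be seen in isolation with Python's standard library on Linux; this is just a sketch of the OS call, not GNU Radio code:

```python
import os

# Linux-only sketch: pin this process to a single core, then restore.
# GNU Radio does the per-block-thread equivalent via
# block.set_processor_affinity([core]).
original = os.sched_getaffinity(0)   # cores we may currently run on
core = min(original)
os.sched_setaffinity(0, {core})      # lock to one core
print("pinned to core", core)
os.sched_setaffinity(0, original)    # undo the pinning
```

Pinning keeps a block's working set hot in one core's cache instead of letting the scheduler migrate the thread, which is a plausible source of the 13% gap.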
Now, back to the many-block problem. For this, I generated 20 FIR filters, each with different taps (so caching effects can't hide the cost). Because we now have more blocks than I have cores, I'm not going to study the thread affinity concept here; I'll save that for later. Also, I reduced N to 100 million to save a bit of time, since the results have been pretty consistent.
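A sketch of that 20-filter chain, again assuming the 3.7-era Python API; the tap-generation scheme (a different cutoff per filter) is my invention to guarantee every block gets distinct coefficients.

```python
# Sketch of the many-block variant: 20 FIR filters in series, each
# with its own taps. Requires GNU Radio (3.7-era API).
from gnuradio import gr, blocks, filter

class pc_test_many(gr.top_block):
    def __init__(self, N=int(100e6), nfilters=20):
        gr.top_block.__init__(self)
        src  = blocks.null_source(gr.sizeof_gr_complex)
        head = blocks.head(gr.sizeof_gr_complex, N)
        snk  = blocks.null_sink(gr.sizeof_gr_complex)
        # A different cutoff per filter gives every block distinct taps.
        firs = [filter.fir_filter_ccf(
                    1, filter.firdes.low_pass(1, 1, 0.05 + 0.01 * i, 0.1))
                for i in range(nfilters)]
        self.connect(src, head, firs[0])
        for a, b in zip(firs, firs[1:]):
            self.connect(a, b)
        self.connect(firs[-1], snk)
```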
| | PCs Disabled | PCs Off | PCs On |
|---|---|---|---|
| % Diff. Avg | 0.000 | 0.655 | 0.567 |
| % Diff. Min | 0.000 | 0.460 | 0.545 |
| % Diff. Max | 0.000 | 0.736 | 0.904 |
These results show us that even for a fairly complex flowgraph with over 20 blocks, the performance penalty is around half a percent, to close to one percent in the worst case.
My takeaway from this is that we are probably being a bit over-zealous by having both the compile-time and run-time configuration of the performance counters. Toggling the PCs off at runtime also seems to hurt almost as often as it helps. This might make me change the defaults so that the PCs are enabled at compile time instead of being disabled. Users who want to get that half a percent more efficiency out of their graph can go through the trouble of disabling them manually.