Wednesday
Jun122013

Nearly 50 Minutes of Volk!

I want to announce that a slide show plus audio of me talking about our Volk library has been published by the IEEE Signal Processing Society here:

http://www.brainshark.com/SPS/vu?pi=zH8zQcV8dzAXPbz0

This is based on a paper and presentation to the Wireless Innovation Forum's annual Software Radio Conference. It goes over the motivation and background of Volk and into how to use it in your own projects. This should extend easily into using it with GNU Radio blocks (and there are plenty of examples of GNU Radio blocks using Volk, as well). It does not explain how to build new kernels in Volk, however. But at 50 minutes, it seemed like it was already going too long.

 

Wednesday
May082013

Performance Counter Performance

Most of my time these past few months has been taken up almost completely with the preparations for releasing 3.7 of GNU Radio. It's a huge task and has needed a lot of work and careful scrutiny.

But I managed to find some time today to look into a question that's been bugging me for a bit now. A few months ago, we introduced the Performance Counters into GNU Radio. These are a way of internally probing statistics about the running radio system. We have currently defined 5 PCs for the GNU Radio blocks, and each time a block's work function is called, the instantaneous, average, and variance of all five PCs are calculated and stored. The PCs are really useful for performance analysis of the radio, possibly even leading to an understanding of how to dynamically adjust the flowgraph to alleviate performance issues. But the question that keeps coming up is, "what is the performance cost of calculating the performance counters?"

So I sat down to figure that out, along with a little side project I was also interested in. My methodology was just to create a simple, finite flowgraph that would take some time to process. I would then compare the run time of the flowgraph with the performance counters disabled at compile time, disabled at runtime, and enabled. Disabling at compile time uses #ifdef statements to remove the calls from the scheduler completely, so it's like the PCs aren't there at all (in fact, they aren't). Disabling at runtime means that the PCs are compiled into GNU Radio but they can be disabled with a configuration file or environmental variable. This means we do a simple runtime if() check on this configuration value to determine if we calculate the counters or not.

The hypothesis here is that disabling at compile time is the baseline control where we add no extra computation. Compiling them in but turning them off at runtime through the config file will take a small hit because of the extra time to perform the if check. Finally, actually computing the PCs will take the most amount of time to run the flowgraph because of the all of the calculations required for the three statistics of each of the 5 PCs for every block.

The flowgraph used is shown here and is a very simple script:

-------------------------------------------------------------------------------

from gnuradio import gr, blocks, filter
import scipy
def main():
    N = 1e9
    taps = scipy.random.random(100)
    
    src = blocks.null_source(gr.sizeof_gr_complex)
    hed = blocks.head(gr.sizeof_gr_complex, int(N))
    op  = filter.fir_filter_ccf(1, taps)
    snk = blocks.null_sink(gr.sizeof_gr_complex)
    op.set_processor_affinity([7,])
    src.set_processor_affinity([6,])
    hed.set_processor_affinity([6,])
    #snk.set_processor_affinity([6,])
    
    tb = gr.top_block()
    tb.connect(src, hed, op, snk)
    tb.run()
if __name__ == "__main__":
    main()

-------------------------------------------------------------------------------

It sets up a null source and sink and a filter of some arbitrary taps and some length. The flowgraph is run for N items (1 billion here). I then just used the Linux time command to run and time the flowgraph. I then keep the real time for each of 10 runs.


PCs Disabled PCs On PCs Off Affinity Off

54.61 54.632 55.141 62.648

54.35 54.655 54.451 61.206

54.443 55.319 54.388 61.787

54.337 54.718 54.348 62.355

54.309 54.415 55.467 61.729

55.055 54.318 54.314 61.526

54.281 54.651 54.581 62.036

54.935 54.487 55.226 61.788

54.316 54.595 54.283 61.817

54.956 54.785 54.593 62.255
Avg. 54.5592 54.6575 54.6792 61.9147
Min. 54.281 54.318 54.283 61.206
Max. 55.055 55.319 55.467 62.648





% Diff. Avg 0.000 0.180 0.220 13.482
% Diff. Min 0.000 0.068 0.004 12.758
% Diff. Max 0.000 0.480 0.748 13.792

 

The results were just about as predicted, but also somewhat surprising, but in a good way. As predicted, having the PCs compiled out of the scheduler was in fact the fastest. If we look only at the minimum run times out of the 10, then turning the PCs off at runtime was the next fastest, and doing the computations was the slowest. But what's nice to see here is that we're talking much less than 1% difference in speed. Somewhat surprisingly, though, on average and looking at the max values, the runtime disabling performed worse than enabling the PCs. I can only gather that there was some poor branch prediction going on here.

Whatever the reasons for all of this, the take-away is clear: the Performance Counters barely impact the performance of a flowgraph. At least for small graphs. But if we're calculating the PCs on all blocks all the time, what happens when we increase the number of blocks in the flowgraph? I'll address that in a second.

First, I wanted to look at what we can do with the thread affinity concept also recently introduced. I have an 8-core machine and the source, sink, and head block take very little processing power. So I pushed all of those onto one core and gave the FIR filter its own core to use. All of the PC tests were done setting this thread affinity. So then I turned the thread affinity off while computing the compile-time disabled experiments. The results are fairly shocking. We see a decrease in the speed of this flowgraph of around 13%! Just by forcing the OS to keep the threads locked to a specific core. We still need to study this more, such as what happens when we have more blocks than cores as well as how best to map the blocks to the cores, but the evidence here of it's benefits is pretty exciting.

Now, back to the many-block problem. For this, I will generate 20 FIR filters, each with different taps (so we can't have cache hiding issues). For this, because we have more blocks than I have cores, I'm not going to study the thread affinity concept. I'll save that for later. Also, I reduced N to 100 million to save a bit of time since the results are being pretty consistent.


PCs Disabled PCs Off PCs On

29.670 29.868 29.725

29.754 29.721 29.787

29.702 29.735 29.800

29.708 29.700 29.767

29.592 29.932 30.023

29.595 29.877 29.784

29.564 29.964 29.873

29.685 29.926 29.794

29.719 29.973 29.902

29.646 29.881 29.861
Avg. 29.664 29.858 29.832
Min. 29.564 29.700 29.725
Max. 29.754 29.973 30.023




% Diff. Avg 0.000 0.655 0.567
% Diff. Min 0.000 0.460 0.545
% Diff. Max 0.000 0.736 0.904

 

These results show us that even for a fairly complex flowgraph with over 20 blocks, the performance penalty is around a half a percent to close to a percent in the worst case.

My takeaway from this is that we are probably being a bit over-zealous by having both the compile-time and run-time configuration of the performance counters. It also looks as though the runtime toggle of the PCs on and off seems to almost hurt more than it helps. This might make me change the defaults to have the PCs enabled at compile time by default instead of being disabled. For users who want to get that half a percent more efficiency in their graph, they can go through the trouble of disabling it manually.

Sunday
Jan132013

DARPA Spectrum Challenge

Announced mid last week, DARPA will be sponsoring a Spectrum Challenge. This is a huge opportunity for the radio field. For years, we have been researching issues of spectrum sharing (or Dynamic Spectrum Access (DSA)). While we've generated a lot of good ideas, we haven't yet seen these ideas take shape in real, provable systems. Too often, the work is a simulation or an experimental test bed with too many controlled parameters.

DARPA, through their Spectrum Challenge, has a chance to change this. Forcing teams to develop and compete against each other as well as other interfering radios means that we have to think about real, unexpected challenges to our ideas. We will have to develop both robust algorithms and robust systems. What will result will almost certainly be a large number of advances in the science, technology, and understanding of the coexistence of radios.

From my perspective as the maintainer of GNU Radio, this is a great opportunity for us, too. While DARPA is not mandating the use of GNU Radio, they are requiring that teams demonstrate competency in our software. And since the final challenge will be done using USRPs, I hope that many teams will continue to use GNU Radio as their platform of choice. As I said, the challenge will not only involve developing robust spectrum sharing algorithms, it will also demand robust platforms. GNU Radio is well-known, well-tested, and has an active, educated community of users, and so is a perfect platform to build upon.

As the head of GNU Radio, I will not be participating directly in the competition. I hope to be able to advise and help all teams as I am able, and I do not want to be biased by any stake I have. Personally, my stake in this competition is the advancement of the science and technology of DSA as well as the opportunity it provides for GNU Radio.

For more details of the chellenge, visit DARPA's website.

Their Q&A page is a good, quick read over the main aspects of the challenge to get up to speed.

Friday
Feb242012

Some Really Cool DSP

I finally took the time to really understand the polyphase synthesis filterbank and how to use it to reconstruct previously channelized signals. Once again, I looked to fred harris' work on the subject and have cracked it. The keys are in creating an M-to-2fs PFB channelizer, a 2-to-Pn PFB synthesizer, and perfect reconstruction filters (-6 dB at the edge of the passband).

I plan on writing up a more thurough look into this, and all of the code to do the proper synthesis filter will be going into GNU Radio soon, but for now, some pretty pictures.

First, we start off with a QPSK signal with very little noise:

Here, I split the signal up into 4 equally spaced channels. Channel 0 is around DC (0 Hz) and goes from -1000 - 1000 Hz, channel 1 is from 1000 - 3000 Hz, and channel 3 is from -3000 to -1000 Hz. Now, channel 2 actually spans from 3000 - 4000 and then wraps around to go from -4000 to -3000 Hz. These have a 1000 Hz bandwidth but are sampled at 2 samps/symbol so the channel sample rate is 2000 Hz.

I then go through the 2-to-Pn synthesizer where Pn is actually 8. This synthesis filter takes 4 channels in and produces four channels but at twice the sampling rate, so we now have a signal that is the original signal but sampled at twice the orignal rate. Notice also that it's offset by a bit in frequency in the PSD. That's a result of my plotting that I don't feel like correcting (it's related to the reason why channel 2 spans the edge of the positive and negative spectrum). Obviously, the constellation is not rotating, so we're ok.

Notice also that this signal was first split up, filtered, and then put back together again. Perfectly. That constellation should be enough proof of that. This has some huge implications to what we can do with this concept once I generalize it. This was a simple demonstration to get the algorithm and math correct, but now we can have some serious fun!

For those of you into such things, this is the filter I used for this example:

I generated it using GNU Radio firdes.low_pass_2 filter generator with these parameters:

  • Gain = 1
  • Sample Rate = 8000 Hz
  • Bandwidth = 1000 Hz
  • Transition bandwidth = 200 Hz
  • Attenuation = 80 dB
  • Window: Blackman-harris

 

Friday
Feb172012

Volk Benchmarking

Benchmarking Volk in GNU Radio

The intention of Volk is to increase the speed of our signal processing capabilities in GNU Radio, and so there needs to be a way to look at this. In particular, there were some under-the-hood changes to the scheduler to allow us to more efficiently use Volk (by trying to provide only aligned Volk call; see Volk Integration to GNU Radio for more on that). These changes ended up putting more lines of code into the scheduler so that every time a block's work function is called, the scheduler has more computations and logic to perform.

Because of these changes, I am interested in understanding the performance hit taken by the change in the scheduler as well as the performance gain we would get from using Volk. If the hit to the scheduler is less than a few percentage points while the gain from Volk is much larger, then we win.

Performance Measuring

There is a lot of debate about what the right performance measurement is for something like this. In signal processing algorithms, we are interested in looking at the speed that it can process a sample (or bit, symbol, etc.), so a time-based measurement is what we are going to look at. Specifically, how much time does it take a block to process $N$ number of samples?

If we are interested in timing a block, the next question is to ask what clock to use? And if we look into this, everyone has their own opinion on it. There's wall time, but that's suspect because it doesn't account for interruptions by the OS. There's the user and system times, but they don't seem to really represent the time it actually takes a program to produce the output; and do we combine those times or just use one of them? This also really represents a lower bound if no sharing were occurring and with no other system overhead.

In the end, I decided what I cared about, and what our users would care about, is the expected time taken by a block to run. So I'm going with the wall clock method here. Then there's the question of mean, min, or max time? They all represent different ways to look at the results. It is, frankly, easy enough to capture all three measurements and let you decided later which is important (for that matte, it would be an easy edit to the benchmark tools to also collect the user and system time for those who want that info, too).

The results shown in this post simply represent the mean of the wall time for a certain number of iterations for processing a certain number of samples for each block. I am also going to show the results from only one machine here to keep this post relatively short. 

Measurement Tools

I built a few measurement tools to both help me keep track of things and allow anyone else who wants to test their system's performance to do so easily. These tools are located in gnuradio- examples/python/volk_benchmark. It includes two Python programs for collecting the data and a plotting program to display the data in different ways. I won't repost here the details of how to use them. There's a lengthy and hopefully complete README in the directory to describe their use.

Measurement Results

For these measurements, I have two data collection programs: volk_math.py and volk_types.py. The first one runs through all of the math functions that were converted to using Volk and the second runs through all of the type conversions that were 'Volkified.' These could have easily been done as one program, but it makes a bit of logical sense to separate them.

The system I ran these tests on is an Intel quad-core i7 870 (first gen) at 2.93 GHz with 8 GB of DDR3 RAM. It has support for these SIMD architectures: sse sse2 ssse3 sse4_1 sse4_2.

I'm interested in comparing the results of three cases. The first case is the 'control' experiment, which is the 3.5.1 version of GNU Radio which has no edits to the scheduler or the use of Volk. Next, I want to look at the scheduler with the edits but still no Volk, which I will refer to as the 3.5.2 version of GNU Radio. The 'volk' case is the 3.5.2 version that uses Volk for the tests.

The easiest way to handle these cases was to have two parallel installs, one for version 3.5.1 and the other for 3.5.2. To test the Volk and non-Volk version of 3.5.2, I simply edited the ~/.volk/volk_config file and switch all kernels to use the 'generic' version (see the README file in the volk_benchmark directory for more details on this).

For the results shown below, click on the image for an enlarged version of the graph.


Looking at the type conversion blocks, we get the following graph:

Volk Type Conversion Results

Another way to look at the results is to look at the percent speed difference between the 3.5.2 versions and the 3.5.1. So this graph shows us how much increase (positive) or decrease (negative) of speed the two cases have over the 3.5.1 control case.

Percent Improvement Over v3.5.1 for Type Conversion Blocks

 

These are the same graphs for the math kernels.

Volk Math Results
 Percent Improvement Over v3.5.1 for Math Blocks

There are two interesting trends here. The most uninteresting one is that Volk generally provides a massive improvement in the speed of the blocks, the more complicated the block (like complex multiplies) the more we gain from using Volk.

The first really interesting result is the improvement in speed between the schedulers from 3.5.1 and 3.5.2. As I mentioned earlier, we increased the number of lines of code in the scheduler that make calculations and logic and branching calls. I expected us to do worse because of this. My conjecture here is that by providing mostly aligned blocks of memory, there is something happening with data movement and/or the cache lines that is improved. So simply aligning the data (as much as possible) is a win even without Volk.

The other area this interesting is that in the rare case, the Volk call comes out to be worse than the generic and/or the v3.5.1 version of the block. The only math call where this happens is with the conjugate block. I can only assume that conjugating a complex number is so trivial (the sign flip of the imaginary part) that the code for it is highly optimize already. We are, though, talking about less than 5% hit on the performance, though. On the other hand, the multiply conjugate block, which is mostly when the conjugate is used in signal processing, is around 350% faster.

The complex to float conversion is a bit more of a head scratcher. Again, though, we are talking about a minor (< 3%) difference. But stiil, that these do not perform better is really interesting. Hopefully, we can analyze this farther and come up with some conclusions as to why this is occuring and maybe even improve the performance more.