Blog

FOSS SDR at W@VT

June 8, 2015

Hey, that title's not annoying at all, is it? This post discusses our forum on Free and Open Source Software (FOSS) for Software Defined Radio (SDR) at the Wireless@VT (Virginia Tech) Symposium a couple of weeks ago.

We brought together five speakers from different areas to talk about how FOSS works in their world of radio and SDR. I talked about the GNU Radio project, Philip Balister spoke on OpenEmbedded and how we use it for SDR development on embedded systems, Tim O'Shea from the Hume Center of Virginia Tech talked on their use of FOSS in research in both wireless and machine learning, Rick Mellendick from Signals Defense spoke about how FOSS has enabled work in wireless security and penetration testing, and Ben Hilburn from Ettus Research was there to speak about his perspectives on FOSS from an industry point of view.

The main intent behind this tutorial was to expose the audience to a lot of projects and tools as well as ideas that the FOSS world can offer. The various perspectives on this were to showcase how wide-reaching FOSS is in technology, business, concepts, and intent. Essentially, we wanted to help add to the discussion about how the tools and technology can impact work in various fields.

There is a lot of misunderstanding about FOSS in our world. In the world generally, sure, but I am specifically talking about wireless and signal processing where we are not brought up on open source. We see the misunderstanding as well as mistrust of it from traditional engineers and engineering. It is often seen as something that's fun as a toy or hobby, but not for real work. Plus the big question of monetization. In the five talks in the panel, I think we exposed a lot of powerful FOSS technology and projects as well as explained some of the behavior and philosophy of FOSS. You can download the presentations in the list of talks below.

Tom Rondeau: The GNU Radio Ecosystem
Philip Balister: Software Development for Embedded GNU Radio Applications
Tim O'Shea: Leveraging GNU Radio and FOSS for Academic Research
Rick Mellendick: FOSS in RF Penetration Testing
Ben Hilburn: FOSS and Industry

I also really want to apologize publicly again to Ben Hilburn for running out of time. I completely misjudged how much time we had available, but he's been gracious enough to provide his slides here.

What's the right way to calculate tanh?

September 1, 2014

I'm working on improving the synchronization of M-PSK phase-locked loops, which we call the Costas loop in GNU Radio. I'll post later about what I'm doing and why, but the results are looking very satisfying. However, during this work, I need to calculate a tanh. In my simulation proving the algorithm, I did it all in Python and wasn't concerned about speed. Moving this to a GNU Radio block and the thought of calling a trigonometric function from libm gave me chills. So I thought that I'd look into faster approximations of the tanh function.

Spoiler alert: this is a lesson in optimization and never trusting your gut. The standard C version of tanh is amazingly fast these days and the recommended function to use.

So what are our options for calculating a tanh? I'll go through four here, some of which I had help from the #gnuradio IRC chatroom to explore.

Calling tanhf(beta) straight from libm.
Calculating from an exponential: (expf(2*beta)+1) / (expf(2*beta)-1)
Using a look-up table (LUT)
An expansion approximation from stackexchange (the last equation from July 18, 2013)

There are, I'm sure, other algorithms that will calculate a tanh. I know that we can use the CORDIC algorithm in hyperbolic mode, though my tests of the CORDIC algorithm, while appropriate in hardware, are not good in software. However, we'll see that we don't really need to go to great lengths here before we can be satisfied.

The code I used to test is rather simple. I time using GNU Radio's high_res_timer to get the start and end time of a loop. The loop itself runs for N floating point samples, and each input sample is calculated using the standard random() function. I want to make sure each number into the function under test is different so we don't benefit from caching. Just putting in 0's makes the test run almost instantaneously, because of the known special case that tanh(0) is 0.

Within each loop iteration, it calculates a new value of beta, the input sample, and one of the four methods listed above. The tests are performed on an Intel i7 870 processor running at 2.93 GHz as well as an ARMv7 Cortex-A9 (using an Odroid-U3) at 1.7 GHz. Tests were run with 100 million floats and compiled using G++ with -O3 optimization. Download the code here.

Results 1

In the following tables of results, the "none" category is the baseline for just calculating the random samples for beta; there is no tanh calculation involved. All values are in seconds.

Intel i7 870

none: 0.682541
libm: 0.700341
lut: 0.93626
expf: 0.764546
series approx: 1.39389

ARMv7 Cortex-A9

none: 6.12575
libm: 6.12661
lut: 8.55903
expf: 7.3092
series approx: 8.71788

Results 2

I realized after collecting the data that each of the three methods I developed, the LUT, exponential, and series approximation method were all done as function calls. I'm not sure how well the compiler might do with inlining each of these, so I decided to inline each function myself and see how that changed the results.

Intel i7 870

none: 0.682541
libm: 0.700341
lut: 0.720641
expf: 0.81447
series approx: 0.695086

ARMv7 Cortex-A9

none: 6.12575
libm: 6.12661
lut: 6.12733
expf: 7.31014
series approx: 6.12655

Discussion

The inlined results are much better. Each of the approximation methods now starts to give us some decent results. The LUT is only 256 values, so the approximation is off by up to about 0.01 for different inputs, and it still doesn't actually beat libm's implementation. Calculating with expf was never going to work, though it is surprisingly not bad for having that divide in it. Raw comparisons show that libm's expf is actually just by itself slower than tanh, so we were never going to get a win here.

Finally, the series approximation that we use here is pretty close, maybe even somewhat faster. I didn't run these multiple times to get an average or minimum, though, so we can only say that this value is on par with the libm version. I also haven't looked at how far off the approximation is because the libm version is so fast comparatively that I don't see the need, and this is true for both Intel and ARM processors.

I'm surprised, though pleasantly so, that for all of this work, we find that the standard C library implementation of the "float tanhf(float x)" function is about as fast as we can expect. On an Intel processor, we computed 100 million floats, which added 17.8ms to the calculation of just the random numbers, which means that each tanh calculation only took 178 ps to compute. I actually have a hard time wrapping my head around that number. I'm sure that the compiler is doing some serious loop optimization seeing as this is the only thing going on inside here. Still, until I see this being a problem inside of a real algorithm, I'm satisfied with the performance to just use the standard C library's tanhf function.

Further Discussion

One of the big problems with benchmarking and providing results for general purpose processors is that you know that the results are so directly tied to your processor version and generation. Using a Macbook Pro from 2014 with a newer i7 processor gave similar relative results but with much better speeds, even though it only runs at 2.6 GHz instead of 2.93 GHz. But that's technology for you. The important lesson here is that the processor and compiler designers are doing some tremendous work making functions we used to never even think about using wonderfully fast.

But the other thing to note is that YMMV. It used to be common knowledge that we want to use look-up tables wherever possible with GPPs. That seems to have been true once, but the processor technology has improved so much over memory and bus speeds that any memory access, even from cache, is too slow. But this means that if you are running an older processor, you might suffer a bit extra based on these results off more current computers (then again, the Intel tests are from a processor that's roughly three years old). What is really noteworthy is that the ARM showed similar trends with the results, meaning we don't have to think about doing anything differently between ARM and Intel platforms. Still, you might want to bechmark your own system, just in case.

We could also think about using VOLK to code up each of these methods and let the VOLK profiler take care of finding the right solution. I'm against this right now only in that we don't have good measurements or even concerns about the approximations, like with the LUT method. We are not only trading off speed, but also a huge amount of accuracy. In my application, I wasn't concerned about that level of accuracy, but VOLK will and should for other applications.

Finally, one thing that I didn't test much is compiler flags. The tests above used -O3 for everything, but the only other setting I tested was using the new -Ofast, which also applies the -ffast-math. I found that some versions of the algorithm improved in speed by about 10ms and others reduced in speed by about 10ms. So it didn't do a whole lot for us in this instance. There may, of course, be other flags that we could look into for more gains. Similar mixed results were found using Clang's C++ compiler.

I'd like to thank Brian Padalino, Macus Leech, Tim O'Shea, and Rick Farina for chatting about this on IRC and poking me with new and different things to try.

Update on 2014-09-02 14:15 by Tom Rondeau

Brian Padalino got back to me this morning after having similar trouble with the fact that it looked like tanh was being computed in picoseconds, except he went and did something about it. I thought that by recalculating a random result for beta in every iteration of the loop that I was preventing anything from caching, since the tanh functions would always have to operate on a new value. However, I'm guessing that if I looked into and parsed the assembly output for my code, I would see some optimizations related to the fact that the loops were just getting a new random number and calling the tanh functions on it. There was definitely some optimizations happening here that were not helpful to us understanding these calculations in a real-world piece of code.

Instead, Brian added the C/C++ keyword volatile to the x, y, and z variables in the code. This prevents the compilers from optimizing the use of these variables, which means that whatever cleverness was happening above wouldn't anymore. The numbers now start to look much more reasonable from our expectations. But also, I assume, more reasonable in how the tanh function would really be called in our applications.

This is a really good lesson for me since I've struggled with this concept of benchmarking code like this and worrying about over-optimization by the compiler. So the new results are:

Intel i7 870

none: 0.774824
libm: 4.94122
lut: 0.92123
expf: 2.48623
series approx: 1.61947

ARMv7 Cortex-A9

none: 6.72212
libm: 19.2388
lut: 8.88899
expf: 17.2966
series approx: 10.8309

So the LUT and series approximation make a big difference now. Somewhat disappointing to see this after yesterday's post, but I believe these numbers are more representative.

Approximation Error

Let's now look at how good each of these approximations are. The mean difference between the approximations and the standard tanh are:

lut: 0.00317092
exp: 1.53656e-8
ser: 2.48358e-8

As expected, the LUT performs the worst since it wasn't designed with any high accuracy, and the series approximation is very, very good. It should come as no surprise that the exponential equation is as good as it is since it's just an alternative form of the tanh.

And below is what the errors look like. Again, not surprisingly, the LUT version dominates the errors. And since the LUT was somewhat arbitrarily kept between -2 and +2, the discontinuity there is expected. For the application I have in mind, the values should be within this range, anyways. A look at just the series approximation is shown as well so we can just see that one.

Peer Review of a DySPAN Paper

April 23, 2014

One of the technology papers at DySPAN that caught my attention was called "Reconfigurable Wireless Platforms for Spectrally Agile Coexistence" by Rohan Grover, Samuel J. MacMullan, and Alex Wyglinski. Their interest is in providing OFDM waveforms with subcarriers cut out in such a way that the resulting spectrum hole is still deep enough to allow for another radio to use it. Normally, just nulling out subcarriers leaves a lot of power from the side-lobes of the other carriers. So what they suggested instead was the use of IIR filters to provide cheap, sharp notches at these nulled-out subcarriers. The paper explains the development of the IIR filters, in which they have a subset of pre-built and stable filters to meet their performance requirements. They select the a set of filters to use and combine them to provide band notching. Read the paper for more details about what, why, and how.

My interest was to see if this scheme would really work and how well. I figured that this would be relatively easy to replicate in GNU Radio, so I went to work. The main problem that I had was that we don't focus on IIR filters in GNU Radio. IIR filters provide too much phase distortion and the lack of SIMD versions of the filters plus the fact that FIR filters are easy to pipeline and implement with FFTs means that we get very good performance and filtering just using FIR filters in software. However, for this, I was going to need an IIR filter that takes complex inputs and outputs, which we didn't have. GNU Radio only had a float in and float out IIR filter. So I went in and fixed this. We now have more IIR filter options for dealing with complex data types and taps. Aside from struggling with C++ templates (because we end up having to specialize pretty much everything), this wasn't that hard to do.

I then put together a simple simulation with our OFDM transmitter and receiver blocks. I put the IIR filter on the output of the transmitter and popped open our gr_filter_design tool. The paper doesn't give exact specs for the IIR filters except that they were trying to create a 12-tap filter, but now having the actual specs doesn't exactly matter here. So I designed my filter as an elliptic high pass filter with the end of the stop band at 0.1, the start of the pass band at 0.125, a max loss in the pass band of 0.1 dB, and an out-of-band attenuation of 100 dB. These frequency values are normalized to a sample rate of 1.

The 12-tap filter looks like this in frequency magnitude and phase:

It was the phase response of the IIR filter that first gave me pause as a way to go about filtering OFDM signals since it would distort the phase of the subcarriers throughout. I then realized that the OFDM's equalizer should take care of that, no problem, but I still wanted to test the idea.

The paper puts an IIR filter on both the output of the transmitter and the input of the receiver, both to block transmitted signals in that band to minimize the interference caused to other uses as well as to filter out external signals being generated. I just put this on the output of my transmitter to test the basic concept. Here's the flowgraph I used for testing.

Notice here that I include a fading model as well as our hardware impairments model. Paul Sutton wanted to know what would happen to the filtered signal once it passed through real radio hardware -- would the IIR filter really still have the same effect? Below is the PSD of the signal at three different stages. The blue is the output of the original OFDM signal with 128 subcarriers with subcarriers -10 to +10 turned off. The red line shows the output after the IIR filter, where we can see it notching out those middle 20 subcarriers dramatically. And the magenta is after going through the hardware impairments model doing the minimal amount of signal distortion possible. We can see that even with a near perfect hardware model that the middle subcarriers are no longer as suppressed as they originally were.

So that's really our best case scenario when dealing with a real radio. Next, I turned up the IIP3, IIP2, and phase noise distortions just a bit (we'd have to calculate what "a bit" really means for a real hardware system; right now, we just have ratio values to adjust and play with). This brings the out-of-band (OOB) emissions level back up near to the original. But notice that we are still ahead of the game, at least, and our receiver is receiving the packets just fine.

I then added the channel model with some noise and a Rayleigh fading model. Here we can see that the noise is dominating in this case, but again, my receiver showed that it was still receiving packets.

So conceptually, this is certainly possible and should provide some measure of performance improvement. Given the results shown here, it's not much of a leap to think about the IIR filter being applied to the receiver, which would cause a huge notch in any received signal in those frequencies. So from the point of view of the receiver, we can use this to avoid receiving interference on those subcarriers. With the hardware impairments model, we'd need better translation of the values used to real-world radios. So let's take a look at this with a real radio.

Over-the-Air

I took the same OFDM signal and transmitted it using a USRP N210 with a WBX daughterboard. I'm using a quiet piece of spectrum near me around 900 MHz and kept the transmission power low to avoid any transmission non-linearities. Without the IIR filter, this is what I received using another USRP N210 wth WBX:

Now here we have the same setup only with the added IIR high pass filter:

I have to say that that is much better than expected. We basically brought the signal down to near the noise floor. We have a DC term that's generated in the receiver, but that's to be expected and wouldn't interfere with another signal as is the purpose of this idea.

Finally, becaues of the success above, I decided to put another IIR filter on to the signal, but this time as a low pass filter to get rid fo the high OOB signals on the outside of this spectrum. Again I used the gr_filter_design tool to create an elliptic IIR low pass filter with the end of the pass band at 0.80, the start of the stop band at 0.90, a 0.1 dB max pass band loss, and a 100 dB out of band attenuation. This produced a 9-tap filter that I put in-line with the other filter on the OFDM transmitter. The resulting spectrum provides quite a bit of OOB suppression:

Wrap-up

This was a fun little project, and I was pleased that I could so easily take a paper and reproduce it in GNU Radio to prove the basics. It looks like the idea presented here should provide some good OOB suppression and produce more usable spectrum around otherwise hostile OFDM signals.

The use of the hardware impairments model was a nice way to see how different radio problems could affect the concept here, too. Without accounting for these effects, the simulations are not entirely meaningful, and we can then see how much it will take when building a radio to meet the specifications to cancel out effects of the filtering stages. On the other hand, the WBX daughterboard with an Ettus USRP N210 showed great performance with this signal and the added filters. I was able to free up quite a lot of spectrum by filtering at the transmitter using these radios. Perhaps lesser radios wouldn't have behaved so well, but that's for another time.

To Use or Not to Use FFT Filters

February 27, 2014

I've talked in various presentations about the merits of fast convolution, which we implement in GNU Radio as the fft_filter. When you have enough taps in your filter, and this is architecture dependent, it is computationally cheaper to use the fft_filter over the normal fir_filters. The cross-over point tends to be somewhere between 10 and 30 taps depending on your machine. On my AVX-enabled system, it's down around 10 taps.

However, Sylvain Munaut pointed out decreasing performance of the FFT filters over normal FIR filters when decimating a high rates. The cause was pretty obvious. In the FIR filter, we use a polyphase implementation where we downsample the input before filtering. However, in the FFT filter's overlap-and-save algorithm, we filter the input first and then downsample on the output, which means we're always running the FFT filter at full rate regardless of how much or little data we're actually getting out of it.

GNU Radio also has a pfb_decimator block that works as a down-sampling filter and also does channel selection. Like the FIR filter, this uses the concept of polyphase filtering and has the same efficiencies from that perspective. The difference is that the FIR filter will only give you the baseband channel out while this PFB filter allows us to select any one of the Nyquist zone channels to extract. It does so by multiplying each arm of the filterbank by a complex exponential that constructively sums all of the aliases from our desired channel together while destructively cancelling the rest.

After the discussion about the FIR vs. FFT implementation, I went into the guts of the PFB decimating filter to work on two things. First, the internal filters in the filterbank could be done using either normal FIR filter kernels or FFT filter kernels. Likewise, the complex exponential rotation can be realized by simply multiplying each channel with a complex number and summing the results, or it could be accomplished using an FFT. I wanted to know which implementations were better.

Typically with these things, like the cross-over point in the number of taps between a FIR and FFT filter, there are going to be certain situations where different methods perform better. So I outfitted the PFB decimating filter with the ability to select which fitler and which rotation structures to use. You pass these in as flags to the constructor of the block as:

False, False: FIR filters with complex exponential rotation
True, False: FIR filters with the FFT rotator
False, True: FFT filters with the exponential rotator
True, True: FFT filters with the FFT rotator

This means we get to pick the best combination of methods depending on whatever influences we might have on how each performs. Typically, given an architecture, we'll have to play with this to understand the trade-offs based on the amount of decimation and size of the filters.

I created a script that uses our Performance Counters to give us the total time spent in the work function of each of these filters given the same input data and taps. It runs through a large number of situations for different number of channels (or decimation) and different number of taps per channel (the total filter size is really the taps len times the number of channels). Here I'll show just a handful of results to give an idea what the trade-off space looks like for the given processor I tested on (Intel i7-2620M @ 2.7 GHz, dual core with hyper threading; 8 GB DDR3 RAM). This used GNU Radio 3.7.3 (not released, yet) with GCC 4.8.1 using the build type RelWithDebInfo (release mode for full optimization that also includes debug symbols).

Here are a few select graphs from the data I collected for various numbers of channels and filter sizes. Note that the FFT filter is not always represented. For some reason that I haven't nailed down, yet, the timing information for the FFT filters was bogus for large filters, so I removed it entirely. Yet, I was able to independently test the FFT filters for different situations like those here and they performed fine; not sure why the timing was failing in these tests.

We see that the FIR filters and FFT filters almost always win out, but they are doing far fewer operations. The PFB decimator is going through the rotation stage, so of course it will never be as fast as the normal FIR filter. But within the space of the PFB decimating filters, we see that generally the FFT filter version is better while the selection between the exponential rotator and FFT rotator is not as clear-cut. Sometimes one is better than the other, which I am assuming is due to different performance levels of the FFT for a given number of channels. You can see the full data set here in OpenOffice format.

Filtering and Channelizing

A second script looks at a more interesting scenario where the PFB decimator might be useful over the FIR filter. Here, instead of just taking the baseband channel, we use the ability of the PFB decimator to select any given channel. To duplicate this result, the input to the FIR filter must first be shifted in frequency to baseband the correct channel and then filtered. To do this, we add a signal generator and complex multiply block to handle the frequency shift, so the resulting time value displayed here is the sum of the time spent in each of those blocks. The same is true for the FFT filters.

Finally, we add another block to the experiment. We have a freq_xlating_fir_filter that does the frequency translation, filtering, and decimation all in one block. So we can compare all these methods to see how each stacks up.

What this tells us is that the standard method of down shifting a signal and filtering it is not the optimal choice. However, the best selection of filter technique really depends on the number of channels (e.g., the decimation factor) and the number of taps in the filter. For large channels and taps, the FFT/FFT version of the PFB decimating filter is the best here, but there are times when the frequency xlating filter is really the best choice. Here is the full data set for the channelizing experiments.

One question that came to mind after collecting the data and looking at it is what optimizations FFTW might have in it. I know it does a lot of SIMD optimization, but I also remember a time when the default binary install (via Ubuntu apt-get) did not take advantage of AVX processors. Instead, I would have to recompile FFTW with AVX turned on, which might also make a difference since many of the blocks in GNU Radio use VOLK for SIMD optimization, including AVX that my processor supports. That might change things somewhat. But the important thing to look at here are not absolute numbers but general trends and trying to get a feeling for what's the best for your given scenario and hardware. Because these can change, I provided the scripts in this post so that anyone else can use them to experiment with, too.

DySPAN 2014 Announcements

September 17, 2013

I'm really pleased to once again be serving on the Technical Program Committee for the upcoming IEEE DySPAN conference. They have announced the call for papers, due Nov. 1, and are starting to get the program up and going.

Now, if you've talked to me about DySPAN as a conference, I've been somewhat critical of it in the recent years. Partly, it seems to have lost a bit of relevancy as the regulators have dictated how DSA was going to happen. For a couple of years, this suppressed a lot of innovative though in the conference. I recall that it was either in Singapore or Aachen when any spectrum sensing paper was immediately attacked for being irrelevant because the regulators were going database-driven. To me that felt like it lacked real foresight and creativity.

At the same time, I was also getting upset with the quality of the papers that I was seeing on spectrum sensing. My favorite way to describe the problem is that if you threw in the all sensing papers since the original DySPAN into a pile and pulled one at random, you wouldn't be able to tell which year it was published. Now, this probably isn't actually true, but having been involved with the community since the first DySPAN in 2005, I've seen that it's more true than not. I think the main reason for this problem is that we aren't setting up our work in a way that promotes the scientific process, and so each paper sets itself up as a new, or "novel" as the papers always say, way of doing spectrum sensing. We aren't doing much in the way of comparing techniques, building off other good ideas, or even finding fault in other approaches. It's all about the new novel approach that I would simply title a YASST: yet another spectrum sensing technique.

My other major criticism of the conference is that we haven't quite successfully integrated the policy and technical people like the conference was supposed to. The DySPAN problem is much, much bigger than technical problems, and so the conference was established to address all (or at least many) of the challenge areas. Part of that idea is to allow multiple sides to understand each other and build collaborative, multi-disciplinary approaches (and by multi-disciplinary, I don't mean an engineer working with a mathematician). Instead, I've felt like, save some very specific people that I can think of (and they mostly come out of Dublin), both sides just live in silos with very little mixing.

But, we have a new DySPAN coming up, and I've just read the briefing on the call for papers, posters, demos, and tutorials. The committee has put together the workings of an innovation-driven program. A lot of this I attribute to the very strong team they have put together who are interested and knowledgeable about both the research space and the current technology capabilities and trends. And yes, admittedly, a few of them are good friends, so. But specifically, a lot of these are among those (non-Dubliners, actually) that have an understanding and appreciation for the social, political, and economics of DSA aside from just the technical. So given what I'm seeing, I think we're going to see a really interesting conference with a lot of strong ideas.

Now, having said all of this, I admit to having been part of the problem of the culture of DySPAN in the past. While reviewing papers on the TPC, I've tried to do my part to help foster as strong a program of papers that I could. But on the other hand, when I've attended the conferences, I, like most engineers, went in and only saw the technical presentations with maybe one or two presentations from the policy track. I expect that I'll be going to this DySPAN, and my goal this time will be to focus on those policy tracks and learn as much about that area as possible.