Monday, November 30, 2015

Engineering a World-Class Reed Solomon Algorithm

A lot of what we do at Lambda Prime falls into the category called "Data Science."  However, we do all sorts of work, as long as it is boundary-pushing.  A couple years ago, we were tasked to find a solution for a difficult communication problem.  Today, as Internet of Things Low Power Wide Area Networks (IoT LPWANs) are starting to proliferate, it turns out this is a very relevant piece of technology.

In summary, we needed to develop a way to correct long blocks of errors in wireless communication, caused by bursty, high power interferers transmitting nearby a low data rate, low power receiver.  NASA already solved this with Voyager, but they had a ground station full of 80’s era supercomputers (still impressive), and we had a 16MHz CPU with 4KB of RAM.  We were able to build a Reed Solomon codec (coder+decoder) that ran on the constrained hardware, and in fact could stream decoding up to 1Mbps versus NASA's roughly 115kbps data rate. 

Here's Our Problem

There are high power, but short-duration interfering spikes near the receiver.  The noise floor of our receiver is around -143 dBm.  The RX signal we care about is around -117 dBm, and it can last as long as about 500 ms.  Bursty spikes are -23 dBm, last about 10 ms, and completely wipe-out our RX data.  The chart above is from a controlled test environment, but in practice the bursty spikes are more common, as these bursty spikes come from multiple transmitters and don’t decay to acceptable levels until 1km away from our receiver.  We need a way to repair the data we lose due to interference from the bursty spikes.

First, Some Background on Reed Solomon (RS) Coding

There are many resources online you can use to learn about RS coding.  There are a few basic things to know:
  1. It is an encoding to use for error-correction in data
  2. Unlike most error-correction methods, it can correct long blocks of errors.  Most methods can correct a certain number of bits per window.  RS coding, on the other hand, can correct wide gaps of completely missing (or errant) data.
  3. If you combine RS coding with a convolutional approach (i.e. correct X bits per Y window), it can yield a limit-approaching error correction scheme.

On point 3, there are now superior error correction schemes in the royalty-free domain, like Turbocodes and LDPC codes, but they have two major drawbacks:
  1. They can't easily be made to work with adaptive length data (e.g. variable length packets with a length byte at the front). 
  2. Many IoT LPWAN system-on-chip (SoC) platforms contains Viterbi decoding in the hardware (e.g. SPIRIT1, CC1200, CC1310).  Viterbi decoders do the outer convolutional coding we need to pair with the inner RS coding.

Then, We Did Some Preliminary Homework

So, we did extensive research and determined that we needed to implement an RS codec that could run inside the limitations of the platform, which was 4KB of RAM on a 16 MHz ARM Cortex M3.  We also did extensive research on RS coding itself, and existing implementations.  It turned out that no one had ever gone public with an adaptive length Reed Solomon codec, but we determined it was possible, and we went forward with the idea, leveraging the foundational research that NASA had done in the early 80's.  (This is a great story in itself, how they built an encoder into Voyager without having engineered the ground station decoder until years later.  We built ours in six weeks, but of course we had the advantage of being able to read their papers!)

We ended up going with the same, basic code that Voyager used, except we made ours work with adaptive length packets.  In other words, Voyager works on 255 byte blocks of data, with 223 data bytes and 32 bytes for the RS coding overhead.  Our RS codec can work with any amount of data bytes between 4 and 223, and it will generate a 4 to 32 byte overhead block depending on the data length.

Next, We Worked on the Algorithm Architecture

In the 30 years since RS coding was used in Voyager, a lot of follow-up research has been done on the fastest way to do the decoding.  For these sorts of error-correction algos, decoding is always much harder than encoding.  The fastest known way to do RS decoding is by use of Walsh Transforms [F Didier], which runs in O(N log^2 N) time, but it only works with erasures, not errors (an erasure is an error where the location is known a-priori).  The fastest error correcting code uses a Fast Fourier Transform method [S Guo], but it only really works on a 65535 word block, using 17 bit words.  It has O((N log^2 N) loglog N) performance, but, as mentioned, the block sizing is inconvenient from a data perspective, and it requires way too much RAM for our platform.  So, we were set with the task of taking the traditional, slow, Berlekamp-Massey algorithm -- typically pegged at O(N^3) [H Minsky] -- and doing the necessary engineering to make it fast.

Finally, We Built It

We used the Walsh Transform implementation from Frederic Didier as a benchmark.  On the Core i7 laptop used for testing, it would achieve about 220MB/s decoder throughput for erasure correction.  We wanted to get half way there for error correction, using an algorithm known to be slower.  There's a decent piece of reference code for traditional RS coding here [H Minsky], it does about 1.3 MB/s on reference hardware, so it was also a benchmark on the low side.  At the end of the design phase, we ended up building an algorithm that merged all three pieces of RS error correction -- Berlekamp-Massey error detection, Omega-Poly data correction, and Chien's Search error replacement -- into a single, in-place, monolithic block of code.  We also extricated the syndrome computation element and implemented it as a pre-cursor to the formal error correction.  In practice, this shaves-off time because the Walsh and FFT methods can't do any computations while the data is being actively received (they must wait until the whole packet is received).

*estimated: based on theoretical comparison with known benchmarks

In this table, the “worst case” throughput assumes a 256 (actually 255) byte packet, and the “typical IoT case” assumes a 48 byte packet.  In most internet of things use-cases, packets are quite short, often less than 48 bytes.  Our hybrid, adaptive algorithm behaves differently depending on the packet length (N) and the RS overhead (K), unlike the other algorithms, so in the chart below it is plotted as three coding rates (N, K settings) that relate to the rates used in practice.

Implementation Benchmarks (16 MHz ARM Cortex M3)

Worst-case throughput: 1 Mbps
Heap RAM: 272 bytes
Stack RAM: < 32 bytes

The communication system that benefits, here, uses only around 2.4kbps, but having a 
really fast decoder allows less turnaround between reception of request and transmission of response (This adds about 2 ms, worst case).  Short turnaround makes networking more efficient, and more devices can be installed per cell.  More devices per cell, in turn, allows less
infrastructure, and infrastructure is expensive.

Here is Our Syndrome Decoder

There's no way we're going to show you the monolithic error correction algorithm we built, but here's an idea about the kind of optimization we do by default at Lambda Prime.  You can notice a few things:
  • The syndrome decoder is adaptive; it can work with 4 to 32 byte correction blocks.  
  • It is unrolled.
  • The algorithm works on byte boundaries, but the decoder works on 32 bit words because our ARM ALU is 32 bits.  Note the exploitation of 32 bit Multiply Accumulate, which is a fast instruction on the target CPU (Hint: it starts with the byte spreading by multiplication with 0x01010101)