Understanding Cell Phone Technology: Lossless Data Compression
Claude Shannon proved that at least one code exists that allows data to be sent across a communications channel at rates up to the channel capacity C, assuming the entropy of the source is less than C. Unfortunately, his proof doesn’t tell us how to construct those codes.
This week we’re going to begin talking about the first kind of code, the source code. These codes compress the overall amount of data to be sent in smart ways. They squeeze unneeded bits out of the message so we can transmit large amounts of data (measured in bits per second) across the capacity-limited channels of our cell phones.
Compression algorithms come in two main flavors. Today we’re going to talk about “lossless” compression, which squeezes out only redundant information. Next week we will talk about “lossy” compression, which removes not only the redundant information but also some additional “least significant” information.
Morse code was the world’s first example of lossless data compression for electrical communications. When we first met Morse code, we learned that it was designed to reduce the number of dots and dashes that had to be sent. It did this by assigning shorter codes to higher probability letters. For example, a single Morse dot was assigned to E, the most frequently occurring letter. Now that we understand a little about information theory, we can say that Morse code uses the entropy of the English alphabet to reduce the amount of data that has to be transmitted.
Another example of such entropy-based lossless coding is Huffman coding. Huffman codes are special. They are “prefix” or “instantaneous” codes, meaning that no codeword is a prefix of any other codeword. It can be shown that Huffman codes are the optimal (that is, of shortest expected codeword length) prefix code for a given distribution.
Let’s look at how to create a Huffman code. Suppose my friend and I devise a secret code for communicating. We only use 5 symbols in this code – the letters A, B, C, D, and E. Just like in the case of Morse code, I can do a frequency count on the letters that appear in the messages I send and receive. Suppose when I do this, I find that the probability of an A occurring is 0.25, of a B 0.25, of a C 0.2, of a D 0.15, and of an E 0.15.
To design my Huffman code for compression, I write down each symbol and its associated probability in the first two columns of my table below. As I work across each successive column, I combine the two lowest probabilities, assign a bit value of (0) to the first and (1) to the second, and write the sum of the probabilities on the higher line in the next column. For example, the 0.15 + 0.15 in column 2 becomes 0.3 in column 3. I continue this process until I’m left with a single entry of 1.0. Then the codeword assigned to each symbol is found by reading off the bit values in parentheses, working backwards (right to left) across the table. (A short code sketch after the table carries out the same merging procedure.)
Symbol   column 2   column 3   column 4   column 5   column 6   codeword
A        0.25       0.25       0.25 (0)   0.55 (0)   1.0        00
B        0.25       0.25 (0)   0.45       0.45 (1)              10
C        0.2        0.2 (1)                                     11
D        0.15 (0)   0.3        0.3 (1)                          010
E        0.15 (1)                                               011
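If you’d rather let a computer do the bookkeeping, here is a minimal Python sketch of the same merging procedure: repeatedly combine the two least likely entries, prepending a 0 to the codewords in one group and a 1 to the codewords in the other. It uses only the standard heapq module; the function name huffman_code and the dictionary bookkeeping are illustrative choices, not a standard recipe.

```python
import heapq

def huffman_code(probabilities):
    """Build a Huffman code by repeatedly merging the two least likely entries."""
    # Each heap entry is (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)   # lowest probability  -> gets a 0
        p1, _, group1 = heapq.heappop(heap)   # next lowest         -> gets a 1
        merged = {sym: "0" + code for sym, code in group0.items()}
        merged.update({sym: "1" + code for sym, code in group1.items()})
        counter += 1                          # unique tie-breaker so only numbers get compared
        heapq.heappush(heap, (p0 + p1, counter, merged))
    return heap[0][2]

probs = {"A": 0.25, "B": 0.25, "C": 0.2, "D": 0.15, "E": 0.15}
for symbol, codeword in sorted(huffman_code(probs).items()):
    print(symbol, codeword)
```

Because the tie between the two 0.25 probabilities can be broken either way, the sketch may print different bit patterns than the table above, but the codeword lengths come out the same: 2 bits each for A, B, and C, and 3 bits each for D and E. That averages 2.3 bits per symbol, just a shade above the source entropy of about 2.29 bits, which is the best any lossless code can do on average.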
OK, that wasn’t so hard. Now let’s suppose we don’t know the probabilities of the input values ahead of time. Is it still possible to compress the source information?
Yes, it is. There are things called “universal” codes designed for just this situation. A very common lossless universal code is the Lempel-Ziv (LZ) code, which shows up in all sorts of data compression schemes. LZ variants are used in the Unix compress and gzip commands, and in image formats like GIF and TIFF.
How does it work? Suppose we consider the string 1011010100010. The first step is to parse the string into the shortest substrings that have not occurred before, and to give each one an index. The codeword sent for each substring is then the index of the previously seen prefix, followed by the single new bit added to it. So we can build the code on the fly as follows:
Parsed string    begin    1        0        11       01       010      00       10
Index            0        1        2        3        4        5        6        7
Binary index     0        001      010      011      100      101      110      111
Codeword                  ( ,1)    (0,0)    (01,1)   (10,1)   (100,0)  (010,0)  (001,0)
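Here is a minimal Python sketch of that parsing-and-pointer bookkeeping, following the scheme in the table: the pointer sent with phrase number i uses just enough bits to address any of the i entries already in the list. The function name lz_encode is just an illustrative choice.

```python
from math import ceil, log2

def lz_encode(bits):
    """Parse the input into never-before-seen phrases and emit (pointer, new bit) codewords."""
    dictionary = {"": 0}      # phrase -> index; index 0 is the empty 'begin' phrase
    phrase = ""
    codewords = []
    for bit in bits:
        phrase += bit
        if phrase not in dictionary:              # a brand-new phrase ends here
            index = len(dictionary)               # this is phrase number i
            pointer_bits = ceil(log2(index)) if index > 1 else 0
            prefix_index = dictionary[phrase[:-1]]
            pointer = format(prefix_index, "b").zfill(pointer_bits) if pointer_bits else ""
            codewords.append(pointer + bit)       # pointer to the old prefix, then the new bit
            dictionary[phrase] = index
            phrase = ""
    return codewords

codewords = lz_encode("1011010100010")
print(codewords)           # ['1', '00', '011', '101', '1000', '0100', '0010']
print("".join(codewords))  # 100011101100001000010
```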
Therefore, the encoded string ends up looking like all the codewords run together: 100011101100001000010. At first glance there seems to be no way to tell where one codeword ends and the next begins, but there is: the pointer in codeword number i always occupies a known number of bits (just enough to point at any of the i earlier entries), so the string breaks apart unambiguously into the codewords 1/00/011/101/1000/0100/0010.
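Here is a matching decoder sketch that uses exactly that rule to split the stream and rebuild the phrase list as it goes (lz_decode is again just an illustrative name). Because every bit of the original comes back out, the scheme really is lossless:

```python
from math import ceil, log2

def lz_decode(encoded):
    """Split the run-together bit stream back into codewords and rebuild the phrases."""
    phrases = [""]    # index 0 is the empty 'begin' phrase
    output = []
    pos = 0
    i = 1             # number of the codeword we are about to read
    while pos < len(encoded):
        pointer_bits = ceil(log2(i)) if i > 1 else 0
        pointer = encoded[pos:pos + pointer_bits]
        new_bit = encoded[pos + pointer_bits]
        phrase = phrases[int(pointer, 2) if pointer else 0] + new_bit
        phrases.append(phrase)
        output.append(phrase)
        pos += pointer_bits + 1
        i += 1
    return "".join(output)

print(lz_decode("100011101100001000010"))  # 1011010100010 -- the original string
```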
It may look like this coding scheme ends up adding lots of extra bits. But using some advanced math, it can actually be shown that the LZ code can asymptotically (that is, for long data inputs) compress the information down to the entropy of the source.
Morse code, Huffman codes, and the LZ codes are examples of lossless data compression. They squeeze out only the redundant information so the receiver can completely reconstruct the original message. Next week, we will meet some lossy data compression schemes, which end up actually throwing away some of the information content before transmission!