Understanding Cell Phone Technology: Data Compression and Error Correction
Last week, we looked at noise on a communications channel. Why? Because such noise is always going to be present, and system designers have to compensate for it if the technology has any hope of working. How do they do that?
Let’s begin by remembering that any message contains information. If every bit of the information were essential to understanding the message, we would be unable to remove any content without destroying its meaning.
But a lot of messages contain redundant information. For example, the English language is redundant. We can make sense of a short English passage even if all the vowels are missing. Think of it like a game of hangman. Given the consonants and spaces between words, it’s pretty easy to fill in the vowels correctly, or at least most of them. So we can compress the information in English text, and transmit the message more compactly:
S w cn cmprss th nfrmtn n nglsh txt nd trnsmt th mssg mr cmpctly
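(If you want to try this yourself, here is a tiny Python sketch of the vowel-stripping trick. It is purely an illustration of redundancy, not a real compression scheme, and the sentence used is just the one above.)

```python
def strip_vowels(text: str) -> str:
    """Drop the vowels, keeping consonants, spaces, and punctuation."""
    return "".join(ch for ch in text if ch.lower() not in "aeiou")

print(strip_vowels("So we can compress the information in English text"))
# -> S w cn cmprss th nfrmtn n nglsh txt
```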
Noise adds undesirable information to the message – information we’d like to remove. So here’s the idea: we’re going to squeeze the redundant part of the good information out of the original message, and add back in some extra information which will help us detect and correct the bad information introduced by noise.
The squeezing out is accomplished with data compression. This is known as source coding: encoding the source information in a clever way so that the essential content of the message is expressed compactly. We already saw an example of this with Morse code, where the relative frequencies of the letters A–Z (which determine the entropy of the source) were used to assign the shortest codewords to the most common letters, yielding an efficient code for transmitting information.
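To make the notion of entropy a little more concrete, here is a short Python sketch that estimates the entropy of an English-like source from letter frequencies. The frequencies below are rough, commonly quoted approximations, and lumping the remaining letters into one bucket is a further simplification, so treat the numbers as illustrative only.

```python
import math

# Approximate frequencies of the most common English letters; the rest
# are lumped together as a single "other" symbol for simplicity.
freq = {"E": 0.127, "T": 0.091, "A": 0.082, "O": 0.075, "I": 0.070,
        "N": 0.067, "S": 0.063, "H": 0.061, "R": 0.060}
probabilities = list(freq.values()) + [1.0 - sum(freq.values())]

# Shannon entropy: H = -sum(p * log2(p))
entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
print(f"Entropy of this simplified source: {entropy:.2f} bits per symbol")
# Well under log2(26) ~ 4.7 bits, which is why giving the most common
# letters the shortest codewords (as Morse code does) pays off.
```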
Adding the extra information back in is accomplished with error-correcting codes. This is known as channel coding, because it compensates for the noise the channel introduces.
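For a first taste of what an error-correcting code looks like, here is the simplest possible example in Python: a 3-bit repetition code. Each bit is sent three times and the receiver takes a majority vote, so any single flipped bit per triple is corrected. Real cellular codes are far more sophisticated; this is only a sketch of the idea.

```python
def encode(bits):
    """Repeat each bit three times."""
    return [b for b in bits for _ in range(3)]

def decode(coded):
    """Majority-vote each group of three received bits."""
    triples = [coded[i:i + 3] for i in range(0, len(coded), 3)]
    return [1 if sum(t) >= 2 else 0 for t in triples]

message = [1, 0, 1, 1]
sent = encode(message)      # [1,1,1, 0,0,0, 1,1,1, 1,1,1]
received = sent[:]
received[4] = 1             # noise flips one bit in transit
print(decode(received))     # [1, 0, 1, 1] -- the error has been corrected
```

Notice the price: we tripled the number of bits sent. Better codes achieve the same protection with far less added redundancy.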
These are all ideas from information theory, a complex subject invented by the mathematician Claude Shannon, whom we’ve talked about previously. Shannon’s original 1948 paper is a classic, but mathematically dense. We’ll just look at its highlights.
Recall that any message is made up of a sequence of symbols from an alphabet. These symbols can be anything: bits, groups of bits, letters of the English alphabet, and so on. A “codeword” is a group of symbols (we usually think of them as bits or groups of bits) that is intended to represent another symbol or set of symbols. For example, we already saw that in Morse code, the codeword for the letter E is a single dot. Shannon then noted that:
1. A codeword does not necessarily have to represent only one symbol from the source alphabet. For example, we could come up with a single compact codeword for the letter pair “TH”, which occurs frequently in English.
2. In data compression (source coding), the expected length of a codeword is at least as large as the entropy of the source.
3. Using error correction (channel coding), information can be sent reliably over a channel at any data rate up to the channel capacity, with an arbitrarily small probability of error in transmission. Shannon further proved that at least one error-correcting code exists to accomplish this. (A concrete capacity calculation for a simple channel is sketched just after this list.)
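To put a number on point 3, here is a short Python sketch for one simple channel, the binary symmetric channel, which flips each transmitted bit with probability p. (This particular channel is just a convenient example; Shannon’s theorem applies to channels in general.) Its capacity is given by the well-known formula C = 1 − H(p), where H is the binary entropy function.

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """Capacity of a binary symmetric channel with flip probability p."""
    return 1.0 - binary_entropy(p)

for p in (0.0, 0.01, 0.1, 0.5):
    print(f"flip probability {p}: capacity {bsc_capacity(p):.3f} bits per use")
# A noiseless channel (p = 0) carries a full bit per use; at p = 0.5 the
# output is pure noise and the capacity drops to zero.
```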
These ideas can be combined to show that the two-stage process of source coding and channel coding is as efficient as any other method that could be used to transmit information over a noisy channel. Furthermore, this theory demonstrates that such a source-channel code exists to transmit information reliably if the entropy of the source is less than the channel capacity.
So much for the theory. But that leaves the tricky part of “devising [the] appropriate coding schemes” to achieve data rates approaching channel capacity. This is the art and science of source and channel coding, which we will begin discussing next week.