Data Compression
Copyright 1994,
Jack G. Ganssle
Abstract
Transmission
bandwidth is always limited (hey... if you're reading this over a 28.8kb link,
you get the picture!). Data compression can help a lot.
Published in
Embedded Systems Programming in two installments, February and March 1994
Senator Joe Biden
found himself in hot water for allegedly plagiarizing material during college.
The military academies regularly erupt in cheating scandals. I'm trying to
teach my six year old to do creative work in school without peering over someone
else's shoulder.
Yet, my engineering
motto is to steal constantly. I surreptitiously glance around and furtively
rip pages from the latest issues of ESP, EDN, or any of a hundred other sources.
All go in my algorithm collection. (Jack Crenshaw's articles in this magazine
are great - just last week I pulled his bit on CRCs to help a friend). I buy
books, and now CD ROMs, about obscure topics, tossing them on a shelf in case
someday I'll need to refer to them. Once or twice a year I browse the University
of Maryland's Computer Science and Math libraries, looking for neat techniques,
ideas, and methods.
As a result my
algorithm collection grows a little every month; perhaps haphazardly, perhaps
with little clear direction; but always it expands. Someday I'll have to do
a decent job of categorizing the material, but in the meantime it is the source
of many of the algorithms I use in my work. Why re-invent a FFT or floating
point routine when others have developed proven, reliable methods? Steal shamelessly!
While recently
looking for a compression routine I was shocked to find that almost none of
the dozens of articles in the collection contained any flowcharts, pseudo-code,
or code snippets. All plunge quickly into a discussion of entropy, Shannon's
information theories, and the like. The closest most come to providing something
useful is a deep discussion of creating binary trees which depict coding schemes.
Hey - I don't about you, but I don't have the time or patience to figure out
some academician's theories. I need code now!
Since many embedded
systems transmit and store data, compression is an important issue that many
designers ignore, perhaps because of the paucity of useful information. Here
I'll present a couple of quick and dirty techniques that may not provide the
best possible results, but that can be used immediately. Next month I'll follow
up with more discussion and code details. Options
Until recently
computer science had interest only in lossless compression. A lossless scheme
transmits the data perfectly; there is no degradation due to the compression/decompression
process.
Some sorts of
data can tolerate quite a bit of corruption, though. Video, in particular,
just does not require the same sort of bit-by-bit perfection a computer program
does. Lossy compression reduces the data's bandwidth in part by allowing redundant
or non-essential information to be lost.
Under the lossy
category the most fascinating approach (to me) is to reduce the image to a
series of fractals - floating point numbers that represent the complexity
of a shape, exploiting hidden regularities in the image. Often compression
ratios of thousands to one result. Though there have been some attempts to
commercialize this, the process of extracting the fractal nature of a picture
is so complex that compression times are still very high.
Lossy compression
is even today being used by Bell Atlantic to transmit full motion video over
conventional copper phone lines in Virginia. I haven't been able to find out
how they are making this happen, but the thought of a wide bandwidth link
into the homes, without rewiring the entire United States for fiber optics,
is pretty cool.
Lossy compression
is an entire field in itself that is now being studied heavily. However, my
interest lies more in lossless techniques that let me transmit a text stream
or program code intact.
All data compression
schemes exploit some pattern or characteristic of the data, so no one method
is ideal for all situations. For example, fax machines transmit mostly white
images; the bit stream is therefore compressed by sending only the differences
between scan lines.
Lots of data
sources are very repetitive in nature. Most pictures, for instance, contain
long strings of identical data. Often, computer- transmitted text has vast
amounts of white space (sequences of 0x20 characters). Run Length Encoding
(RLE) compresses data streams by simply noting the presence of identical character
sequences and replacing each sequence with the repeated character and a number
indicating the required number of repetitions.
For example,
the string "sssttttooo0o" could be compressed to something like
"3s4t5o", where the numbers indicate the number of repetitions for
each character.
RLE can actually
reduce the information-carrying capacity of a channel when the data is very
busy. A picture of static may have few repeating sequences. One solution,
in a byte-based system, is to reserve bit 7 of each byte as a flag bit indicating
"send this character though without a repeat count". Any character
with bit 7=0 will be a repeat count, always followed by the single character
to be repeated. Of course, this means that the repeat count cannot exceed
127.
The rules for
encoding this sort of RLE code are thus:
If a character(i)
is not the same as character (i+1), set bit 7 and transmit it. Otherwise,
count the number of identical characters (to a limit of 127). Drop the repeat
count followed by the character into the transmission stream.
Pseudo-code for
the encoding process is:

RLE encoding
is easy to implement and, in many cases, very efficient. See "Run-Length
Encoding Slashes Image-File Storage Requirements" by Richard Quinn, Personal
Engineering & Instrumentation New, September 1990 for C code and more
discussion of the subject.
EDN ran a more
comprehensive article in the June 21, 1990 issue: "Compression Algorithms
Reduce Digitized Images to Manageable Size", by William Warner. Highly
recommended for dealing with more complex image and transmission types. Huffman
Coding
As I mentioned,
all data compression exploits some known characteristic of the data stream.
In English text, for example, the letter "e" occurs much more often
than "z", yet ASCII assigns exactly the same number of bits (7)
to the two characters.
In 1952 D.A.
Huffman proposed using varying length codes to express characters. The resulting
code is used by most text compression systems, including the Windows Help
compiler. Again, since "e" is so common, an ideal Huffman code would
assign it just a couple of bits, giving the rare letter "z" many
more.
Huffman codes
allocate the shortest codes to the most likely characters, and the longest
to those that are infrequent. The bad news: data transmitted so encoded will
not be aligned on byte boundaries, since most characters will have far fewer
than the standard 8 bits. To avoid ambiguity, each code must have the "prefix"
property. That is, no code can occur as the first part of another, longer
code. Thus, if the letter "e" were encoded as "01", then
no other code could start with this sequence. This does allow the decoder
to properly align on a character, but also somewhat lengthens the minimum
code length.
Figure 1 gives
a set of Huffman codes for the alphabet. Note that commonly occurring letters
have fewer bits than those less common.
The compression
process uses a so-called "dictionary lookup", nothing more than
getting the character's code and size from a lookup table, and then dropping
them into the transmission stream. Pseudo code might look like the following,
assuming the codes exist in a structure <huffman>, which has
elements <code> (left justified code as in figure 1), and <size>,
which is the number of bits in the code.
WHILE (there are more characters)
{
code=huffman[character].code ; get huffman code
size=huffman[character].size ; get code size
i=0
WHILE(i<size)
{sendbit(code) ; send MSB of code
shift code left 1 bit ; select next bit
i=i+1 ; increment # bits sent
} ENDWHILE
} ENDWHILE
Decoding the
compressed transmission is a bit more complex, simply because the character's
size isn't known until a match occurs. Most approaches break the Huffman code
into a binary tree, depicted in a table with forward links to speed the searching.
While the code is not terribly complex, building the table for efficient searching
requires a bit of explanation, which I'll provide in next month's installment.
For information
about Huffman coding, consult "An Introduction to Data Compression"
by Harold Corbin (April, 1981 Byte), or "Data Compression", by Gilbert
Held, John Wiley & Sons, New York, 1983.
There are many
Huffman codes. Just because the average English-language tome contains some
particular distribution of letters does not mean that all transmitted English
data will be the same. Vast quantities of C code, for example, will show an
unusually high preponderance of curly braces. Though many Huffman users ignore
this and encode their data with the standard English-language probability
distributions, it is sometimes possible to greatly improve the compression
efficiency by computing new codes for characters based on the text block being
transmitted. This is called Adaptive Compression. It requires more compute
time to figure the codes, and uses more overhead in that the code table must
be transmitted along with the compressed data, but can result in substantially
better performance.
Finally, it's
fascinating to watch a child learn to read. A young one starts by clearly
recognizing each letter, only laboriously assembling each into a word. The
same sort of high-entropy effect occurs in Huffman encoding when we assign
short codes to characters, but ignore the fact that English abounds with frequently-used
phrases and words. In English text, "the", "to", and a
number of other words are so common it makes sense to extend the compression
beyond traditional character boundaries and assign a short code to entire
common words. The decoder will recognize the word as a single entity with
relatively few bits, just as a skilled reader recognizes words, ignoring details
of the letters. In fact, few programmers bother as the increased computation
time may not justify the results. Data Compression - Part 2
Last month I
introduced several schemes that compress text and binary data transmitted
over a communications channel. All of the methods I covered were for lossless
compression, which preserves every little nuance of the raw data. Lossy compression
is a whole 'nother field, which it's hard to say much about unless one is
willing to define exactly how much lack of fidelity is tolerable.
Huffman codes
are an easy and commonly used method to convert a string of data to tokens,
each of whose length is inversely proportional to the frequency of the encoded
character. For example, to transmit Huffman-encoded standard English text,
the token corresponding to the letter "e" uses very few bits, as
"e" is the most common character in the alphabet. The letter "z"
will surely take many more bits to send, though will not impact the communications
bandwidth much as it occurs so infrequently.
Although I presented
a frequency distribution table for standard English text last month, most
compression schemes dynamically compute the frequency of characters from each
data set. Surely you can see that transmitted C code, for instance, will have
an unusually high number of curly brackets compared to the epic Beowulf. Even
short segments of standard English text may have skewed distributions due
to the material's subject matter or the writer's skills.
It's easiest
to visualize how the Huffman algorithm translates each character to a set
of bits by using a binary tree. Indeed, most compression techniques use trees
to simplify the code and to aid in building recursive algorithms.
The figure is
a portion of a typical binary tree used to build the compression dictionary
(the list of codes for each character). Each character is depicted by the
branches one descends to reach the character at the bottom. In this simplified
tree the letter "e" uses only a single bit, a zero, for its encoding.
"t" is 10; "m" is 11.
root node
|
|
__0__________1_________________
| |
| ____0_________1__________________
E | |
| |
T M
Huffman compression
consists of the following steps:
1) Scan the input
data and develop an array of the frequency of occurrence of each character.
If you are compressing ASCII or other byte- wide data, the code looks something
like:
Repeat Till All Data Scanned
{
lCount[input_char]++
}
The lCount array
will have 256 elements, one per possible entry in the input character set.
Be sure to use a data structure that will tolerate the maximum number of possible
counts per character in the data stream. I like longs, as it's awfully hard
to overflow them on any realistic data stream.
It's a good idea
to then reduce the array of longs to an array of bytes. Most likely you'll
be handling lots of input data; it's foolish to do massive numbers of long
compares when byte versions will suffice. Since there are only 256 possible
characters, you can simply convert the array of counts to an array of relative
frequency, 0 meaning the character was never seen and 255 meaning it is the
most frequency.
2) Build the
dictionary tree. I could use tree terminology here and most likely confuse
at least half of you, Gentle Readers, so will try more of an illustrative
approach so you can really see how the tree is built. The steps are as follows,
and result in the tree shown in figure 2.
2.1) Arrange
the character set in an array ordered by decreasing frequency. Include the
frequency (i.e., the count of the number of times each character occurred).
2.2) Starting
at the bottom, "connect" the two characters with the lowest frequency.
Add up the counts for each and place these at the intersecting lines.
2.3) Continue
combining groups of two until all have merged into one.
2.4) Assign a
zero to one side of each nodal point and a one to the other (i.e., zero for
left branches and one for right branches).
The tree is all
we need to compress input data, but it is cumbersome to use. Worse, we've
built an upside-down tree, with the "input data" (the characters)
at the leaf, or bottom, nodes. Since we'll presumably compress an input array
that is much bigger than this small tree, it makes sense to add another step
to build a simpler data structure. We really want an array of Huffman codes
(with the associate number of bits per code), with one entry per character.
In other words, we can traverse the tree once per possible character and log
the resulting codes to an array HuffCode. HuffCode has 256 entries, and is
a structure that includes two elements per entry: the code itself, and an
integer indicating the number of bits used in each code.
3) Send the tree
to the receiver so that it can be used to decompress the data. Though you
can simply transmit the entire tree structure, some people reduce it to bare
elements to conserve transmission bandwidth. Much of the tree could be empty,
depending on the input stream, in which case it's silly to add the extra null
nodes when the whole point of the exercise is to reduce data!
4) Compress each
data byte. Now that the dictionary array has been built (HuffCode), transmit
data a bit at a time as follows:
WHILE (there are more characters)
{
code=HuffCode[character].code ; get huffman code
size=HuffCode[character].size ; get code size
i=size
WHILE(i)
{sendbit(code) ; send MSB of code
I<<1 ; select next bit
i-- ; increment # bits sent
} ENDWHILE
} ENDWHILE
Huffman Expansion
The receiver
must have a copy of the tree compression tree so it can decode the bit stream
properly. I suggested above that you transmit more or less the entire tree.
Some folks send the linear dictionary array (HuffCode) instead, but the code
needed to decompress a byte using only HuffCode is slow and bulky.
Expansion using
a tree is a more natural approach, particularly since you have no idea where
the character boundaries are. A stream of rapid fire bits, perhaps millions
of them, will come shooting down the transmission line with no clear demarcation
between characters. The tree captures both the characaters codes and the number
of bits needed to represent each one.
This makes decompressions
just a matter of traversing the tree. For each input bit descend the tree,
branching to the left if the input bit is a zero, or to the right branch for
a one. You'll know the expansion is complete when you hit a leaf node of the
tree. The code to decompress a single character looks like this:
TreeNode=RootNode
Repeat Until at a leaf
{
TreeNode=Tree.right ; assume right child branch
if (inputbit==0) TreeNode=Tree.left ; if 0, traverse left child branch
}
Output(TreeNode);
It's a good idea
to add code to print the intermediate data structures (e.g., the tree) for
debugging purposes. Lempel/Ziv
Why compress
single characters when often text and binary data will contain long identical
sequences of phrases? In English, the word "and" crops up constantly,
yet always burns up four ASCII characters (including the leading space): 32
bits for a word rarely unused in a paragraph.
Most of the familiar
compression utilities like PKZIP and LHARC attempt to find common phrases
rather than single characters. All of these algorithms trace their roots to
work done by Lempel and Ziv in 1977 and later.
The LZ77 approach,
Lempel and Ziv's first shot at a new method of data compression, tries to
compress a file by building a dictionary of entire phrases used before in
the data being compressed. It maintains a sliding window, a long string (several
thousand bytes) of data that has been recently processed, as well as a short
"look ahead buffer". Every time it finds a phrase in the look-ahead
buffer that matches a portion of the text in the sliding window, it emits
a pointer to the corresponding phrase in the sliding window, as well as a
number indicating how many characters matched, and what the first non-matching
character was. The Lempel/Ziv philosophy is to assume that the current data
string most likely looks a lot like the what was most recently processed.
The sliding window
becomes a dictionary whose contents are always changing as it slides over
the input data stream. The compressed output is a series of numbers telling
the decoder where to look in the sliding window for the compressed phrase.
The fact that the window slides, and is not stationary, keeps search times
reasonable.
The LZ77 algorithm
and its more recent versions can sometimes produce tremendous compression
ratios for embedded instrumentation transmitting more or less the same sorts
of data over and over. Given some a priori knowledge of your instrument's
output, it may be possible to construct an LZ sliding window and look-ahead
buffer size that greatly exploits the repeating nature of the data.
Be wary! Several
variants of the LZ algorithms have been patented, and commercial use of these
may require the payment of non-trivial royalties. It seems appalling to have
to say "consult your attorney before writing code", but perhaps
there is no other choice. For this reason alone I stay with the standard Huffman
techniques, even though not as efficient as the LZ algorithms. Resources
You can play
with all of these techniques on a PC for non-commercial purposes using the
code in "The Data Compression Book", by Mark Nelson (M&T
Books, San Mateo, CA, 1992 800-533-4372 (in CA 800-356-2002)). It comes with
a pair of disks that contain C source code for the Huffman, LZ77, and many
other compression/expansion schemes. I highly recommend it.
Also, refer to
the May, 1986 Byte for more information about handling the Huffman trees.
See "Data Compression with Huffman Coding", by Jonathan Amsterdam,
pgs 99-108.
Figure 1: A Set
of Huffman Codes
e 100
t 001
a 1111
o 1110
n 1100
r 1011
i 1010
s 0110
h 0101
d 11011
l 01111
f 01001
c 01000
m 00011
u 00010
g 00001
y 00000
p 110101
w 011101
b 011100
v 1101001
k 110100011
x 110100001
j 110100000
q 1101000101
z 1101000100