Huffman Compression
-------------------
Written for the PC-GPE and the World by Joe Koss (Shades)
Contact me on irc in #coders
Introduction
------------
Huffman Compression, also known as Huffman Encoding, is one of many
compression techniques in use today. Others are LZW, Arithmetic Encoding, RLE
and many many more. One of the main benefits of Huffman Compression is how
easy it is to understand and impliment yet still gets a decent compression
ratio on average files.
Many thanks to Al Stevens for his Feb. 1991 article in Dr. Dobb's Programmers
Technical Journal which helped me greatly in understanding Huffman Compression.
(Al Stevens's C source code for Huffman Compression is available from DDJ and
can also be found on various FTP sites with DDJ source collections.)
Overview
--------
The Huffman compression algorythm assumes data files consist of some byte
values that occur more frequently than other byte values in the same file.
This is very true for text files and most raw gfx images, as well as EXE and
COM file code segments.
By analyzing, the algorythm builds a "Frequency Table" for each byte value
within a file. With the frequency table the algorythm can then build the
"Huffman Tree" from the frequency table. The purpose of the tree is to
associate each byte value with a bit string of variable length. The more
frequently used characters get shorter bit strings, while the less frequent
characters get longer bit strings. Thusly the data file may be compressed.
To compress the file, the Huffman algorythm reads the file a second time,
converting each byte value into the bit string assigned to it by the Huffman
Tree and then writing the bit string to a new file.
The decompression routine reverses the process by reading in the stored
frequency table (presumably stored in the compressed file as a header) that
was used in compressing the file. With the frequency table the decompressor
can then re-build the Huffman Tree, and from that, extrapolate all the bit
strings stored in the compressed file to their original byte value form.
The Frequency Table
-------------------
The Huffman algorythms first job is to convert a data file into a frequency
table. As an example, our data file might contain the text (exluding the
quotation marks): "this is a test"
The frequency table will tell you the frequency of each character in the file,
in this case the frequency table will look like this:
Character | Frequency
---------------------
t | 3 times
s | 3 ..
<space> | 3 ..
i | 2 ..
e | 1 ..
h | 1 ..
a | 1 ..
The Huffman Tree
----------------
The huffman algorythm then builds the Huffman Tree using the frequency table.
The tree structure containes nodes, each of which contains a character, its
frequency, a pointer to a parent node, and pointers to the left and right child
nodes. The tree can contain entries for all 256 possible characters and all 255
possible parent nodes. At first there are no parent nodes. The tree grows by
making successive passes through the existing nodes. Each pass searches for two
nodes *that have not grown a parent node* and that have the two lowest
frequency counts. When the algorythm finds those two nodes, it allocates a new
node, assigns it as the parrent of the two nodes, and gives the new node a
frequency count that is the sum of the two child nodes. The next iterations
ignores those two child nodes but includes the new parent node. The passes
continue until only one node with no parent remains. That node will be the root
node of the tree. The tree for our example text will look like this:
t ----[3]---------------------\
|
s ----[3]-\ |
-[6]--------------- ------\
Discuss this article in the forums
See Also: © 1999-2011 Gamedev.net. All rights reserved. Terms of Use Privacy Policy
|