Creating a ''One to One'' Adaptive Huffman Compression/Decompression Program for the real world of 8-bit byte files

by David A. Scott

Introduction

Huffman compression [3] is one of the main compression routines used in the world, but very little thought has been given to how to end the compression stream so that the compressed file is a multiple of 8 bits; something special must be done in order to write out the compressed file to the real world of 8-bit byte oriented files. What are the common methods? What would be an improvement on such a method? What is the motivation, and is there a way to measure such improvement?

Fortunately this is a doable task, and the approach raised in this article could lead others to extend this type of research into other compression methods, such as arithmetic coding or even those based on the Burrows-Wheeler Transform.

Background

Huffman compression is a method whereby one tries to reduce the size of a file so it occupies less space. As a simple example of Huffman compression, assume one has a file made up of 4 symbols: 'A', 'B', 'C', and 'D'. These four symbols could be represented by 00, 01, 10, 11 respectively. At first this may seem like the best representation of such a file, but if one knows the ratio of occurrence of the symbols in the file and wishes to rewrite the file so that it is shorter, while still assigning each symbol its own fixed pattern of bits, one may do better. As a matter of fact, that better assignment is called Huffman coding.

The key to picking these patterns is the building of the Huffman table. The concept is quite clever. You start at the top and draw 2 arrows, one to the right and one to the left. At the tip of each arrow you either draw 2 more arrows or place a symbol. When you are ready to code a symbol you start at the top and decide whether to use a 1 or a 0 for the left arrow; it makes no difference which one you pick as long as you use both. One simple way to do this for a static table is to count all the occurrences of each symbol. Example: for a file that contains only 5 character types, A occurs 3 times, B occurs 3 times, C occurs 2 times, D occurs 1 time, and E occurs 1 time. First we order them by decreasing count, from left to right:
A 3, B 3, C 2, D 1, E 1
now we combine the last 2 symbols, assigning a 0 to one of them and a 1 to the other, then add together the counts of the last two symbols:
A 3, B 3, C 2, (D=0, E=1) 2 or alternatively A 3, B 3, C 2, (D=1, E=0) 2
next reorder the remaining items by decreasing weight. I am only using the first set; the 0 or 1 assignment is not unique, and if 2 symbols have the same weight either order is OK.
A 3, B 3, C 2, (D=0, E=1) 2
now combine the last 2 symbols again, assigning a 0 to one of them and a 1 to the other, then add the counts of the last two symbols. Note that when tacking on the new value we add the bit to the left side of the string:
A 3, B 3, (C=0, D=10, E=11) 4, note the new bits added to the left side of the symbols' codes.
now reorder again
(C=0, D=10, E=11) 4, A 3, B 3
combine the last 2 symbols
(C=0, D=10, E=11) 4, (A=0, B=1) 6
since only 2 are left, there is no need to reorder, just combine:
(C=00, D=010, E=011, A=10, B=11) 10
We now have one of many possible Huffman codings for the symbols used above.
In summary A=10, B=11, C=00, D=010, E=011
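For the curious, here is a minimal sketch in Python of the table construction just walked through, using a heap of partial trees. The tie-breaking, and therefore the exact 0/1 patterns, may differ from the hand-worked choices above, but any table it produces is an equally valid Huffman coding:

    import heapq
    from itertools import count

    def huffman_codes(freqs):
        """Build a Huffman code from a {symbol: count} map by repeatedly
        merging the two lightest subtrees, prepending a 0 to the codes
        on one side and a 1 to the codes on the other."""
        tiebreak = count()  # keeps tuple comparison away from the dicts
        heap = [(f, next(tiebreak), {s: ""}) for s, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f0, _, left = heapq.heappop(heap)
            f1, _, right = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in left.items()}
            merged.update({s: "1" + c for s, c in right.items()})
            heapq.heappush(heap, (f0 + f1, next(tiebreak), merged))
        return heap[0][2]

    print(huffman_codes({"A": 3, "B": 3, "C": 2, "D": 1, "E": 1}))
    # one possible result: A and B two bits each, C two bits, D and E three bits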
With the above coding, any file consisting of any sequence of 1 and 0 bits will decode into the symbols A, B, C, D, E. The only possible trouble occurs if the file ends early: in the example above, if the last symbol has been decompressed and you have only one bit left, what does a person do, since there is no symbol in the table above that is exactly one bit long?

If the ratios of the four symbols 'A', 'B', 'C', and 'D' are 4:2:1:1, then assigning the values 'A' = 1, 'B' = 01, 'C' = 001, and 'D' = 000 would be a better assignment than the previous two-bit values.

One can easily tell by observation that if one made a set of Huffman values for 256 equally common symbols, the coding for every symbol would be 8 bits. This is apparent since 2 to the eighth power is 256. But what are the characteristics when the counts are not all equal?

Note that for a file that has this 4:2:1:1 ratio, the two-bit assignment would make the average file of 8 characters 16 bits long: 8 characters times 2 bits for each character. But when we use an optimal Huffman code for these 8 characters we get 14 bits. The 'A' appears an average of 4 times and is 1 bit in length, so it contributes a total of 4 bits; the 'B' appears 2 times and is 2 bits in length, providing 4 bits toward the total; and the remaining two characters, 'C' and 'D', each provide 3 bits, for a total of 6 bits. Then 4 (from the A) plus 4 (from the B) plus 6 (from C and D) gives us the total length of 14 bits.
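As a quick check of that arithmetic, a few lines of Python; the counts and codes are the ones from the 4:2:1:1 example above:

    freqs = {"A": 4, "B": 2, "C": 1, "D": 1}
    codes = {"A": "1", "B": "01", "C": "001", "D": "000"}
    total = sum(freqs[s] * len(codes[s]) for s in freqs)
    print(total)  # 14 bits, versus 8 * 2 = 16 for the fixed two-bit coding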

But there is still a problem: after one compresses a file, if the user wants to decompress it at a later time, he must have a table telling which pattern of bits was used for each symbol.

This can be solved either by saving the table somewhere else or by adding the table to the file. The drawback of saving the table elsewhere is that you have to remember where you saved it; the drawback of adding it to the file is that storing the complete table takes a lot of space.

Fortunately there is another solution to this problem: Adaptive Huffman compression. Adaptive Huffman compression is sometimes worse than static Huffman compression, since static Huffman is optimal among codes that assign one fixed pattern of bits to each symbol. Adaptive Huffman compression, in effect, creates a new table each time it reads a symbol from the input file. Since it does not use a single fixed pattern for each symbol over the whole file, it is not constrained by the limitations of static Huffman compression, and so it can sometimes use less space than a static Huffman compressor. But the real advantage of Adaptive Huffman compression is that it does not need a table to be sent with the file. The problems now are how to add the information of a new symbol into the table, and how to end the file in a reasonable way if the compressed data does not fill out the last byte.

One to One Property

In many mathematical settings it is nice to have a mapping from one form to another such that anything in one set maps uniquely to an item in the other set, and every item in that second set maps back to an item in the first set. A function with this property is called 'one to one'. Most people routinely dismiss this when it comes to compressors because the counting theorem says that no compressor can compress all files in such a way that it makes them all smaller. This is a very powerful theorem, but it can be misused, and it causes people to miss the fact that one can build a "one to one" compressor. I will now show you how to do that.

Compression is a Two-way Street

Many people tend to think of compression only from the point of view of how one makes the compressor work. Yet there is another side to it: how does one make an uncompressor work? It really is just looking at the other side of the same coin.

One popular way to do adaptive Huffman compression is to start with no tree (symbol table) at all. When the first symbol comes in, it is sent to the output raw, and a starting tree (symbol table) consisting of ''first symbol''=1 and an ''undefined symbol''=0 is initiated. When the second symbol is read, the next output token is either the 1 or the 0 bit, zero meaning the undefined symbol; the compression routine then inserts the next ''new symbol'' into the output file by writing it out raw, and changes the current state of the Huffman table so that ''first symbol''=1, ''second symbol''=01, and now ''undefined symbol''=00. This process goes on to the end of the file, and sometimes a special symbol is used to mark the end of file.
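As an illustration only, here is a toy sketch in Python of that escape mechanism. It keeps the comb-shaped tree just described, where the i-th symbol seen gets a code of i zeros followed by a 1 and the undefined symbol is all zeros, and it skips the frequency-driven reshaping that a real adaptive Huffman coder performs; the names and the fixed tree shape are assumptions for illustration, not the author's code:

    def escape_style_compress(data):
        """Toy coder for the 'start with no tree' scheme: a known byte
        gets the comb code of i zeros and a 1, where i is its order of
        first appearance; an unknown byte is escaped with the current
        all-zero ''undefined symbol'' code followed by its raw 8 bits."""
        seen = []                # symbols in order of first appearance
        out = []
        for byte in data:
            if byte in seen:
                out.append("0" * seen.index(byte) + "1")
            else:
                out.append("0" * len(seen))      # empty for the very first byte
                out.append(format(byte, "08b"))  # the new symbol, written raw
                seen.append(byte)
        return "".join(out)

    # 'A' raw, then escape '0' plus 'B' raw, then '1' for the repeated 'A':
    print(escape_style_compress(b"ABA"))  # 01000001 0 01000010 1 (no spaces)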

I reject this approach because it cannot lead to 'one to one' compression/decompression [2] routines. If one looks at the decompression program: should the bit pattern following the 'undefined symbol' match a symbol that is already in use, it leads to the inconsistency that the compressed file was compressed wrong, since the pattern following the 'undefined symbol' should only contain patterns that are not used yet. There is a very simple way out of this problem as we head toward our goal of a 'one to one' compression routine: start with a filled Huffman tree, where each possible 8-bit character maps to a complete 8-bit code. Then, when the first symbol comes in, you write it out and update the tree so that if a repeat occurs you have a 1 for that symbol. Note that the other 255 symbols now start with a 0 bit followed by at most 8 more bits. However, one of the symbols now has only 7 bits following the 0 bit, so you are already saving space over the previous method that is in widespread use.
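By contrast, here is a minimal sketch of the 'start with a filled tree' idea, with one assumption to flag: it adapts by keeping a count for every byte value (all counts start at 1, so all 256 starting codes are 8 bits) and naively rebuilds the whole Huffman table after each byte. Real adaptive coders such as the FGK and Vitter algorithms update the tree incrementally, and the author's own update rule promotes a just-seen symbol to a 1-bit code faster than simple counts would, so this shows the shape of the approach rather than his exact method:

    import heapq
    from itertools import count

    def rebuild_codes(weights):
        """Ordinary Huffman construction over all 256 byte values."""
        tiebreak = count()
        heap = [(w, next(tiebreak), {s: ""}) for s, w in enumerate(weights)]
        heapq.heapify(heap)
        while len(heap) > 1:
            w0, _, a = heapq.heappop(heap)
            w1, _, b = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in a.items()}
            merged.update({s: "1" + c for s, c in b.items()})
            heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
        return heap[0][2]

    def adaptive_compress(data):
        weights = [1] * 256        # a filled tree: all 256 leaves exist,
        out = []                   # so every starting code is 8 bits long
        for byte in data:
            out.append(rebuild_codes(weights)[byte])
            weights[byte] += 1     # adapt after coding, so a decompressor
        return "".join(out)        # can make the same update and stay in step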

Actually, both methods described above are in use. It is just that the second, though thought to be more efficient for compression, is harder to code, and when one makes a new tree there is more work to be done; but that is what computers are for.

The File Ending Problem

At this point let's assume our quest is the 'one to one' compression routine, so we are using the method of a full tree to start with, meaning that from the very beginning we start out with 8 bits for each symbol and there are always 256 symbols defined. The question is how one ends the file so that it is 'one to one'. The fact is most people don't care about this goal, since, I think, it is felt to be unattainable because of the common misconception about the 'counting theorem'. Most people end the file with some sort of marker to indicate where the file ends; this can be accomplished by adding a field to the last byte that tells where compression stopped in the preceding byte, or by simply adding a byte count of the file somewhere in a last set of fixed bytes. Those methods fail the 'one to one' property I am striving for.

One can easily create an example that fails to uncompress to a valid file; it is very easy to see this failure when one tries to uncompress a random string. The uncompression routine gets to the last byte and has not yet come up with a valid symbol based on the current Huffman table. It does no good to have a set of bytes that tells where the file ends if, when decompressing the random file, there is no token there. If one defines a special meaning, then one must make sure that the matching compression routine actually creates such files.

The Solution

The solution [1] is to create a set of rules, looking from the point of view of the uncompressing routine, that one can use to make the compressor work. Since the decompression routine is always looking at 8-bit-byte input files, and the Huffman decompression scheme will seldom end exactly on the filling of the last byte, here is where one must look. To understand the solution of this problem one needs to understand how the Huffman tree is built and what features of it we can use. There is more than one solution, and at my site I provide ''one to one'' adaptive Huffman compression methods that use different rules.
What follows is a set of rules that guarantees ''one to one'' compression for a Huffman file where one uses a table with all 256 leaves:

Fact one: The Huffman tree is such that the all-zero token is at least 8 bits long.
Fact two: The Huffman tree is such that the shortest path from any internal node to a leaf never contains a string of more than 8 ones next to each other.
Fact three: A one-byte Huffman file has no hidden bits and decompresses to one byte, since the starting tree has 256 leaves, all at a length of 8 bits.

Rule one: To decompress, if the tokens end in such a way that the file being decompressed ends on a full byte, you are done and there is no problem.
Example: ...1 11110000, positions labeled 0 to 7. The token S = 10000 starting in position 3 is the last one decoded; decompression has ended, and the last symbol out is an "S".
Rule two: If the last Huffman token does not start in the last 7 bits of the last byte of the compressed file, and yet at the end of the byte you are at an internal node of the Huffman tree, you supply the missing ones, since at most 8 ones were chopped off during the compression of that file.
Example: ...1 11110000. The token E = 111100001111 was the last token used in making the compressed file and started at bit 0 of the last byte, so the last symbol out of the decompression is "E"; the trailing ones were chopped off.
Rule three: If the last Huffman token does start in the last 7 bits of the last byte and ends at an internal node of the tree, but the portion of the token in the last byte of the compressed file contains at least one 1 bit, then, as in rule two, you supply the hidden ones that were chopped off the file.
Example: ...1 11110000. Position 3 is the start of the last token, and in this case X = 100001, so the S used in the example for rule one could not have existed in the current Huffman tree; the token did not end at the file end, the hidden ones are supplied, and the last symbol out is "X".
Rule four: (hardest rule of all) If the last token starts in the last 7 bits of the last byte and contributes only zeros to that byte, you assume that there are hidden ones that were chopped off during compression, but only if the file that is one token shorter than this one would have had hidden 1 bits cut off during its compression. This rule means that an all-zero start of the last token in the last 7 bits can depend on what would have happened to the next shorter file, and that file's status can be a function of the file even further back. It was a lack of focus on my part that made me miss this recursive rule for a while, and I kept getting more and more complicated special cases. I hope this solves it.
Example: ...1 11110000. M = 000011 starts in position 4, but the previous token was F = 11111, so if the file had been shorter by one token, a cutoff would have occurred; therefore the M (the last apparent token) is real, and it is the last character out.
Rule five: The only case left is where a token starts somewhere in the last 7 positions of the last byte and contributes only zero bits there, and the file ending on the previous token would not have had ones cut off during compression. In this case, and only this case, the 1 to 7 trailing zeros, which are not a complete symbol, mark the end of the file and are not used to form a new character for the output file.
Example: ...1 11110000. M = 0011 starts in position 6, but the existence of this M is not important, since F = 1100 is the previous token and no cutoff would have occurred; so the 00 at the end tells us that the F was actually the last character in the uncompressed file.
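To make the compressor's side of rules one through three concrete, here is a hedged sketch in Python; the names are mine, and it assumes the compressor tracks where the final token begins. It emits the token stream and, if the stream does not end on a byte boundary, chops the trailing 1 bits back to the boundary so the decompressor can restore them by walking the 1-children of the tree. The zero-padding cases of rules four and five need the recursive look-back just described, so this sketch deliberately leaves them unimplemented:

    def finish_stream(bits, last_start):
        """bits: the concatenated Huffman tokens as a '0'/'1' string.
        last_start: index in bits where the final token begins.
        Covers rules one to three; rules four and five are left out."""
        if len(bits) % 8 == 0:
            return pack(bits)                    # rule one: byte aligned
        boundary = len(bits) // 8 * 8            # edge of the last full byte
        if set(bits[boundary:]) == {"1"}:        # only 1 bits stick out
            starts_early = last_start < boundary - 7       # rule two
            has_a_one = "1" in bits[last_start:boundary]   # rule three
            if starts_early or has_a_one:
                return pack(bits[:boundary])     # chop the trailing ones
        raise NotImplementedError("zero padding needs the rule four/five check")

    def pack(bits):
        """Pack a byte-aligned '0'/'1' string into bytes."""
        return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))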

Future Directions

I have several versions of ''one to one'' adaptive Huffman compressors at my site. What I would like to do is expand this to other compressors; for example, I am writing a ''one to one'' run-length-encoding adaptive Huffman compression program at my site. The following is a description of the methods used.

One can write a ''one to one'' Huffman type of compression routine for any code that uses 9 to 512 symbol types. Since 8-bit ASCII uses at most 256 values, one can assign the other leaves to control-type information. I can show you by example how this works. Think of the uppercase letters below as standing for any 8-bit symbol. As soon as you compress a token, replace that token's leaf with 2 leaves; this makes the tree contain 257 leaves. In other words, you compress a byte just as in my old method, but you create 2 new symbols. If the character is not repeated, those go away and two new symbols appear for the next character. But if the character is repeated, one of those special symbols is used, and then those special symbols disappear and twice as many reappear in the tree. This keeps happening until a regular symbol is found or the file ends. This may not seem 'one to one', since there is a limit of 512 symbols in the tree; well, then let's only double until there are 128 such symbols, and after that the total number stays the same.
You may ask why the special symbols are used to compress; here is why. Take the uncompressed file "ABCDBBCCCDDDDDDDDEGGG":
the number of leaves is 256 when the A is compressed.
then it is 257 when the B is compressed, since the A on the tree was replaced with A0 and A1; that is, the leaf with the A is relabeled A0 and another leaf called A1 is added. When the C is compressed the old leaves disappear: A0, A1 and B are replaced by the leaves A, B0 and B1.
when the D is read, B0, B1, and D are replaced by B, D0 and D1.
when the B is read, D0, D1, and B are replaced by D, B0 and B1.
now the second B of the run is read, but there is no leaf for B; since there is only one extra B, I use the B0 token for it and replace B0 and B1 with the tokens B00, B01, B10, B11. These will not be used in the compression, since the next character is a C.
the 3 C's are coded as C C1, the 8 D's as D D1 D00, the single E as E, and the 3 G's as G G1.
so the compressed file from a token point of view is "A B C D B B0 C C1 D D1 D00 E G G1"
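That token stream can be checked mechanically. Here is a hedged sketch of the run-to-token mapping; the function names and the bit-string representation are mine for illustration, following the run numbering worked out in the list below, not the author's code:

    import itertools

    def run_tokens(symbol, n):
        """Tokens coding a run of n copies of symbol, following the
        numbering in the list below, e.g. ('D', 8) -> D, D1, D00."""
        starts = [1]              # smallest run needing k tokens: 1, 2, 4, 12, 76, ...
        while True:
            k = len(starts)
            nxt = starts[-1] + 2 ** (k * (k - 1) // 2)
            if nxt > n:
                break
            starts.append(nxt)
        k = len(starts)
        width = k * (k - 1) // 2  # total suffix bits carried by k tokens
        offset = format(n - starts[-1], "0%db" % width) if width else ""
        tokens, pos = [symbol], 0
        for size in range(1, k):  # suffixes are 1, 2, ..., k-1 bits long
            tokens.append(symbol + offset[pos:pos + size])
            pos += size
        return tokens

    stream = []
    for sym, group in itertools.groupby("ABCDBBCCCDDDDDDDDEGGG"):
        stream.extend(run_tokens(sym, len(list(group))))
    print(" ".join(stream))       # A B C D B B0 C C1 D D1 D00 E G G1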
note that these are the tokens, and they will not all be of the same length, since each token's length is whatever its path through the tree happens to be. Now let's look at decompression: any file can be decompressed, since there is no stream of tokens derived from the bit stream that does not make sense.
"A A" can not appear since once A used it is no longer a leaf in the tree.
likewise "A A0 A11 A0" is not vaild since it is not a leaf in the tree, but if the last token was A011 it would have been valid
the pattern of tokens tells how many times the symbol is repeated in the uncompressed text. Example:
A means the symbol A appeared only once, and the characters that follow are not A's.
A A0 means 2 A's in sequence in the file, and the next character, if it exists, is not an A.
A A1 means 3 A's in sequence in the file, and the next character, if it exists, is not an A.
A A0 A00 means only 4 in sequence
A A0 A01 means 5
A A0 A10 means 6
A A0 A11 means 7
A A1 A00 means 8
A A1 A01 means 9
A A1 A10 means 10
A A1 A11 means 11 = 3 + 2*4
A A0 A00 A000 means 12
...
A A1 A11 A111 means 75 = 11 + 2*4*8
...
A A1 A11 A111 A1111 means 1099 = 75 + 2*4*8*16 and so on.
The above was constructed from the point of view that when decompression occurs there is always a valid path, so that ''one to one'' compression can take place.
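To see how systematic the numbering above is, here is a small decoding sketch; the function and its representation are mine for illustration. With k tokens the suffixes carry 1 + 2 + ... + (k - 1) bits, so exactly 2^(k(k-1)/2) run lengths take k tokens each, which is where the products 2*4, 2*4*8, ... come from (and why the five-token entry is 75 + 1024 = 1099):

    def run_length(suffixes):
        """suffixes: the bit suffixes of the run tokens after the first,
        e.g. ['1', '11'] for the token stream 'A A1 A11'."""
        k = len(suffixes) + 1                      # total token count
        # first run length needing exactly k tokens: 1, 2, 4, 12, 76, ...
        start = 1 + sum(2 ** (j * (j - 1) // 2) for j in range(1, k))
        offset = int("".join(suffixes), 2) if suffixes else 0
        return start + offset

    assert run_length([]) == 1                     # "A"
    assert run_length(["0"]) == 2                  # "A A0"
    assert run_length(["1"]) == 3                  # "A A1"
    assert run_length(["1", "11"]) == 11           # "A A1 A11"
    assert run_length(["1", "11", "111"]) == 75    # "A A1 A11 A111"
    assert run_length(["1", "11", "111", "1111"]) == 1099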

References

1
David A. Scott, THE ''ONE TO ONE'' COMPRESSION PAGE, the page where I start the ''one to one'' compression discussion.

2
David A. Scott, files from the internet:
Non-''one to one'' adaptive Huffman compression
''one to one'' adaptive Huffman compression
alternate ''one to one'' adaptive compression
reduced-symbol ''one to one'' adaptive compression

3
The web page www.ics.uci.edu/~dan is a good site to start from to find information on the web about compression.

4
David A. Scott is not a writer, and this is his first attempt at writing an article of this type. I graduated from Arizona State University with a BSEE in Electrical Engineering with a major in Fields and Waves; this was the closest thing to a math degree without all the English requirements. My main career was writing and debugging algorithms for the Navy at China Lake, CA, from 1970 to 1996. While at China Lake I was sent to the University of Southern California in 1972 to receive my MSEE in Electrical Engineering; the major field of study was control theory. I am currently retired and living in El Paso, Texas; in the near future I may pursue more education in Juárez to become a medical doctor.
David A. Scott can be contacted at david_a_scott@email.com; be sure to use underscores on both sides of the a in the email address, and do not use spaces.


Copyright 1999 David A. Scott