Fortunately this is a doable task, and the approach raised in this article could lead others to extend this type of research into other compression methods such as arithmetic coding, or even the Burrows-Wheeler Transform.
The key to picking these patterns is the building of the
Huffman table. The concept is quite clever. You start at the top
and draw 2 arrows, one to the right and one to the left. At the tip
of each arrow you either draw 2 more arrows or place a symbol.
When you are ready to code a symbol you start at the top and
decide whether to use a 1 or a 0 for the left arrow; it makes no
difference which one you pick as long as you use both. One simple way
to do this for a static table is to count all the occurrences of each symbol.
Example: a file that contains only 5 character types.
A occurs 3 times, B occurs 3 times, C occurs 2 times, D occurs 1 time, and
E occurs 1 time.
First we order them by decreasing number of occurrences:
A 3, B 3, C 2, D 1, E 1
now we combine the last 2 symbols, assigning a 0 to one of them and a 1 to the other,
then add the weights of the last two symbols together:
A 3, B 3, C 2, (D=0, E=1) 2 or equivalently A 3, B 3, C 2, (D=1, E=0) 2
next reorder the weights of the remaining items in decreasing order. I am
only using the first set, but the 0 or 1 assignment is not unique, and
if 2 symbols have the same weight either order is OK.
A 3, B 3, C 2, (D=0, E=1) 2
now combine the last 2 symbols again, assigning a 0 to one of them and a 1 to the other,
then add the weights of the last two symbols. Note that when tacking on
the new value we add the bit to the left side of the string:
A 3, B 3, (C=0, D=10, E=11) 4; note the new bits added to the left side of the symbols.
now reorder again
(C=0, D=10, E=11) 4, A 3, B 3
combine the last 2 symbols
(C=0, D=10, E=11) 4, (A=0, B=1) 6
since only 2 are left, there is no need to reorder, just combine:
(C=00, D=010, E=011, A=10, B=11) 10
We now have one of many possible Huffman codings for the symbols used above.
In summary A=10, B=11, C=00, D=010, E=011
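The merge-and-reorder procedure above is easy to mechanize. Below is a minimal Python sketch of it (the function name is my own); because the 0/1 assignment and tie-breaking at each step are arbitrary, the exact bit patterns may differ from the table above, but the code lengths, and therefore the compressed size, come out the same.

```python
# One of many valid Huffman codes: repeatedly sort groups by decreasing
# weight, merge the two lightest, and prefix a 0 to one side and a 1 to
# the other. The bit patterns depend on arbitrary tie-breaks, but the
# code lengths are forced by the weights.

def build_huffman(freqs):
    # Each group is (weight, {symbol: code_so_far}).
    groups = [(w, {sym: ""}) for sym, w in freqs.items()]
    while len(groups) > 1:
        groups.sort(key=lambda g: -g[0])   # decreasing weight
        w1, codes1 = groups.pop()          # the two lightest groups
        w2, codes2 = groups.pop()
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        groups.append((w1 + w2, merged))   # combined weight
    return groups[0][1]

codes = build_huffman({"A": 3, "B": 3, "C": 2, "D": 1, "E": 1})
# A, B, and C get 2-bit codes while D and E get 3-bit codes, matching
# the lengths of A=10, B=11, C=00, D=010, E=011 above.
```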
With the above coding, any file consisting of the bits 1 and 0 in any order will
decode into the symbols A, B, C, D, and E. The only possible trouble occurs
when decoding if the file ends early: for the example above, if the last symbol
has been decompressed and you have only one bit left, what does a person do,
since no symbol in the table above is exactly one bit long?
If the ratios of the four symbols 'A', 'B', 'C', and 'D' are 4:2:1:1, then assigning the values 'A' = 1, 'B' = 01, 'C' = 001, and 'D' = 000 would be a better assignment than a flat two-bit code.
One can easily tell by observation that if one made a set of Huffman values for a set of 256 equally weighted symbols, the coding for every symbol would be 8 bits. This is apparent since 2 to the eighth power is 256. But what happens when the weights are not equal?
Note that for a file with this 4:2:1:1 ratio, the two-bit assignment would make the average file of 8 characters 16 bits long: 8 characters times 2 bits per character. But when we use an optimal Huffman code for these 8 characters we get 14 bits. The 'A' appears an average of 4 times and is 1 bit in length, contributing a total of 4 bits; the 'B' appears 2 times and is 2 bits in length, providing 4 bits toward the total; and the remaining two characters, 'C' and 'D', each provide 3 bits, for a total of 6 bits. Then 4 (from the A) plus 4 (from the B) plus 6 (from C and D) gives us the total length of 14 bits.
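The arithmetic above can be checked directly; the total is just the sum of occurrences times code length for each symbol:

```python
# Average 8-character file with symbol ratio A:B:C:D = 4:2:1:1.
occurrences = {"A": 4, "B": 2, "C": 1, "D": 1}

# Flat two-bit code (e.g. A=00, B=01, C=10, D=11) versus the
# Huffman code A=1, B=01, C=001, D=000 from the text.
flat_len = {s: 2 for s in occurrences}
huffman_len = {"A": 1, "B": 2, "C": 3, "D": 3}

flat_bits = sum(occurrences[s] * flat_len[s] for s in occurrences)        # 16
huffman_bits = sum(occurrences[s] * huffman_len[s] for s in occurrences)  # 14
```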
But there is still a problem: after one compresses a file, a user who wants to decompress it at a later time must have a table that tells which pattern of bits was used for each symbol.
This can be solved either by saving the table somewhere else or by adding the table to the file. The drawback of saving the table is that you have to remember where you saved it; the drawback of adding it to the file is that the complete table takes a lot of space.
Fortunately there is another solution to this problem: adaptive Huffman compression. Adaptive Huffman compression is sometimes worse than static Huffman compression, since static Huffman is optimal among fixed codings of each symbol. Adaptive Huffman compression effectively builds a new table each time it reads a symbol from the input file; since it is not using a fixed code for each symbol, it is not constrained by the limitations of static Huffman compression and can therefore sometimes use less space than a static Huffman compressor. But the real advantage of adaptive Huffman compression is that no table needs to be sent with the file. The problems now are how to add the information of a new symbol into the table, and how to end the file in a reasonable way if the file does not end on a complete last byte.
In many mathematical functions it is nice to have a mapping from one form to another such that anything in one set maps uniquely to an item in the other set, and every item in that second set maps back to an item in the first set. A function with this property is called 'one to one'. Most people routinely dismiss this when it comes to compressors because the counting theorem says that no compressor can make all files smaller. This is a very powerful theorem, but it can be misused, and it causes people to miss the fact that one can still build a 'one to one' compressor. I will now show you how to do that.
Many people tend to think of compression only from the point of view of how one makes the compressor work. Yet there is another side of the coin: how does one make an uncompressor work? It really is just looking at the other side of the same coin.
One popular way to do adaptive Huffman compression is to start with no tree (symbol table) at all. When the first symbol comes in, it is sent to the output and a starting tree consisting of ''first symbol'' = 1 and an ''undefined symbol'' = 0 is initiated. Then, when the second symbol is read, the next output token is either the 1 or the 0 bit, zero meaning undefined symbol; the compression routine then inserts the new symbol into the output file by writing it out, and changes the current state of the Huffman table so that ''first symbol'' = 1, ''second symbol'' = 01, and the ''undefined symbol'' = 00. This process goes on to the end of the file, and sometimes a special symbol is used to mark the end of file.
I reject this approach because it cannot lead to 'one to one' compression/decompression routines [2]. If one looks at the uncompression program, and the bit pattern following the 'undefined symbol' matches a symbol that is already in use, one reaches the inconsistency that the compressed file was compressed wrong, since that position should only contain patterns that are not yet used. There is a very simple way out of this problem as we head toward our goal of a 'one to one' compression routine: start with a filled Huffman tree (each possible 8-bit character maps to a complete 8-bit code). Then, when the first symbol comes in, you write it out and update the tree so that a repeat of that symbol is coded as a single 1. Note that the other 255 symbols now start with a 0 bit followed by at most 8 more bits; however, one of those symbols now has only 7 bits following the 0 bit, so you are already saving space over the previous method that is in widespread use.
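To make the no-table property concrete, here is a deliberately naive Python sketch of the full-tree idea: every byte value is present from the start, so no symbol table ever needs to be transmitted, and each symbol read makes repeats of it cheaper. Note the assumptions: this rebuilds the code from frequency counts after every symbol, rather than performing the text's specific promote-to-a-single-1 update, and a real adaptive coder would update the tree incrementally instead of rebuilding it.

```python
import heapq

def rebuild(counts):
    # Standard Huffman construction; the tiebreak field keeps the heap
    # from ever comparing the (non-comparable) code dictionaries.
    heap = [(c, sym, {sym: ""}) for sym, c in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, t1, m1 = heapq.heappop(heap)
        c2, t2, m2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in m1.items()}
        merged.update({s: "1" + code for s, code in m2.items()})
        heapq.heappush(heap, (c1 + c2, min(t1, t2), merged))
    return heap[0][2]

def adaptive_encode(data):
    counts = {b: 1 for b in range(256)}  # full starting tree: 256 leaves
    bits = []
    for b in data:
        bits.append(rebuild(counts)[b])  # code under the current tree
        counts[b] += 1                   # repeats earn shorter codes
    return "".join(bits)
```

A matching decoder can mirror the same count updates, which is exactly why no table has to travel with the file.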
Actually both methods described above are in use. It is just that the second, though thought to be more efficient for compression, is harder to code, and when one makes a new tree there is more work to be done; but that is what computers are for.
At this point let's assume our quest is the 'one to one' compression routine, so we are using the method of a full tree to start with; meaning that from the very beginning we start out with 8 bits for each symbol and there are always 256 symbols defined. The question is how one ends the file so that it is 'one to one'. The fact is most people don't care about this goal since, I think, it is felt to be unattainable because of people's misconception about the 'counting theorem'. Most people end the file with some sort of marker to indicate where the file ends; this can be accomplished by adding a field to the last byte to tell where compression stopped in the preceding byte, or by flatly adding a byte count of the file somewhere in the last set of fixed bytes. Those methods fail the 'one to one' property I am striving for.
One can easily create an example that fails to uncompress to a valid file; it is very easy to see this failure when one tries to uncompress a random string. The uncompression routine gets to the last byte and has not yet come up with a valid symbol based on the current Huffman table. It does no good to have a set of bytes that tell where the file ends if, when decompressing the random file, there is no token there. If one defines a special meaning, then one must make sure that the matching compression routine creates such files.
The solution [1] is to create a set of
rules, written from the point of view
of the uncompressing routine, that one can use to make the compressor work.
Since the decompression routine is always looking at 8-bit-byte input files and the
Huffman decompression scheme will seldom end exactly on the filling of the last byte,
this is where one must look. To understand the solution of this problem one
needs to understand how the Huffman tree is built and what features we can use
in it. There is more than one solution, and at my site I generate ''one to one''
adaptive Huffman compression methods that use different rules.
What follows is a set of rules that guarantee ''one to one'' compression
in a Huffman file where one uses a table of all 256 leaves:
Fact one: The Huffman tree is such that the all-zero token is at least 8 bits long.
Fact two: The Huffman tree is such that the shortest path from any internal node
to a leaf never contains a string of more than 8 ones in a row.
Fact three: A one-byte Huffman file has no hidden bits and decompresses
to one byte, since the starting tree has 256 leaves, all at a depth of 8 bits.
Rule one: To decompress, if the tokens end in such a way that the file being decompressed
ends on a full byte, you are done and there is no problem.
Example: ...1 11110000, positions labeled 0 to 7. If the token S = 10000, starting
in position 3, is the last one decoded, you have ended decompression; the last symbol out is an "S".
Rule two: If the last Huffman token does not start in the last 7 bits of the last
byte of the compressed file, and yet at the end of the byte you are at an internal
node of the Huffman tree, you supply the missing ones, since at most
8 ones were chopped off during the compression of that file.
Example: ...1 11110000. The token E = 111100001111 was the last token used
in making the compressed file and started on bit 0 of the last byte of the file, so the
last symbol out of the decompression is "E"; the trailing ones were chopped off.
Rule three: If the last Huffman token does start in the last 7 bits of the
last byte and ends at an internal node of the tree, but the portion of the
token in the last byte of the compressed file contains at least one
1 bit, then, as in rule two, you supply the hidden ones that
were chopped off the file.
Example: ...1 11110000. Position 3 is the start of the last token, and in this
case X = 100001, so the symbol used in the example for
rule one could not have existed in the current
Huffman tree; the token did not end at file end, so the
hidden ones are supplied and the last
symbol out is "X".
Rule four: (hardest rule of all) If the last token starts in
the last 7 bits of the last byte and contains only zeros in that byte,
you assume that there are hidden ones
that were chopped off during compression, but only if the file that is
one token shorter than this one would have had hidden 1 bits cut off
during its compression.
This rule means that the all-zero start of the last token in the last 7 bits
can depend on what would have happened to the next shorter file,
and that file's status can be a function of data even further back in the file.
It was a lack of focus on my part that made me miss this
recursive rule for a while, and I kept getting more and more complicated
special cases. I hope this solves it.
Example: ...1 11110000. M = 000011 starts in position 4, but the previous token
was F = 11111, so a file shorter by one token would have had ones cut off
during its compression; therefore the M (the last apparent token) is
used, and it is the last character out.
Rule five: The only case left is the case where a token starts somewhere
in the last 7 positions of the last byte and contains all zero
bits in those positions, and a file that
ended on the previous token would not have had ones cut off during
compression. In this case, and only this case, the 1 to 7 trailing
zeros, which are not a complete symbol, mark the end of file and
are not used to form a new character for the output file.
Example: ...1 11110000. M = 0011 starts in position 6, but the
existence of this M is not important, since F = 1100 is the previous token
and no cutoff would have occurred, so the 00 at the end
tells us that the F was actually the last character
in the uncompressed file.
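The decoder side of rules one and two, the common cases, can be sketched in a few lines of Python. The code table here is made up purely for illustration rather than being the real adaptive 256-leaf tree, and rules three through five additionally need the recursive look-back described above, which this fragment does not attempt.

```python
# Walk a prefix code over a bit string; if the input runs out at an
# internal node (mid-token), supply the trailing 1 bits that the
# compressor chopped off, at most 8 of them (fact two).
TOY_CODES = {"0": "X", "10000": "S", "111100001111": "E"}  # made-up table

def decode_with_hidden_ones(bits, codes, max_pad=8):
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in codes:                 # a complete token: emit it
            out.append(codes[cur])
            cur = ""
    pad = 0
    while cur and cur not in codes and pad < max_pad:
        cur += "1"                       # restore a chopped 1 bit
        pad += 1
    if cur in codes:                     # rule two: finish the last token
        out.append(codes[cur])
    return "".join(out)
```

Decoding "10000" ends exactly on a token boundary (rule one) and yields "S"; decoding the rule-two example byte 11110000 pads four hidden 1 bits and yields "E".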
One can write a ''one to one'' Huffman type of compression routine for any code that uses
9 to 512 symbol types. Since 8-bit ASCII only uses at most 256 values, one could
assign other leaves for control-type information. I can show you by example
how this works. Think of uppercase letters as standing for any 8-bit symbol. As soon as
you compress a token, replace that token with 2 tokens; this makes the
tree contain 257 leaves. In other words you compress a byte just like
in my old method, but you create 2 new symbols. If the character is not
repeated, those go away and two new symbols appear for the next character.
But if the character is repeated, one of those special symbols is used; then those
special symbols disappear and twice as many reappear in the tree. This keeps happening
until a regular symbol is found or the file ends. This may not seem 'one to one', since
there is a limit of 512 symbols in the tree. Well, then let's only double until there are
128 such symbols, and after that the total number stays the same.
You may ask why the special symbols are used to compress; here is why.
Take a file "ABCDBBCCCDDDDDDDDEGGG", uncompressed.
The number of leaves is 256 when the A is compressed;
then it is 257 when the B is compressed, since the A on the tree is replaced with A0 and A1.
That is, the leaf with the A is replaced by the label A0 and another leaf is added called A1.
When the C is compressed, the old leaves A0, A1, and B are
replaced by the leaves A, B0, and B1.
When the D is read, B0, B1, and D are replaced by B, D0, and D1.
When the next B is read, D0, D1, and B are replaced by D, B0, and B1.
Now another B is read, but there is no leaf for B since there is only one extra B,
and in compressing it I used the B0 token for the B and replaced B0 and B1 with
the tokens B00, B01, B10, and B11; these will not be used for the compression
since the next character is a C.
The 3 C's get replaced by C C1 and the 8 D's get replaced by D D1 D00,
so the compressed file, from a token point of view,
is "A B C D B B0 C C1 D D1 D00 E G G1".
Note that these are the tokens, and they will not all be the same length, since each
length is whatever the path through the tree is. Now let's look at decompression:
any file can be decompressed, since there is no stream of tokens derived from the
bit stream that does not make sense.
"A A" cannot appear, since once A is used it is no longer a leaf in the tree.
Likewise "A A0 A11 A0" is not valid, since that A0 is no longer a leaf in the tree; but if the last token
had been A011 it would have been valid.
The pattern of symbols tells how many times a character is repeated in the uncompressed
text. Example:
A means the symbol A appeared only once and that the characters that follow are not A's.
A A0 means 2 A's in sequence in the file, and the next character, if it exists, is not an A.
A A1 means 3 A's in sequence in the file, and the next character, if it exists, is not an A.
A A0 A00 means exactly 4 in sequence
A A0 A01 means 5
A A0 A10 means 6
A A0 A11 means 7
A A1 A00 means 8
A A1 A01 means 9
A A1 A10 means 10
A A1 A11 means 11 = 3 + 2*4
A A0 A00 A000 means 12
...
A A1 A11 A111 means 75 = 11 + 2*4*8
...
A A1 A11 A111 A1111 means 1099 = 75 + 2*4*8*16, and so on.
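The repeat counts in the list above follow a simple pattern: the bits after each base symbol, concatenated, form a binary index, and each extra token multiplies the range covered by the next power of two (2, then 8, then 64, then 1024). A small sketch of that arithmetic (the function name is mine); note that the last line of the list works out to 75 + 1024 = 1099.

```python
# Number of repeats encoded by a token run such as "A A1 A11 A111".
# `suffixes` holds the bit strings after the base symbol, e.g. ["1", "11"].

def run_count(suffixes):
    if not suffixes:
        return 1                       # bare "A": a single occurrence
    k = len(suffixes)
    # Counts already covered by shorter runs: 1, then +2, +8, +64, +1024, ...
    base = 1
    for depth in range(1, k):
        base += 2 ** (depth * (depth + 1) // 2)
    index = int("".join(suffixes), 2)  # concatenated suffix bits as binary
    return base + index + 1
```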
The above was constructed from the point of view that when decompression occurs, there
is always a valid path, so that ''one to one'' compression can take place.