David' Scott's BIJECTIFED VITTER ADAPTIVE COMPRESSION

THE STORY

vh2 updated on June 29,2005
the new VH2


files updated on December 6,2002
The Vitter style adaptive huffman code



This page decribes modification to Karl Malbrains version of Scott Vitters method for adaptive huffman compression. I wanted to look at this code to compare with mine to see the difference. My method h2com.exe is based on a modification of Sayoods work which was based on the FGK method. However my version gives a different tree shape compared to the Sayoods since I effectively start with a full tree and just give unused symbols a weight that is much smaller than the weight for any symbol used the amount updated is more than the sum of the unused symbols. This results in slow code but I felt it was effective and fixed the problems in original that made it not bijective.



VH

Basically made as few changes to the Karl's code as possible to get it to rum in DJGPP port of GNU C to DOS. The code would not run properly without these basic changes which where "w" and "r" to "wb" and "rb" This allowed this version I call it VH to compress the Calgary _Corpus test files to 1,829,923 bytes while H2COM compresses it to 1,829,860 This was surprising since Vitter code as based on his paper should most likely do a far better job than mine but it does not seem to. So what gives?



VH1

Ok so I tried to make VH bijective without changing any of the subroutines. I soon found this was impossible. Like Sayoods work years earlier it failed to correctly incorparate when a new symbol occurs. I had removed the 5 starting bytes turned off the automatic scaling replaced the I/O routines with bijective bit file I/O the bit_byte.cpp package that I wrote in bit_byte.cpp but it still failed to work bijectively. Then I modifed "huff_sendid" and "huff_readid" The comments in the huffsend routine imply the objective is to encode the new symbol with as few bits as possible. In many cases it used to many bits. This had to be corrected to make the code bijective. Bijective compressors waste no extra bits. Which after all is what compression is really about. The method I used is simialer to the method I am using in my bijevtive LZW compression. The method I used is much faster but uses two extra arrays and most importantly wastes no bits like the method it replaced. I have not seen Vitters original code. But this version used a slow and poor method of incorparating the new symbol. Any way at this point VH1 compressed to 1,829,725 This beats my old method and is a faster way of doing the bijective adaptive huffman compression. Not only is the tree built in a different order but the adding of the new symbols is very different. I guess you can teach an old dog new tricks I like this method. VH1 is a faster bijective adaptive huffman compressor than my original bijective adaptive huffman compressor and seems to compress typical files better than my old one based on the test set of files.



VH2

At this point after fixing "huff_sendid and huff_readif" I decided to go all out and make a version using 65,536 variables instead of the standard everyday 256 variable. This means every two bytes counts as one symbol instead of every byte. I made it fully bijective to normal files. That means it works for files or any size just like VH1 the file can be one byte or two bytes long or whatever one wants. None of the routines in the heart of code changed. Just did a bijective set of transforms around the reading and wrtting. One can easily see how it works and long files tend to compress better while short ones not as good. With VH2 the test set compressed to 1,615,633 bytes which easily tops things such as the RANGE CODER which is another fast entropy encoder modeled after an arithmetic coder.



VH3

This was a fast change to VH1 so that condtional bijective adaptive huffman compression could be used. This was just for fun very little testing. But basically it allows for one to choose the smbols and the inital start of the tree used for the compression. VH seemed to come from Karl with a very limited ability to do this if one did "VH c4 filein fileout" it would compress filein with the 4 symbols 0x000 0x0001 0x002 0x003 a very limited set. But in my version of VH3 "VH3 c filein fileout conditionfile" if the last file is "ABCXXX" it would compress filein with the symbols A B C and X. The compression would still be adaptive and and the tree shape determined by the actaul condition file. See exmple following for more information:


This is the input file to be tested call it A.TXT
0000  48 45 4C 4C 4F 20 57 4F 52 4C 44  .  .  .  .  .  *HELLO WORLD*
 number of bytes is 11 

This is the conditionfile call it COND.TXT
0000  41 42 43 44 45 45 45 45 46 47 4E 49 4A 4B 4C 4D  *ABCDEEEEFGNIJKLM*
0010  4F 50 51 52 53 54 54 54 54 54 55 56 57 58 59 5A  *OPQRSTTTTTUVWXYZ*
 number of bytes is 32 

VH3 c A.TXT B.TXT COND.TXT
the above command produces this file call it B.TXT
0000  28 0E 97 ED C3  .  .  .  .  .  .  .  .  .  .  .  *(....*
 number of bytes is 5 

VH3 d B.TXT C.TXT COND.TXT
produces this file C.TXT notice H and space not in file
since they where not in condtion file for the preovious
compress
0000  45 4C 4C 4F 57 4F 52 4C 44  .  .  .  .  .  .  .  *ELLOWORLD*
 number of bytes is 9 

VH3 d A.TXT D.TXT COND.TXT
produces following file D.TXT notice that the letters not
found in condtion file not in output and that E and T
which occur more than once are excessively represented.
0000  58 45 4C 54 56 52 47 4D 56 44 51 54 44 45 4C 56  *XELTVRGMVDQTDELV*
0010  49 49 54 45  .  .  .  .  .  .  .  .  .  .  .  .  *IITE*
 number of bytes is 20 

VH3 c D.TXT E.TXT COND.TXT
produces the file E.TXT which follows. This is same
as starting file since bijective and the same condtion
file used. On decompression only the characters in
condtion file produced so on compress since every thing
used it goes back to starting file. If one is careful one
could do some sort of "weak" simple encryption with this
kind of concept anyway ENJOY
0000  48 45 4C 4C 4F 20 57 4F 52 4C 44  .  .  .  .  .  *HELLO WORLD*
 number of bytes is 11 




SUMMARY FOR Calgary _Corpus test files


VH  256 vitter adaptive huffman adapted from Kalr Malbrains code
1,829,923 bytes 

H2COM David Scotts first adaptive 256 state huffman modified from Sayood
1,829,860 bytes

VH1 modified VH to be bijectivie
1,829,725 bytes

VH2 modified VH1 for 65,536 states workd any size files
1,615,633 bytes


VH3 special purpose VH1 allow one to choose values compressed
 and starting tree
good luck David A. Scott
ENTER here for MY Home Page