David' Scott's BIJECTIFED VITTER ADAPTIVE COMPRESSION
THE STORY
This page decribes modification to Karl Malbrains version of Scott Vitters method for
adaptive huffman compression. I wanted to look at this code to compare with mine to see the
difference. My method h2com.exe
is based on a modification of Sayoods work
which was based on the FGK method. However my version gives a different tree
shape compared to the Sayoods since I effectively start with a full tree and just give unused
symbols a weight that is much smaller than the weight for any symbol used the amount updated
is more than the sum of the unused symbols. This results in slow code but I felt it was effective
and fixed the problems in original that made it not bijective.
VH
Basically made as few changes to the Karl's code as possible to get
it to rum in DJGPP port of GNU C to DOS. The code would not run properly without these
basic changes which where "w" and "r" to "wb" and "rb" This allowed this version I
call it VH to compress the Calgary _Corpus test files to 1,829,923 bytes while
H2COM compresses it to 1,829,860 This was surprising since Vitter code as based on
his paper should most likely do a far better job than mine but it does not seem to. So what gives?
VH1
Ok so I tried to make VH bijective without changing any of the subroutines.
I soon found this was impossible. Like Sayoods work years earlier it failed to
correctly incorparate when a new symbol occurs. I had removed the 5 starting bytes
turned off the automatic scaling replaced the I/O routines with bijective bit file I/O
the bit_byte.cpp package
that I wrote in bit_byte.cpp but it still failed to work bijectively. Then I modifed "huff_sendid"
and "huff_readid" The comments in the huffsend routine imply the objective is to
encode the new symbol with as few bits as possible. In many cases it used to many
bits. This had to be corrected to make the code bijective. Bijective compressors waste
no extra bits. Which after all is what compression is really about. The method I used
is simialer to the method I am using in my bijevtive LZW compression. The method I
used is much faster but uses two extra arrays and most importantly wastes no bits
like the method it replaced. I have not seen Vitters original code. But this version
used a slow and poor method of incorparating the new symbol. Any way at this
point VH1 compressed to 1,829,725 This beats my old method and
is a faster way of doing the bijective adaptive huffman compression. Not only
is the tree built in a different order but the adding of the new symbols is very
different. I guess you can teach an old dog new tricks I like this method. VH1 is
a faster bijective adaptive huffman compressor than my original bijective adaptive
huffman compressor and seems to compress typical files better than my old one
based on the test set of files.
VH2
At this point after fixing "huff_sendid and huff_readif" I decided to go all out
and make a version using 65,536 variables instead of the standard everyday
256 variable. This means every two bytes counts as one symbol instead of
every byte. I made it fully bijective to normal files. That means it works for files
or any size just like VH1 the file can be one byte or two bytes long or whatever
one wants. None of the routines in the heart of code changed. Just did a bijective
set of transforms around the reading and wrtting. One can easily see how it works and
long files tend to compress better while short ones not as good. With VH2 the test
set compressed to 1,615,633 bytes which easily tops things such as
the RANGE CODER which is another fast entropy encoder modeled after
an arithmetic coder.
VH3
This was a fast change to VH1 so that condtional bijective adaptive huffman
compression could be used. This was just for fun very little testing. But basically
it allows for one to choose the smbols and the inital start of the tree used for the
compression. VH seemed to come from Karl with a very limited ability to
do this if one did "VH c4 filein fileout" it would compress filein with the
4 symbols 0x000 0x0001 0x002 0x003 a very limited set. But in my version
of VH3 "VH3 c filein fileout conditionfile" if the last file is "ABCXXX" it would
compress filein with the symbols A B C and X. The compression would still
be adaptive and and the tree shape determined by the actaul condition file.
See exmple following for more information:
This is the input file to be tested call it A.TXT
0000 48 45 4C 4C 4F 20 57 4F 52 4C 44 . . . . . *HELLO WORLD*
number of bytes is 11
This is the conditionfile call it COND.TXT
0000 41 42 43 44 45 45 45 45 46 47 4E 49 4A 4B 4C 4D *ABCDEEEEFGNIJKLM*
0010 4F 50 51 52 53 54 54 54 54 54 55 56 57 58 59 5A *OPQRSTTTTTUVWXYZ*
number of bytes is 32
VH3 c A.TXT B.TXT COND.TXT
the above command produces this file call it B.TXT
0000 28 0E 97 ED C3 . . . . . . . . . . . *(....*
number of bytes is 5
VH3 d B.TXT C.TXT COND.TXT
produces this file C.TXT notice H and space not in file
since they where not in condtion file for the preovious
compress
0000 45 4C 4C 4F 57 4F 52 4C 44 . . . . . . . *ELLOWORLD*
number of bytes is 9
VH3 d A.TXT D.TXT COND.TXT
produces following file D.TXT notice that the letters not
found in condtion file not in output and that E and T
which occur more than once are excessively represented.
0000 58 45 4C 54 56 52 47 4D 56 44 51 54 44 45 4C 56 *XELTVRGMVDQTDELV*
0010 49 49 54 45 . . . . . . . . . . . . *IITE*
number of bytes is 20
VH3 c D.TXT E.TXT COND.TXT
produces the file E.TXT which follows. This is same
as starting file since bijective and the same condtion
file used. On decompression only the characters in
condtion file produced so on compress since every thing
used it goes back to starting file. If one is careful one
could do some sort of "weak" simple encryption with this
kind of concept anyway ENJOY
0000 48 45 4C 4C 4F 20 57 4F 52 4C 44 . . . . . *HELLO WORLD*
number of bytes is 11
SUMMARY FOR Calgary _Corpus test files
VH 256 vitter adaptive huffman adapted from Kalr Malbrains code
1,829,923 bytes
H2COM David Scotts first adaptive 256 state huffman modified from Sayood
1,829,860 bytes
VH1 modified VH to be bijectivie
1,829,725 bytes
VH2 modified VH1 for 65,536 states workd any size files
1,615,633 bytes
VH3 special purpose VH1 allow one to choose values compressed
and starting tree
good luck David A. Scott
ENTER here for MY Home Page