David Scott's FOF ONE TO ONE HUFFMAN COMPRESSION

files updated on December 20,1999

My main concern is to create compression routines that are of great use to the person who some day may want to compress data before encryption is used. I feel very strongly that a compression routine should be "one to one" on the previous page I give the rules for h2com.exe; However this was for 256 symbol adaptive huffman compression.

By FOF ONE TO ONE HUFFMAN COMPRESSION I mean using the Finitely Odd File format to an 8 bit byte transform.

It is actaully very easy to do huffman compression and end the data stream in such a way that if one does not need worry about byte boundaries. It is rather easy to make it into a finitely odd bit stream and then use the Transforms to change it to files. Here is the trick to do it. And I wish to thank Matt Timmermans for his version of bijective arithmetic compression what follows is how would would do something similar in huffman compression. The correct way to do huffman compression and to waste not bits and make it a finitely odd bit stream ( means that at least one bit set and last bit that counts for anything is a 1). Take any huffman table but best to assign it in an almost normal form. Where the all zeros is on the right and is the lowest frequently occuring symbol and the all ones symbol is the highest frequency occurring symbol. normal the highest probability ones occur on the left and lowest occur on the right. All ones paths on tree come down on left and zero for the right paths. Here is the symbols results for an example tree:
A = 11
B = 10
C = 011
D = 010
E = 001
F = 000
Expected probability of A > B > C > D > E >F for this example

The Above is a stand tree lets assume static compression and you have sent the static table to a friend but you want to compress the file to 8 bit bytes files in a One to One way (bijective)

You also are using the same table for this example The first thing one should do to save space is reoder the internal values that are pairs that don't include the all Zeroes symbol or the all Ones symbol. In a special way in this case there is only one internal pair C and D. These are an internal pair due to the fact they have the same path through the tree except the last split and do not contain the all zeroes (F) or all ones (A) symbol. To check if C and D should change places. Make sure that the higher probability one has a short short path in the tree. What do I mean by short path. I mean go down the tree and eliminte the trailing ones or zeros and see which has the shortest path in this Case the C has a short path of the single bit 0 while D has the short path 01 and since C occurs more often than D these two don't have to switch positions in the tree. Every symbol but the All zeros and all ones have a short path. We code the last symbol if it is the all zero symbol as all zeros no change. But for the internal symbols we only use the huffman code until there would be a tail of opposite zero-one-ness. If we don't stop at an internal node of which there is the same number as internal symbols -2 ( minus two because of the all zero and all one symbol) Then the last symbol is the all one symbol and no space is wasted. If we get to the leaf of the all ones symbol then it counts as a extra symbol if preceeding the series of the all ones token there is the start of file or any token other than the all zeros symbol.

Same example as above Internal huffman values meaning another symbol is after it
A = 11
B= 10
C = 011
D = 010
E = 001
F = 000

Now the table if it is the last symbol in the compressed file
A = * special if F in front then 11 or groups of 11 count for each A
A = * if not F in front than there is one more A than sets of 11 also the empty symbol at start means one A. We use the nothingness to indicate a lone A being compressed
B= 1
C = 0
D = 01
E = 00
F = 000 note the all zero symbol has no changes

To confuse you more lets add the symbol 1 as the last symbol in the file to make it a finitely odd file and go from starting up in binary. Note in this case since A nothing if only character compressed file then you would have the lone 1
1 means an A
0 1 means a C
1 1 means a B
00 1 means E
01 1 means D
10 1 means BA
11 1 means AA
000 1 means F
001 1 means EA
010 1 means DA
011 1 means CA
100 1 means BE
101 1 means BB
110 1 means AC
111 1 means AB
....
1111111 1 means AAAB
00000000 1 means FFE
Now lets convert this to a 8 bit file using method to convert a finitely odd file to 8 bit byte files. This allows a person to use real files so that every possible value uniquely maps to a set of symbols in a bijective way so that infomation is not wasted as in most compression programs
00000000 is A
10000000 is C
01000000 is B
11000000 is E
10100000 is D
00100000 is BA
01100000 is AA
11100000 is F
11010000 is EA
10010000 is DA
10110000 is CA
00010000 is BE
00110000 is BB
01010000 is AC
01110000 is AB
...
01111111 is AAAB
11111111 is FFE

Below is a the file showing some results converting to and from finitely odd files. Here is int2fin.exe, fin2int.exe these routine can be used to change bases in a "one to one" way. Of course you don't need this if using scott16u or scott19u since they use any byte multiple. At this point you may want the compression routines that go to from FOF files here they are



  This is set of files to convert to and from finitely odd files.
A finitely  odd file can be thought of as a fraction that is .1
to .xxx...xxx10000 rest are zeros so the number is betweem
0 and 1 not including zero or one. Trailing zeros mean nothing
so a file as "34 00" is same as "34 00 00".
The programs are all run by  "program.exe  filein fileout"
PROGRAMS GO TO FINITELY ODD
int2fin.exe   Converts from files made of any number of bytes 
even2fin.exe  Converts from even byte length files
odd2fin.exe   Converts from odd byte length files
idea2fin.exe  Converts from 8 bytes multiple file

PROGRAMS FROM FINITELY ODD FILES to
fin2int.exe  Converts to byte file of any length
fin2even.exe Converts to even numbered byte files
fin2odd.exe  Converts to odd numbered byte files
fin2idea.exe Converts to 8 byte multiple files



AS example lets say we have inputs file alwasy odd
and we want to use IDEA to encrypt.
0000  20 74 68 69 73 20 69 73 20 61 6E 20 6F 64 64 20  * this is an odd *
0010  66 69 6C 65 21  .  .  .  .  .  .  .  .  .  .  .  *file!*
 number of bytes is 21 

Step one is convert the odd bytes numbered file to FOF (finitely
odd fraction) format. I will make a scott16v shortly that
will work with fintely odd files so the encryption will run
on these strange length bit files and then comvert result to bytes
but for know lets go on.
run odd2fin.exe
0000  A0 F4 68 E9 73 A0 69 F3 20 E1 6E A0 6F E4 64 A0  *..h.s.i. .n.o.d.*
0010  66 E9 6C E5 21  .  .  .  .  .  .  .  .  .  .  .  *f.l.!*
 number of bytes is 21 

now convert the finitely odd file to one that is a multiple of
8.  I know for this example the results for file that not already
in a multiple of 8 tend to have 00 bytes at end of file. One could
run a whitening pass to change this. But you do have a unique mapping
of the odd files to and from the 8 byte multiple files.
run fin2idea.exe
0000  20 F4 68 E9 73 A0 69 F3 C0 E1 6E A0 6F E4 64 A0  * .h.s.i...n.o.d.*
0010  A6 E9 6C E5 21 00 00 00  .  .  .  .  .  .  .  .  *..l.!...*
 number of bytes is 24 

The example above shows that for many lengths of files when you convert to a file with a large multiple byte count you can get lots of files that have 00 bytes to fill the last block. Here is a proposal to fill in those bytes. You assume the last byte is a zero if not you add one more block. You get a new block random data and replace the trailing 00 bytes with new random data since you sometime add a block if no trailing 00 there will always be 1 to 8 bytes to play with. Now go to the last byte since we are using this as a count byte you need to code in 0 to 7 so you know how many additional random bytes that need to be replace. Now break the bits up into 3 binary words of 2 bits 3 bits and 3 bits next change the first 2 bits to 3 bits by shifting a zero in to the lower position know XOR (b0,b1,0) with (b2,b3,b4) then take the real number of zeros being replaced and XOR that number to previous result. Now replace (b5,b6,b7) with the results and your ready to encrypt. This what I would do if the need came to fill in those bytes. However I don't see the need coming up since my code is going in the opposite direction. The next scott16f will take a finitely odd file and encrypt it. The adavantage of this is that you don't have to play padding games and your encryption is only encrypting pure data and not data that is padded with something to fill in the dead space. You can alwasys add padding after the encyption since even if the attacker knows what the padding is. It is of no help since it will not be used by the encryption which will be working with finitely odd files where packing is not needed so why bother the encryption with it. Besides this will make scott16f more like scott19u in that the wrap fields will be messy since only one in 8 chance the encrypted bit stream even ends on a byte boundary.

ENTER here for MY Home Page