David' Scott's CONDITIONED HUFFMAN COMPRESSION

files updated on September 23,1999

My main concern is to create compression routines that are of great use to the person who some day may want to compress data before encryption is used. I feel very strongly that a compression routine should be "one to one" on other previous page I give the rules for h2com.exe; However this is not always what one wants. Many times the file that is to be encrypted contains a very limited number of symbols. When one is using an adaptive huffman compression routine that allows for all 256 bit combinations every time a symbol is first used it will contain a very long string this takes up file space. What I have is a h2comx.exe that is run like h2com.exe except there is a third file that is used as a condition file. This file is not compressed but "conditions" the adaptive huffman compressor so that it can start using shorter tokens right away for the compressed text. Note if only the symbols in the condtion file are in the input file then the compressed file should be shorter than what h2com.exe gives. However as shown below if you use the wrong condition file it could compress to a much longer file than the method that does not use a condition file.
If one tries to decompress a file using the condition file. If the compressed file only used the symbols in the condition file those are the only symbols that come back. If one tries to decompress a random file. Most of the symbols that get decompressed map into those used in the condition file. This means that the method in general is not "one to one" if one always wants just the symbols used in the conditon file to come back. One could make a version of this compressor that requires the user to use a table between 128 and 513 leaves or one could make one that uses less entries as little as 9 such that each node ends on a symbol and the longest path is 8. That way any random file will decompress to only the set of symbols in the condition file and this would be a "one to one" compressor. I leave this to the student until I get bored and do this myself. But for the 9 symbol case. A possible coding for the 9 leaves is:
l1 = 1
l2 = 01 
l3 = 001
l4 = 0001
l5 = 00001
l6 = 000001
l7 = 0000001
l8 = 00000001
l9 = 00000000
in this case the there is only one possible shape to a tree that leads to the "one to one" property such that any file decompressed would go to a file with the 9 symbols of interest. This method may seem of little use but it does allow one to take a set such as the upper case letters of the alphabet. Create a starting huffman tree that makes the longest path to a leaf be at least 8 bits. Update the tree as I do but alwasy keep the tree such that longest path is 8 bits. When one is done encrypt the message with you method of choce and then for grins decompress the message after you encrypt it. Know you have a file with exactly the same symbols that you started with. Also if you decrypt this resultant file by doing reverse but using the wrong key the attacker is left with a file that is wrong but still only contains the symbols you choose to use. IF enough people want or if I get bored I will write the routines to do this and batch files to do the whole thing for the standard upper case alphabet using forward and reverese compression passes so that the "all or nothing" property is there. The user can supply the encyption method of choice.

HERE ARE TEST SETS

SET 1 Condition file: 0000 41 42 43 44 45 45 46 47 48 49 4A 4B 4C 4D 4E 4F *ABCDEEFGHIJKLMNO* 0010 50 51 52 53 54 54 55 56 57 58 59 5A 20 2E 30 31 *PQRSTTUVWXYZ .01* 0020 32 33 34 35 36 37 38 39 0D 0A . . . . . . *23456789..* number of bytes is 42 Input file: 0000 54 48 45 20 51 55 49 43 4B 20 42 52 4F 57 4E 20 *THE QUICK BROWN * 0010 46 4F 58 20 4A 55 4D 50 45 44 20 4F 56 45 52 20 *FOX JUMPED OVER * 0020 54 48 45 20 4C 41 5A 59 20 53 4C 4F 57 20 44 4F *THE LAZY SLOW DO* 0030 47 0D 0A . . . . . . . . . . . . . *G..* number of bytes is 51 Output file of H2com.exe 0000 AB 23 18 B3 B0 A1 44 B1 47 19 86 9E 41 C6 18 87 *.#....D.G...A...* 0010 BA 5F C3 83 CB 16 01 A2 87 E6 61 7B 9F 11 07 62 *._........a{...b* 0020 60 3B 4B 8C 91 C8 06 81 0E A1 38 31 E9 42 BC 35 *`;K.......81.B.5* 0030 71 15 5C BD 7B 3C 01 00 . . . . . . . . *q.\.{<..* number of bytes is 56 Output file of H2comx.exe 0000 6C 3E C2 48 0E 1C A1 06 C2 43 59 92 52 C8 62 84 *l>.H.....CY.R.b.* 0010 95 03 2A 9F 31 2A D4 6A 49 84 47 A2 07 3B E9 98 *..*.1*.jI.G..;..* 0020 8C 10 . . . . . . . . . . . . . . *..* number of bytes is 34 Set 2 Condition file: 0000 31 32 33 34 35 36 37 38 39 24 2E 0D 0A 0D 0A . *123456789$.....* number of bytes is 15 Input file: 0000 24 32 33 2E 34 35 0D 0A 24 35 34 2E 37 38 0D 0A *$23.45..$54.78..* 0010 24 31 32 33 2E 36 30 0D 0A 24 31 30 36 2E 39 39 *$123.60..$106.99* 0020 0D 0A . . . . . . . . . . . . . . *..* number of bytes is 34 Output file of H2com.exe 0000 DB 18 02 70 60 F2 4E B0 17 7A 36 E8 DF A1 AA 9A *...p`.N..z6.....* 0010 0D 90 2B 21 C4 1B 3E 85 CA 70 0D FB E2 . . . *..+!..>..p...* number of bytes is 29 Output file of H2comx.exe 0000 58 47 2B DF D4 8D 20 80 02 11 05 10 17 1D 99 88 *XG+... .........* 0010 78 A0 E1 . . . . . . . . . . . . . *x..* number of bytes is 19 Set 3 Output of H2comx.exe Useing input file of Set 1 but Condition file from Set 2 0000 C6 A0 2D 0B 84 9E CC BE 84 F5 2F B2 DD C6 A7 80 *..-......./.....* 0010 E5 0B 2D 3B 8C 7D 26 48 00 D9 01 50 42 95 59 01 *..-;.}&H...PB.Y.* 0020 A6 6F CA 1A 87 D2 DC 79 D9 6A 6B 19 A0 03 8B AE *.o.....y.jk.....* 0030 19 AC C0 F0 E8 75 3D E0 23 33 . . . . . . *.....u=.#3* number of bytes is 58

ENTER here for MY Home Page