files updated on April 30,2001
The disadvantages of a standard arithmetic compressor are two fold. One you carry a wasted table entry for the EOF symbol which forces other entries to have a slightly longer encoding. Two you have to write an EOF symbol to the compressed file every time you terminate the compressing so many times you add this last character even when its not needed.
There are many ways to make an arithmetic compressor bijective. They just don't seem to be in the open literature. I feel that the reason for this is that compression that is not bijective can greatly weaken encryption.
One clever way to do a bijective arithmetic compressor is to do it the way Matt Timmermans did it with his method of multiple free ends. I am going to do it an alternate way that I hope is easy to follow.
Fist look at a huffman tabel where symbols are arranged in order from least likely to most likely
A 1111 B 1110 C 110 D 101 E 100 F 011 G 010 H 00
Since to end one really wants to make sure the last symbol has a one bit in it. Notice the H symbol does not. So any time a file ends with H or H followed by E's I could add one more E and then I have a unique representation in infinite bit stream I am compressing to.
First thing to note is that I really didn't have to use the E symbol any symbol other than H could have been used. For my simple adaptive aritmetic bijective compressor I choose to use G the next most frequently used symbol. This is great for Huffman where every symbol used has at least one bit and G from table above will always have at least one one bit set. What about in arithmetic?
Well H in arithmetic since its the most common symbol could have less then "one bit". What is lower limit on G ( the next most probable symbol ) That limit is "one bit" and that occurs when H occurs 50% of the time and G occurs the other 50% of the time with rest of symbol occuring 0% of time. Also note any time G is longer than One bit H is as long or longer than G in terms of number of bits. Also notice any time H less then one bit since it makes up more than 50% likely hood and G has to be less than 50% so that its going to be more than one bit in length.
Next now that will know G is at least one bit. How can we be sure that it will contian a one bit.
Look at Top_values its all ones. notice that when you stop playing at the end of encoding each symbol. There is a "High" and a "Low" value state that is carried into start of encoding each symbol. Notice also the low at carryover always starts with a "zero" ( most significant bit) and the high always starst with a "one". What a person is doing is allowing only the next bits to come from that range. Since the second most common symbol is after the first it and it smallest length is 1 it has to have a one.
Here is a good point for you to test my code. It will always be as small or smaller than the old way. it incudes examples you can get it here
The next question is even though its bijective could one save more space than my simple example. Yes one way is by choosing a different arangement of where in the table you place the segments. For example I only write out extra end symbol when the last symbol was all zeroes. It makes sense to make the least occuring symbol to use the all zero segment. But one has to be carful since if you do this you have to assure your self that the most common symbol has partial bit stream that will force a one or you have to make code complicated. One could use different arrangements of segments depending on how many bits have to be written if the next symbol is last symbol. This I think could become quite complicated and one can even consider the final length when its coverted to a file in ones planning of how to arrange the segments for minium length. There might be a payoff in encryption or when one is forced to squeeze last drop out. Such as trying to compress to a small fixed space.
I feel this also leads to a better huffman set for the table than model I used above.
The normal conical model is.
A 0000 B 0001 C 001 D 010 E 011 F 100 G 101 H 11 But one can see this not optimal. Better would be A = 0000 B = 0001 C = 001 D = 011 E = 111 F = 010 G = 110 H = 10
One last comment if you know the last symbol of a file is always "X" then you always drop the last symbol of the file and compress normally and on decompression you add it in.