This is discusion is based on Mark' BWTCODE.ZIP the file and Mark's
Page is here
The article claims that the package beats PKZIP for the Calgary Corpus
File set. It says it gets 978,122 bytes. This may be from an old
package as I got 978,129 here are the results:
THE BASLINE: RLE BWT MTF RLE ARI
BIB CG1 111,261 08-14-00 5:50p bib.cg1
BOOK1 CG1 768,771 05-08-90 4:21p BOOK1.cg1
BOOK2 CG1 610,856 05-08-90 4:21p BOOK2.cg1
GEO CG1 102,400 05-08-90 4:21p GEO.cg1
NEWS CG1 377,109 05-08-90 4:21p NEWS.cg1
OBJ1 CG1 21,504 05-08-90 4:21p OBJ1.cg1
OBJ2 CG1 246,814 05-08-90 4:22p OBJ2.cg1
PAPER1 CG1 53,161 05-08-90 4:22p PAPER1.cg1
PAPER2 CG1 82,199 05-08-90 4:22p PAPER2.cg1
PAPER3 CG1 46,526 05-08-90 4:22p PAPER3.cg1
PAPER4 CG1 13,286 05-08-90 4:22p PAPER4.cg1
PAPER5 CG1 11,954 05-08-90 4:22p PAPER5.cg1
PAPER6 CG1 38,105 05-08-90 4:22p PAPER6.cg1
PIC CG1 513,216 05-08-90 4:22p PIC.cg1
PROGC CG1 39,611 05-08-90 4:22p PROGC.cg1
PROGL CG1 71,646 05-08-90 4:22p PROGL.cg1
PROGP CG1 49,379 05-08-90 4:22p PROGP.cg1
TRANS CG1 93,695 05-08-90 4:22p TRANS.cg1
18 file(s) 3,251,493 bytes
BIB CF 29,567 09-20-00 12:34p BIB.cf
BOOK1 CF 275,827 09-20-00 12:34p BOOK1.cf
BOOK2 CF 186,588 09-20-00 12:34p BOOK2.cf
GEO CF 62,120 09-20-00 12:34p GEO.cf
NEWS CF 134,184 09-20-00 12:34p NEWS.cf
OBJ1 CF 10,857 09-20-00 12:34p OBJ1.cf
OBJ2 CF 81,948 09-20-00 12:34p OBJ2.cf
PAPER1 CF 17,723 09-20-00 12:34p PAPER1.cf
PAPER2 CF 26,956 09-20-00 12:34p PAPER2.cf
PAPER3 CF 16,994 09-20-00 12:34p PAPER3.cf
PAPER4 CF 5,528 09-20-00 12:34p PAPER4.cf
PAPER5 CF 5,137 09-20-00 12:34p PAPER5.cf
PAPER6 CF 13,161 09-20-00 12:34p PAPER6.cf
PIC CF 50,829 09-20-00 12:34p PIC.cf
PROGC CF 13,315 09-20-00 12:34p PROGC.cf
PROGL CF 16,688 09-20-00 12:34p PROGL.cf
PROGP CF 11,406 09-20-00 12:34p PROGP.cf
TRANS CF 19,301 09-20-00 12:34p TRANS.cf
18 file(s) 978,129 bytes
But the bad news is that this compression is not even close
to bijective due to the excess bagage carried by the extra
data fields and the "?" EOD marker that is in the main data
field.
Since my concern is to use compression and eliminate waste
or added info that is useful in breaking encryption. Lets
exaime how much waste this method added to the compression.
The first problem is the BWT is not fully reversible even
if one ignores the pointer that comes out of it so one can
tell where the original data started example: ab transforms
to ba,1 and ba transforms tp ba,0 but there is no reverse
transform for ab,0 or ab,1 which means that if one is trying
to do reverse transforms on random data, one would most likely
end up with something not reversible. This problem can be
handled and reduced. I will do so later. But the question
is how much does this effect have on the compression encryption
system.
This problem has to do with cycles and when one forms the T[]
vector during the UNBWT you would generally like it to cycle through
each value without getting in a short cycle. This means that if
you look at a series of symbols and form the T[] what is the
likelyhood that the data will happen to be in a single cycle.
The worst case occurs when all symbols different. This puts
an upper limit on how often one would have to try an inverse
BWT before you get one that works. It is not possible to get
separate symbols for files as long as 2**16 bytes, but so I
can use the upper limit I will assume you can. This means that
1 in 2**16 trys would be succsessful. So you would lose about
16 bits of key strength in your encryption. My goal
for this page is to modify the added info in Mark's system
so that this is the only effect left. Getting as close as possible
through the rest of operations leading to a compressed file.
Suppose one was using Skipjack an 80 bit NSA method the government wanted to cram down our throats. If you built a code pakage like PGP but used Skipjack for the encryption phase and Mark's BWT compression for the compression phase. Your in big trouble. Beacause the extra fields alone in very first stage is enough to make sure that only one possible decryption exists. Namely the one you wanted to hide. In the front of file you get a 4 bytes length field. Which is a waste of space since you can read the file and see how long it is. If you read the file you can check exactly to see that it contains the correct 32 bits of info ( if the file is to big for you buffer then you still know what's in the first length field so it is never needed). Lets pretend its a short file of less than 65,000 bytes. In that case the pointer to "last" has to point to a character in the BWT data filed and that chacter has to be exactly "?" so thats another 32 + 8 for 72 bits specifed. Also the "first" has to have the last 16 bits set to zero for this short file so now we are at 88 bits. We still have not tested to see if the reverse BWT exists or if the "first" and "last" are in lock step when the T[] is calculated. But enough bits are so specifed that with an 80 bit key encryption scheme there is already an overwhelming chance that there can only be one solution to this. The one your trying to hide. And I haven't even got into the other nonbijective aspects of this set of compressions processes the RLE and the ARI routines are both nonbijective.
The first improvement one can make to get rid of all the extra fields one does not need. The EOD "?" that Mark added can be dropped by doing a full BWT and not adding an extra character. You only need one pointer field. You can use first middle last your choice. I choose last. You can drop the length field. Since even if you use small blocks. You still should know where everything is. One can also try to use a single block which is what I did in BWTQ. Then there is the nasty problem of adding the Pointer Field in an optimal way such that no information is provided to an attacker see DSC the source code of BWTQ and DSC along with test programs source code and PC executables is here
Here are the results of changing BWT to BWTQ and adding DSC:
USING RLE BWTQ DSC MTF RLE ARI
BIB CF 29,553 09-20-00 12:41p BIB.cf
BOOK1 CF 251,659 09-20-00 12:41p BOOK1.cf
BOOK2 CF 170,050 09-20-00 12:41p BOOK2.cf
GEO CF 62,109 09-20-00 12:41p GEO.cf
NEWS CF 125,688 09-20-00 12:41p NEWS.cf
OBJ1 CF 10,848 09-20-00 12:41p OBJ1.cf
OBJ2 CF 79,951 09-20-00 12:41p OBJ2.cf
PAPER1 CF 17,709 09-20-00 12:41p PAPER1.cf
PAPER2 CF 26,941 09-20-00 12:41p PAPER2.cf
PAPER3 CF 16,979 09-20-00 12:41p PAPER3.cf
PAPER4 CF 5,518 09-20-00 12:41p PAPER4.cf
PAPER5 CF 5,122 09-20-00 12:41p PAPER5.cf
PAPER6 CF 13,148 09-20-00 12:41p PAPER6.cf
PIC CF 50,819 09-20-00 12:41p PIC.cf
PROGC CF 13,302 09-20-00 12:41p PROGC.cf
PROGL CF 16,676 09-20-00 12:41p PROGL.cf
PROGP CF 11,395 09-20-00 12:41p PROGP.cf
TRANS CF 19,293 09-20-00 12:41p TRANS.cf
18 file(s) 926,760 bytes
Note there are still big problems here since Mark's RLE is not 1-1 SEE MY UNADULTERATED RLE The RLE he choose to use is very poor since it adds info to the compressed file that is not needed except by one trying to break the code.
At this point there are many chioces to try to make the rest of operations 1 - 1. I could use my drop in replacment for RLE as mentioned above. But I decided I did not like the fact RLE is done twice and that the MTF maps to the "zero symbol" why add new symbols if there are not needed. So I created a MTFQ which just calls the first symbol seen as "ZERO" and the second one seen as "ONE" and so on. But other than that change of never introducing a symbol not used, it is exacatly like MTF. The other routine modifed here was the RLE. Everytime I think about RLE I come up with something different. In this case for fun I decided to make a 1-1 RLE compressor that does its best not to add new symbols just like the MTFQ. SO I call it the A B C of RLE's since it uses first 3 symbols seen in file for the compression it is RLEQ. The source code and PC executables are in the file here
Here is the results of these addtional changes:
Using BWTQ DSC MTFQ RLEQ ARI
BIB CF 29,166 09-20-00 1:08p BIB.cf
BOOK1 CF 246,017 09-20-00 1:08p BOOK1.cf
BOOK2 CF 166,706 09-20-00 1:09p BOOK2.cf
GEO CF 61,508 09-20-00 1:09p GEO.cf
NEWS CF 123,735 09-20-00 1:09p NEWS.cf
OBJ1 CF 10,705 09-20-00 1:09p OBJ1.cf
OBJ2 CF 79,021 09-20-00 1:09p OBJ2.cf
PAPER1 CF 17,426 09-20-00 1:09p PAPER1.cf
PAPER2 CF 26,413 09-20-00 1:09p PAPER2.cf
PAPER3 CF 16,674 09-20-00 1:09p PAPER3.cf
PAPER4 CF 5,375 09-20-00 1:09p PAPER4.cf
PAPER5 CF 4,977 09-20-00 1:09p PAPER5.cf
PAPER6 CF 12,894 09-20-00 1:09p PAPER6.cf
PIC CF 51,814 09-20-00 1:09p PIC.cf
PROGC CF 13,185 09-20-00 1:09p PROGC.cf
PROGL CF 16,713 09-20-00 1:09p PROGL.cf
PROGP CF 11,445 09-20-00 1:09p PROGP.cf
TRANS CF 18,797 09-20-00 1:09p TRANS.cf
18 file(s) 912,571 bytes
Note there still are big problems here since the Arithmetic coder is not the kind one uses when one wants to compress before an encryption pass since it uses the Old fashion concept of an EOF that wastes space by being carried along and then used only once at the end of file.
In the next section dropping the old style arithmetic coder for my orginal 1-1 huffman compressor.
The problem with the above is that an enemy can still gain to much knowledge about wether the supplied key can be the actually key used. The added info is so large that in many cases this file is such that it can only be decompressed correctly if the correct "key" was used for the decryption. This is not good in the coming world when quantum computers will be more common. In the non black world IBM already has working quantum computers. Here the result of replacing ARI with h2com.exe
Here are the results that follow:
Using BWTQ DSC MTFQ RLEQ H2COM
BIB CF 29,572 09-20-00 12:59p BIB.cf
BOOK1 CF 254,607 09-20-00 12:59p BOOK1.cf
BOOK2 CF 173,068 09-20-00 12:59p BOOK2.cf
GEO CF 63,223 09-20-00 12:59p GEO.cf
NEWS CF 126,947 09-20-00 12:59p NEWS.cf
OBJ1 CF 10,868 09-20-00 12:59p OBJ1.cf
OBJ2 CF 80,631 09-20-00 12:59p OBJ2.cf
PAPER1 CF 17,676 09-20-00 12:59p PAPER1.cf
PAPER2 CF 26,808 09-20-00 12:59p PAPER2.cf
PAPER3 CF 16,779 09-20-00 12:59p PAPER3.cf
PAPER4 CF 5,339 09-20-00 12:59p PAPER4.cf
PAPER5 CF 4,989 09-20-00 12:59p PAPER5.cf
PAPER6 CF 13,038 09-20-00 12:59p PAPER6.cf
PIC CF 54,524 09-20-00 12:59p PIC.cf
PROGC CF 13,338 09-20-00 12:59p PROGC.cf
PROGL CF 16,839 09-20-00 12:59p PROGL.cf
PROGP CF 11,476 09-20-00 12:59p PROGP.cf
TRANS CF 18,832 09-20-00 12:59p TRANS.cf
18 file(s) 938,554 bytes
Notice even though this is a better compression for files that will be encrypted. It only bets the Arithmetic coder in making a smaller file for the file PAPER4.CG1 That the ARI compresses better than a Huffman coder is not a surprise since if all other things are equal a huffman coder is a subset of an Arithmetic coder. I could stop here since my task of making a BWT type of compressor for files that are to be encrypted has ended. And it compresses smaller than the orignal one sugguest by Mark. However it does not give one much satisfaction that the ARI usually compresses these Corpus files to a smaller size. What is one to do. SEE NEXT SECTION
We are in luck there is a 1-1 arithmetic compress. Matt Timmermans bijective compress SEE MATTS PAGE HERE and get source code and discussions of what he did.
Here are the results of that mode:
Using BWTQ DSC MTFQ RLEQ BIACODE
BIB CF 28,798 09-20-00 1:16p BIB.cf
BOOK1 CF 242,197 09-20-00 1:16p BOOK1.cf
BOOK2 CF 164,877 09-20-00 1:17p BOOK2.cf
GEO CF 60,523 09-20-00 1:17p GEO.cf
NEWS CF 122,989 09-20-00 1:17p NEWS.cf
OBJ1 CF 10,724 09-20-00 1:17p OBJ1.cf
OBJ2 CF 79,294 09-20-00 1:17p OBJ2.cf
PAPER1 CF 17,131 09-20-00 1:17p PAPER1.cf
PAPER2 CF 25,989 09-20-00 1:17p PAPER2.cf
PAPER3 CF 16,351 09-20-00 1:17p PAPER3.cf
PAPER4 CF 5,234 09-20-00 1:17p PAPER4.cf
PAPER5 CF 4,848 09-20-00 1:17p PAPER5.cf
PAPER6 CF 12,670 09-20-00 1:17p PAPER6.cf
PIC CF 51,416 09-20-00 1:17p PIC.cf
PROGC CF 13,025 09-20-00 1:17p PROGC.cf
PROGL CF 16,463 09-20-00 1:17p PROGL.cf
PROGP CF 11,233 09-20-00 1:17p PROGP.cf
TRANS CF 18,652 09-20-00 1:17p TRANS.cf
18 file(s) 902,414 bytes
Notice this seems to compress much better than ARI. I hoped it would since it is not carrying extra space for an unneeded EOF character that the old style is using. So this is a better compression set for files that are to be encrypted and it seems to be a better compressor than ARI all the way around. ARI could be fixed to eliminate the wasteful EOF that it carries but for now one can use Matt's if they want more secure compression for files that are to be encrypted.
It is clear that if one use a raw BWT that a DSC type of program needs to be used after it. However there are ways to make the required pointer smaller so that the pointer value covers less range. It involves minor changes to the BWT and UNBWT to allow more than the single cycle of data. I may do that for next project since it seems easy to do and any forced reduction of N by one bit means that an attacker would get twice as many passable false files when one decrypts.
It is clear that there are many possibilites for the MTF RLE combinations and that different types should be selected if one wishes different versions of the programs for special files. It is not possible to make one version compress all files. Since every real compressor increases most files.
It is also clear that the last stage should be a compressor that does not need a specail EOF character since not only does the EOF character waste space but makes the job of an attacker much easier.
ALL RESULTS COMPRESSION OF CALGARY CORPUS FILE SET
FROM MARKS WEBPAGE
PKZIP 1,071,986
BWTCODE 978,122
FROM MY TESTS
RLE BWT MTF RLE ARI 978,129
RLE BWTQ DSC MTF RLE ARI 926,760
BWTQ DSC MTFQ RLEQ ARI 912,571
ONLY THE FOLLOWING TWO SHOULD BE USED IF ONE COMPRESSING THEN ENCRYPTING
BWTQ DSC MTFQ RLEQ H2COM 938,554
BWTQ DSC MTFQ RLEQ BIACODE 902,414 ** the best of all worlds **
This means that quantum computers will not find a unique solution which until now was not possible. If one was using the old BWT compression there would like be only one solution making it much easier to break.
This close to Bijectiveness was done by the use of a DSC routine which combines the pointer object to the BWT transform in a optimal way. Then all the later stages are made 1-1 with the last stage being a bijective arithmetic coder.
The file bibwt1.zip contains the PC executables and batch files to do the last 3 types of compressions and their inverses for those wishing to test it.