David's View of BWT (Burrows Wheeler Transform) in Compression before Encryption


last update 2000 sept 21

This page looks at a simple BWT compression package and suggests modifications that make it useful for a compression pass before encryption, and hopefully will give ideas so others can improve basic compression packages so they will compress not only better but so that they will also be useful for encryption.

Table of Contents


Why is there a problem with compression and encryption

The problem with most compression methods is that they don't concern themsevles with making the reverse compression work for all files. This leads to the unhappy feature that if you compress your files and then encrypt. Or worse yet as in PGP its done for you so that you have little say in the matter. When an ememy tests by whatever means some subset of the "keys" used in the encryption it would be nice if many such keys lead to a false file to delay an enemy from finding the true message. As quantum computers become more of a working reality (the NSA may already have them) then it would be nice to use compression encryption combinations that do not have a unique solution so that a quantum computer does not find it. Since it seems this particular type of computer will be best at problems where the solution can settle down to some fixed state. At present there are very few 1-1 compression schemes out in the open. My huffman schemes and Matt's arithmetic coder are the only major ones I know of. It would be nice to get one based on the BWT so that much higher compression ratios could be achieved. And becasue the BWT itself would prevent many forms of partial choosen plain text attacks. At least like those that destroyted Enigma. Since the data is scattered across several code blocks and it would be unlikely if implemented properly that looking at only a few blocks of encyrpted text would lead an enemy to the correct key. The problem with most ciphers is that if an enemy konws part of a message he may only exaime the blocks where that part of message occurs. This is not possible if one use a BWT compression scheme in the correct way.


Mark' BWTCODE and why it's not good with encryption

This is discusion is based on Mark' BWTCODE.ZIP the file and Mark's Page is here The article claims that the package beats PKZIP for the Calgary Corpus File set. It says it gets 978,122 bytes. This may be from an old package as I got 978,129 here are the results:


 THE BASLINE: RLE BWT MTF RLE ARI

BIB      CG1       111,261  08-14-00  5:50p bib.cg1
BOOK1    CG1       768,771  05-08-90  4:21p BOOK1.cg1
BOOK2    CG1       610,856  05-08-90  4:21p BOOK2.cg1
GEO      CG1       102,400  05-08-90  4:21p GEO.cg1
NEWS     CG1       377,109  05-08-90  4:21p NEWS.cg1
OBJ1     CG1        21,504  05-08-90  4:21p OBJ1.cg1
OBJ2     CG1       246,814  05-08-90  4:22p OBJ2.cg1
PAPER1   CG1        53,161  05-08-90  4:22p PAPER1.cg1
PAPER2   CG1        82,199  05-08-90  4:22p PAPER2.cg1
PAPER3   CG1        46,526  05-08-90  4:22p PAPER3.cg1
PAPER4   CG1        13,286  05-08-90  4:22p PAPER4.cg1
PAPER5   CG1        11,954  05-08-90  4:22p PAPER5.cg1
PAPER6   CG1        38,105  05-08-90  4:22p PAPER6.cg1
PIC      CG1       513,216  05-08-90  4:22p PIC.cg1
PROGC    CG1        39,611  05-08-90  4:22p PROGC.cg1
PROGL    CG1        71,646  05-08-90  4:22p PROGL.cg1
PROGP    CG1        49,379  05-08-90  4:22p PROGP.cg1
TRANS    CG1        93,695  05-08-90  4:22p TRANS.cg1
        18 file(s)      3,251,493 bytes


BIB      CF         29,567  09-20-00 12:34p BIB.cf
BOOK1    CF        275,827  09-20-00 12:34p BOOK1.cf
BOOK2    CF        186,588  09-20-00 12:34p BOOK2.cf
GEO      CF         62,120  09-20-00 12:34p GEO.cf
NEWS     CF        134,184  09-20-00 12:34p NEWS.cf
OBJ1     CF         10,857  09-20-00 12:34p OBJ1.cf
OBJ2     CF         81,948  09-20-00 12:34p OBJ2.cf
PAPER1   CF         17,723  09-20-00 12:34p PAPER1.cf
PAPER2   CF         26,956  09-20-00 12:34p PAPER2.cf
PAPER3   CF         16,994  09-20-00 12:34p PAPER3.cf
PAPER4   CF          5,528  09-20-00 12:34p PAPER4.cf
PAPER5   CF          5,137  09-20-00 12:34p PAPER5.cf
PAPER6   CF         13,161  09-20-00 12:34p PAPER6.cf
PIC      CF         50,829  09-20-00 12:34p PIC.cf
PROGC    CF         13,315  09-20-00 12:34p PROGC.cf
PROGL    CF         16,688  09-20-00 12:34p PROGL.cf
PROGP    CF         11,406  09-20-00 12:34p PROGP.cf
TRANS    CF         19,301  09-20-00 12:34p TRANS.cf
        18 file(s)        978,129 bytes

But the bad news is that this compression is not even close to bijective due to the excess bagage carried by the extra data fields and the "?" EOD marker that is in the main data field.
Since my concern is to use compression and eliminate waste or added info that is useful in breaking encryption. Lets exaime how much waste this method added to the compression. The first problem is the BWT is not fully reversible even if one ignores the pointer that comes out of it so one can tell where the original data started example: ab transforms to ba,1 and ba transforms tp ba,0 but there is no reverse transform for ab,0 or ab,1 which means that if one is trying to do reverse transforms on random data, one would most likely end up with something not reversible. This problem can be handled and reduced. I will do so later. But the question is how much does this effect have on the compression encryption system.
This problem has to do with cycles and when one forms the T[] vector during the UNBWT you would generally like it to cycle through each value without getting in a short cycle. This means that if you look at a series of symbols and form the T[] what is the likelyhood that the data will happen to be in a single cycle. The worst case occurs when all symbols different. This puts an upper limit on how often one would have to try an inverse BWT before you get one that works. It is not possible to get separate symbols for files as long as 2**16 bytes, but so I can use the upper limit I will assume you can. This means that 1 in 2**16 trys would be succsessful. So you would lose about 16 bits of key strength in your encryption. My goal for this page is to modify the added info in Mark's system so that this is the only effect left. Getting as close as possible through the rest of operations leading to a compressed file.

Suppose one was using Skipjack an 80 bit NSA method the government wanted to cram down our throats. If you built a code pakage like PGP but used Skipjack for the encryption phase and Mark's BWT compression for the compression phase. Your in big trouble. Beacause the extra fields alone in very first stage is enough to make sure that only one possible decryption exists. Namely the one you wanted to hide. In the front of file you get a 4 bytes length field. Which is a waste of space since you can read the file and see how long it is. If you read the file you can check exactly to see that it contains the correct 32 bits of info ( if the file is to big for you buffer then you still know what's in the first length field so it is never needed). Lets pretend its a short file of less than 65,000 bytes. In that case the pointer to "last" has to point to a character in the BWT data filed and that chacter has to be exactly "?" so thats another 32 + 8 for 72 bits specifed. Also the "first" has to have the last 16 bits set to zero for this short file so now we are at 88 bits. We still have not tested to see if the reverse BWT exists or if the "first" and "last" are in lock step when the T[] is calculated. But enough bits are so specifed that with an 80 bit key encryption scheme there is already an overwhelming chance that there can only be one solution to this. The one your trying to hide. And I haven't even got into the other nonbijective aspects of this set of compressions processes the RLE and the ARI routines are both nonbijective.


First improvement BWTQ and DSC

The first improvement one can make to get rid of all the extra fields one does not need. The EOD "?" that Mark added can be dropped by doing a full BWT and not adding an extra character. You only need one pointer field. You can use first middle last your choice. I choose last. You can drop the length field. Since even if you use small blocks. You still should know where everything is. One can also try to use a single block which is what I did in BWTQ. Then there is the nasty problem of adding the Pointer Field in an optimal way such that no information is provided to an attacker see DSC the source code of BWTQ and DSC along with test programs source code and PC executables is here

Here are the results of changing BWT to BWTQ and adding DSC:


 USING RLE BWTQ DSC MTF RLE ARI


BIB      CF         29,553  09-20-00 12:41p BIB.cf
BOOK1    CF        251,659  09-20-00 12:41p BOOK1.cf
BOOK2    CF        170,050  09-20-00 12:41p BOOK2.cf
GEO      CF         62,109  09-20-00 12:41p GEO.cf
NEWS     CF        125,688  09-20-00 12:41p NEWS.cf
OBJ1     CF         10,848  09-20-00 12:41p OBJ1.cf
OBJ2     CF         79,951  09-20-00 12:41p OBJ2.cf
PAPER1   CF         17,709  09-20-00 12:41p PAPER1.cf
PAPER2   CF         26,941  09-20-00 12:41p PAPER2.cf
PAPER3   CF         16,979  09-20-00 12:41p PAPER3.cf
PAPER4   CF          5,518  09-20-00 12:41p PAPER4.cf
PAPER5   CF          5,122  09-20-00 12:41p PAPER5.cf
PAPER6   CF         13,148  09-20-00 12:41p PAPER6.cf
PIC      CF         50,819  09-20-00 12:41p PIC.cf
PROGC    CF         13,302  09-20-00 12:41p PROGC.cf
PROGL    CF         16,676  09-20-00 12:41p PROGL.cf
PROGP    CF         11,395  09-20-00 12:41p PROGP.cf
TRANS    CF         19,293  09-20-00 12:41p TRANS.cf
        18 file(s)        926,760 bytes

Note there are still big problems here since Mark's RLE is not 1-1 SEE MY UNADULTERATED RLE The RLE he choose to use is very poor since it adds info to the compressed file that is not needed except by one trying to break the code.


Next improvement Dropping non 1-1 RLE

At this point there are many chioces to try to make the rest of operations 1 - 1. I could use my drop in replacment for RLE as mentioned above. But I decided I did not like the fact RLE is done twice and that the MTF maps to the "zero symbol" why add new symbols if there are not needed. So I created a MTFQ which just calls the first symbol seen as "ZERO" and the second one seen as "ONE" and so on. But other than that change of never introducing a symbol not used, it is exacatly like MTF. The other routine modifed here was the RLE. Everytime I think about RLE I come up with something different. In this case for fun I decided to make a 1-1 RLE compressor that does its best not to add new symbols just like the MTFQ. SO I call it the A B C of RLE's since it uses first 3 symbols seen in file for the compression it is RLEQ. The source code and PC executables are in the file here

Here is the results of these addtional changes:


Using BWTQ DSC MTFQ RLEQ ARI 

BIB      CF         29,166  09-20-00  1:08p BIB.cf
BOOK1    CF        246,017  09-20-00  1:08p BOOK1.cf
BOOK2    CF        166,706  09-20-00  1:09p BOOK2.cf
GEO      CF         61,508  09-20-00  1:09p GEO.cf
NEWS     CF        123,735  09-20-00  1:09p NEWS.cf
OBJ1     CF         10,705  09-20-00  1:09p OBJ1.cf
OBJ2     CF         79,021  09-20-00  1:09p OBJ2.cf
PAPER1   CF         17,426  09-20-00  1:09p PAPER1.cf
PAPER2   CF         26,413  09-20-00  1:09p PAPER2.cf
PAPER3   CF         16,674  09-20-00  1:09p PAPER3.cf
PAPER4   CF          5,375  09-20-00  1:09p PAPER4.cf
PAPER5   CF          4,977  09-20-00  1:09p PAPER5.cf
PAPER6   CF         12,894  09-20-00  1:09p PAPER6.cf
PIC      CF         51,814  09-20-00  1:09p PIC.cf
PROGC    CF         13,185  09-20-00  1:09p PROGC.cf
PROGL    CF         16,713  09-20-00  1:09p PROGL.cf
PROGP    CF         11,445  09-20-00  1:09p PROGP.cf
TRANS    CF         18,797  09-20-00  1:09p TRANS.cf
        18 file(s)        912,571 bytes

Note there still are big problems here since the Arithmetic coder is not the kind one uses when one wants to compress before an encryption pass since it uses the Old fashion concept of an EOF that wastes space by being carried along and then used only once at the end of file.

In the next section dropping the old style arithmetic coder for my orginal 1-1 huffman compressor.


Changing last stage to a 1-1 Huffman coder

The problem with the above is that an enemy can still gain to much knowledge about wether the supplied key can be the actually key used. The added info is so large that in many cases this file is such that it can only be decompressed correctly if the correct "key" was used for the decryption. This is not good in the coming world when quantum computers will be more common. In the non black world IBM already has working quantum computers. Here the result of replacing ARI with h2com.exe

Here are the results that follow:


Using BWTQ DSC MTFQ RLEQ H2COM

BIB      CF         29,572  09-20-00 12:59p BIB.cf
BOOK1    CF        254,607  09-20-00 12:59p BOOK1.cf
BOOK2    CF        173,068  09-20-00 12:59p BOOK2.cf
GEO      CF         63,223  09-20-00 12:59p GEO.cf
NEWS     CF        126,947  09-20-00 12:59p NEWS.cf
OBJ1     CF         10,868  09-20-00 12:59p OBJ1.cf
OBJ2     CF         80,631  09-20-00 12:59p OBJ2.cf
PAPER1   CF         17,676  09-20-00 12:59p PAPER1.cf
PAPER2   CF         26,808  09-20-00 12:59p PAPER2.cf
PAPER3   CF         16,779  09-20-00 12:59p PAPER3.cf
PAPER4   CF          5,339  09-20-00 12:59p PAPER4.cf
PAPER5   CF          4,989  09-20-00 12:59p PAPER5.cf
PAPER6   CF         13,038  09-20-00 12:59p PAPER6.cf
PIC      CF         54,524  09-20-00 12:59p PIC.cf
PROGC    CF         13,338  09-20-00 12:59p PROGC.cf
PROGL    CF         16,839  09-20-00 12:59p PROGL.cf
PROGP    CF         11,476  09-20-00 12:59p PROGP.cf
TRANS    CF         18,832  09-20-00 12:59p TRANS.cf
        18 file(s)        938,554 bytes

Notice even though this is a better compression for files that will be encrypted. It only bets the Arithmetic coder in making a smaller file for the file PAPER4.CG1 That the ARI compresses better than a Huffman coder is not a surprise since if all other things are equal a huffman coder is a subset of an Arithmetic coder. I could stop here since my task of making a BWT type of compressor for files that are to be encrypted has ended. And it compresses smaller than the orignal one sugguest by Mark. However it does not give one much satisfaction that the ARI usually compresses these Corpus files to a smaller size. What is one to do. SEE NEXT SECTION


Changing last stage to a 1-1 Arithmetic coder

We are in luck there is a 1-1 arithmetic compress. Matt Timmermans bijective compress SEE MATTS PAGE HERE and get source code and discussions of what he did.

Here are the results of that mode:


Using BWTQ DSC MTFQ RLEQ BIACODE

BIB      CF         28,798  09-20-00  1:16p BIB.cf
BOOK1    CF        242,197  09-20-00  1:16p BOOK1.cf
BOOK2    CF        164,877  09-20-00  1:17p BOOK2.cf
GEO      CF         60,523  09-20-00  1:17p GEO.cf
NEWS     CF        122,989  09-20-00  1:17p NEWS.cf
OBJ1     CF         10,724  09-20-00  1:17p OBJ1.cf
OBJ2     CF         79,294  09-20-00  1:17p OBJ2.cf
PAPER1   CF         17,131  09-20-00  1:17p PAPER1.cf
PAPER2   CF         25,989  09-20-00  1:17p PAPER2.cf
PAPER3   CF         16,351  09-20-00  1:17p PAPER3.cf
PAPER4   CF          5,234  09-20-00  1:17p PAPER4.cf
PAPER5   CF          4,848  09-20-00  1:17p PAPER5.cf
PAPER6   CF         12,670  09-20-00  1:17p PAPER6.cf
PIC      CF         51,416  09-20-00  1:17p PIC.cf
PROGC    CF         13,025  09-20-00  1:17p PROGC.cf
PROGL    CF         16,463  09-20-00  1:17p PROGL.cf
PROGP    CF         11,233  09-20-00  1:17p PROGP.cf
TRANS    CF         18,652  09-20-00  1:17p TRANS.cf
        18 file(s)        902,414 bytes

Notice this seems to compress much better than ARI. I hoped it would since it is not carrying extra space for an unneeded EOF character that the old style is using. So this is a better compression set for files that are to be encrypted and it seems to be a better compressor than ARI all the way around. ARI could be fixed to eliminate the wasteful EOF that it carries but for now one can use Matt's if they want more secure compression for files that are to be encrypted.


Future Directions

It is clear that if one use a raw BWT that a DSC type of program needs to be used after it. However there are ways to make the required pointer smaller so that the pointer value covers less range. It involves minor changes to the BWT and UNBWT to allow more than the single cycle of data. I may do that for next project since it seems easy to do and any forced reduction of N by one bit means that an attacker would get twice as many passable false files when one decrypts.

It is clear that there are many possibilites for the MTF RLE combinations and that different types should be selected if one wishes different versions of the programs for special files. It is not possible to make one version compress all files. Since every real compressor increases most files.

It is also clear that the last stage should be a compressor that does not need a specail EOF character since not only does the EOF character waste space but makes the job of an attacker much easier.


Summary of Results

ALL RESULTS COMPRESSION OF CALGARY CORPUS FILE SET

FROM MARKS WEBPAGE
PKZIP 1,071,986
BWTCODE 978,122

FROM MY TESTS
RLE BWT MTF RLE ARI 978,129
RLE BWTQ DSC MTF RLE ARI 926,760
BWTQ DSC MTFQ RLEQ ARI 912,571

ONLY THE FOLLOWING TWO SHOULD BE USED IF ONE COMPRESSING THEN ENCRYPTING
BWTQ DSC MTFQ RLEQ H2COM 938,554
BWTQ DSC MTFQ RLEQ BIACODE 902,414 ** the best of all worlds **


Conclusions

One it is possible to modify a typical BWT compressor such as Mark's to make it closer to bijective. It is not bijective yet but will not weaken the effective encryption key used by more than N where the number of bytes in the orginal file is 2**N

This means that quantum computers will not find a unique solution which until now was not possible. If one was using the old BWT compression there would like be only one solution making it much easier to break.

This close to Bijectiveness was done by the use of a DSC routine which combines the pointer object to the BWT transform in a optimal way. Then all the later stages are made 1-1 with the last stage being a bijective arithmetic coder.

The file bibwt1.zip contains the PC executables and batch files to do the last 3 types of compressions and their inverses for those wishing to test it.


HOMPAGE