the code for BWTS is HERE
One way to explain it. Is to contrast it with the old BWT that
uses an index.
BANANAS is a clasic example in old way
you make a list of all possible rotations then sort the list smallest to largest The BWT is the last character in each row
The last column is BNNSAAA
Since any rotation of the word BANANAS sorts to same thing your have to pass an index along with the sorted string making it longer so the output would be
so that the file would actaully have an increase in length if you tack the index on.
BWTS for one roations gets exactuly the same string however there is no index
ANANASB the first row is the string that in BWTS matched by the raw BWT except no INDEX
For the next step look at UNBWT for the string BNNSAAA
add output sort add ouput sort ..REPEAT TILL A BA AN BAN ANA ANANASB A NA AN NAN ANA ANASBAN A NA AS NAS ASB ASBANAN B SB BA SBA BAN BANANAS N AN NA ANA NAN NANASBA N AN NA ANA NAS NASBANA S AS SB ASB SBA SBANANA
This shows for many strings the forward and reverse transformations the
same in BWT UNBWT and BWTS and UNBWTS except you save index
know to show what happens in other cases.
When I do a BWTS of BANANAS I get SNNBAAA no index
I claim that is based on how how I get the UNBWTS to go to BANANAS and then BWTS is just reverse of UNBWTS so explaining UNBWTS does it all
For the next step look at UNBWT for the string SNNBAAA
add output sort add ouput sort ..REPEAT TILL A SA AN SAN ANA (ANANAS)(ANANAS)... A NA AN NAN ANA (ANASAN)(ANASAN)... A NA AS NAS ASB (ASANAN)(ASANAN) B BB BB BBB BBB (B)(B)(B)(B)(B)(B)(B)... N AN NA ANA NAN (NANASA)(NANASA)... N AN NA ANA NAS (NASANA)(NASANA)... S AS SA ASA SAN SANANA)(SANANA)...
The main problem with BWT is that it adds unwanted information to file
when the BWT is done. The result is that the file is always expanded and
that an index must somehow be tacked on to the file.
The above fact leads to problems if one ever wants to someday
challenge the PPM style compressors which can be made bijective.
BTW style of compressors can't be truely bijective and that means
they will never in the long run compress as well as a good PPM
The main problem with BWT is that for any random file over
a hundred bytes has in 99% of cases no possible inverse regardless of
what index you might want to use.
I for one believe in compression before encryption. As quantum computers
become more in the main stream its foolish to have compression encryption
methods that carry the info in the Shanon sense to the breaking of themselves.
Better to use a bijective compressor where anything can be uncompressed.
There are several bijective compressors out there and the best simple bijective
compression encryption program I am aware of is
Before BWTS the closest one could obtain a bijective BWT was doing a standard full BWT followed by my DSC which not adds less bits than using Universal Numbers but sometimes shortened the file. I did BWTS since DSC still fell short of the goal of making a truely bijective BWT.
HERE is the dos run stream to create the files ..\rle %1. %1.1 ..\bwt %1.1 %1.2 ..\mtf %1.2 %1.3 ..\rle %1.3 %1.4 ..\ari %1.4 %1.m ..\unari %1.m %1.4m ..\unrle %1.4m %1.3m ..\unmtf %1.3m %1.2m ..\unbwt %1.2m %1.1m ..\unrle %1.1m %1.mm ..\bwts -b200000 %1.1 %1.5 ..\mtf %1.5 %1.6 ..\rle %1.6 %1.7 ..\ari %1.7 %1.s ..\unari %1.s %1.4s ..\unrle %1.4s %1.3s ..\unmtf %1.3s %1.2s ..\unbwts -b200000 %1.2s %1.1s ..\unrle %1.1s %1.ss fc /b %1. %1.mm >> arb.log fc /b %1. %1.ss >> arb.log
This is discusion is based on Mark' BWTCODE.ZIP the file and Mark's Page is here
here is CALGORY with simple bijective stages 29,060 BIB 252,700 BOOK1 170,573 BOOK2 62,784 GEO 125,020 NEWS 10,677 OBJ1 79,508 OBJ2 17,319 PAPER1 26,447 PAPER2 16,544 PAPER3 5,236 PAPER4 4,848 PAPER5 12,735 PAPER6 53,781 PIC 13,020 PROGC 16,515 PROGL 11,259 PROGP 18,589 TRANS 18 File(s) 926,615 bytes The bat file used above is: g:bwts %1. %1.1 g:mtfq %1.1 %1.2 g:rleq %1.2 %1.3 g:arb255 %1.3 %1.4 g:unarb255 %1.4 %1.5 g:unrleq %1.5 %1.6 g:unmtfq %1.6 %1.7 g:unbwts %1.7 %1.8 fc /b %1. %1.8 >> arb.log the code for MTFQ and etc...is HERE
taking the simple 'hello word" test file and uncompressing uing unarb255 unrleq unmtf unbwts 0000 48 45 4C 4C 4F 20 57 4F 52 4C 44 21 . . . . *HELLO WORLD!* number of bytes is 12 the result is this 0000 E2 FB C3 81 26 F6 E1 B8 F7 C0 FA FA A8 . . . *....&........* number of bytes is 13 compressing gets original file back
here is CALGORY with simple bijective stages I compare it with the raw length R and Daneil A. Nagy's 2 state binary BWT coder N I only have his results so it can not be verifed It was done for a Phd at Queen's University Ontario Canda. January 2006 The thesis was found on the net. My bijective coder B R N B BIB 111,261 32,022 31,197 BOOK1 768,771 242,857 235,913 BOOK2 610,856 170,783 166,881 GEO 102,400 66,370 66,932 NEWS 377,109 135,444 131,944 OBJ1 21,504 12,727 12,640 OBJ2 246,814 98,395 94,565 PAPER1 53,161 19,816 18,931 PAPER2 82,199 28,084 27,242 PAPER3 46,526 18,124 17,511 PAPER4 13,286 6,047 5,920 PAPER5 11,954 5,815 5,670 PAPER6 38,105 14,786 14,282 PIC 513,216 59,131 52,406 PROGC 39,611 15,320 14,774 PROGL 71,646 18,101 17,916 PROGP 49,379 13,336 13,010 TRANS 93,695 22,864 22,356 TOTALS 3,251,493 980,022 950,090 For a comparision Marks old reference code for the Calgary got 978,122 bytes. I do not have the code for Nagy's BWT but he does describe it. It seems a clasical nonbijective BWT on a binary string. With Classical BWT on a string you have to add information which prevents it being bijective. The BWTS transform is totally bijective no extra data so length of string does not change. This would make a good compressor to use before encryption since it makes total use of the file space. If BWTS really is a bijective indexless BWT then the compression should be close to a classical BWT compression. Both he and I sought to try to compress the Calgary Corpus as strings. I think I have a poor mans RLE coder it was a quick and dirty mode to ARB255 the classical 256 state bijective arithmetical coder. I think there is lots of room for improvement in this code to make a better RLE encoder. the following is code from computer run to test the binary 2 state BWTS to see how close it matches a old style BWT 2 state transformer 06/11/2008 11:24 AM 31,197 BIB.4 06/11/2008 11:24 AM 235,913 BOOK1.4 06/11/2008 11:25 AM 166,881 BOOK2.4 06/11/2008 11:25 AM 66,932 GEO.4 06/11/2008 11:25 AM 131,944 NEWS.4 06/11/2008 11:25 AM 12,640 OBJ1.4 06/11/2008 11:26 AM 94,565 OBJ2.4 06/11/2008 11:26 AM 18,931 PAPER1.4 06/11/2008 11:26 AM 27,242 PAPER2.4 06/11/2008 11:26 AM 17,511 PAPER3.4 06/11/2008 11:26 AM 5,920 PAPER4.4 06/11/2008 11:26 AM 5,670 PAPER5.4 06/11/2008 11:26 AM 14,282 PAPER6.4 06/11/2008 11:26 AM 52,406 PIC.4 06/11/2008 11:26 AM 14,774 PROGC.4 06/11/2008 11:26 AM 17,916 PROGL.4 06/11/2008 11:26 AM 13,010 PROGP.4 06/11/2008 11:26 AM 22,356 TRANS.4 18 File(s) 950,090 bytes here is the main bat file to do the run using a series of simple bijective stages to compress and then uncompress b012a01 %1. %1.0 bwts -B %1.0 %1.1 mtfq %1.1 %1.2 a012b01 %1.2 %1.3 arb2xx %1.3 %1.4 unarb2xx %1.4 %1.5 b012a01 %1.5 %1.6 unmtfq %1.6 %1.7 unbwts -B %1.7 %1.8 a012b01 %1.8 %1.9 fc /b %1. %1.9 >> arb.log b012a01 converts any file into a string of asci characters of ones and zeroes Note this is bijective so string often not a mulitple of 8. the code for a012b01 and b012a01 is HERE
bwts this famous BIJECTIVE BWT SCOTTIFED transform on average will not create a file as good as old BWT if you dropped its index. But if one use old BWT you need the index that added space makes the old BWT not quite as good as the bijective BWTS for most files. mtfq its my old MTF very simple most likely not the best. a012b01 Converts the string of Ascii zeroes and ones back to binary file bijectively. arb2xx the only new code a quick and dirty mod to arb255 to make the bijective rle for the test of seeing if BWTS may truely be good thing to use instead of a nonbijective classical BWT rest is just reverse of above. Hey it works on my machine it may not yours use in a directory by itself. the code for ARB2XX and etc...is HERE
David A. Scott, firstname.lastname@example.org