the code for BWTS is HERE
One way to explain it. Is to contrast it with the old BWT that
uses an index.
BANANAS is a clasic example in old way
you make a list of all possible rotations then sort
the list smallest to largest The BWT is the last character in each row
The last column is BNNSAAA
BANANAS
ANANASB
ANASBAN
ASBANAN
BANANAS
NANASBA
NASBANA
SBANANA
Since any rotation of the word BANANAS sorts to same thing your have to pass an index along with the sorted string making it longer so the output would be
BNNSAAA[index]
so that the file would actaully have an increase in length if you tack the index on.
BWTS for one roations gets exactuly the same string however there is no index
ANANASB the first row is the string that in BWTS matched by the raw BWT except no INDEX
For the next step look at UNBWT for the string BNNSAAA
sort it
add output sort add ouput sort ..REPEAT TILL
A BA AN BAN ANA ANANASB
A NA AN NAN ANA ANASBAN
A NA AS NAS ASB ASBANAN
B SB BA SBA BAN BANANAS
N AN NA ANA NAN NANASBA
N AN NA ANA NAS NASBANA
S AS SB ASB SBA SBANANA
This shows for many strings the forward and reverse transformations the
same in BWT UNBWT and BWTS and UNBWTS except you save index
know to show what happens in other cases.
When I do a BWTS of BANANAS I get SNNBAAA no index
I claim that is based on how how I get the UNBWTS to go to
BANANAS and then BWTS is just reverse of UNBWTS so explaining
UNBWTS does it all
For the next step look at UNBWT for the string SNNBAAA
sort it
add output sort add ouput sort ..REPEAT TILL
A SA AN SAN ANA (ANANAS)(ANANAS)...
A NA AN NAN ANA (ANASAN)(ANASAN)...
A NA AS NAS ASB (ASANAN)(ASANAN)
B BB BB BBB BBB (B)(B)(B)(B)(B)(B)(B)...
N AN NA ANA NAN (NANASA)(NANASA)...
N AN NA ANA NAS (NASANA)(NASANA)...
S AS SA ASA SAN SANANA)(SANANA)...
The main problem with BWT is that it adds unwanted information to file
when the BWT is done. The result is that the file is always expanded and
that an index must somehow be tacked on to the file.
The above fact leads to problems if one ever wants to someday
challenge the PPM style compressors which can be made bijective.
BTW style of compressors can't be truely bijective and that means
they will never in the long run compress as well as a good PPM
style compressor.
The main problem with BWT is that for any random file over
a hundred bytes has in 99% of cases no possible inverse regardless of
what index you might want to use.
I for one believe in compression before encryption. As quantum computers
become more in the main stream its foolish to have compression encryption
methods that carry the info in the Shanon sense to the breaking of themselves.
Better to use a bijective compressor where anything can be uncompressed.
There are several bijective compressors out there and the best simple bijective
compression encryption program I am aware of is
BICOM
Before BWTS the closest one could obtain a bijective BWT was doing a
standard full BWT followed by my DSC which not adds less bits than using
Universal Numbers but sometimes shortened the file. I did BWTS since DSC
still fell short of the goal of making a truely bijective BWT.
| program | original | marks | BWTS |
| BIB | 111,261 | 29,567 | 29,563 |
| BOOK1 | 768,771 | 275,831 | 275,779 |
| BOOK2 | 610,856 | 186,592 | 186,555 |
| GEO | 102,400 | 62,120 | 62,113 |
| NEWS | 377,109 | 134,174 | 134,166 |
| OBJ1 | 21,504 | 10,857 | 10,845 |
| OBJ2 | 246,814 | 81,948 | 81,930 |
| PAPER1 | 53,161 | 17,724 | 17,710 |
| PAPER2 | 82,199 | 26,956 | 26,938 |
| PAPER3 | 46,526 | 16,995 | 16,979 |
| PAPER4 | 13,286 | 5,529 | 5,515 |
| PAPER5 | 11,954 | 5,136 | 5,122 |
| PAPER6 | 38,105 | 13,159 | 13,147 |
| PIC | 513,216 | 50,829 | 50,815 |
| PROGC | 39,611 | 13,312 | 13,302 |
| PROGL | 71,646 | 16,688 | 16,682 |
| PROGP | 49,379 | 11,404 | 11,395 |
| TRANS | 93,695 | 19,301 | 19,298 |
| total = | 3,251,493 | 978,122 | 977,854 |
HERE is the dos run stream to create the files ..\rle %1. %1.1 ..\bwt %1.1 %1.2 ..\mtf %1.2 %1.3 ..\rle %1.3 %1.4 ..\ari %1.4 %1.m ..\unari %1.m %1.4m ..\unrle %1.4m %1.3m ..\unmtf %1.3m %1.2m ..\unbwt %1.2m %1.1m ..\unrle %1.1m %1.mm ..\bwts -b200000 %1.1 %1.5 ..\mtf %1.5 %1.6 ..\rle %1.6 %1.7 ..\ari %1.7 %1.s ..\unari %1.s %1.4s ..\unrle %1.4s %1.3s ..\unmtf %1.3s %1.2s ..\unbwts -b200000 %1.2s %1.1s ..\unrle %1.1s %1.ss fc /b %1. %1.mm >> arb.log fc /b %1. %1.ss >> arb.log
This is discusion is based on Mark' BWTCODE.ZIP the file and Mark's
Page is here
here is CALGORY with simple bijective stages
29,060 BIB
252,700 BOOK1
170,573 BOOK2
62,784 GEO
125,020 NEWS
10,677 OBJ1
79,508 OBJ2
17,319 PAPER1
26,447 PAPER2
16,544 PAPER3
5,236 PAPER4
4,848 PAPER5
12,735 PAPER6
53,781 PIC
13,020 PROGC
16,515 PROGL
11,259 PROGP
18,589 TRANS
18 File(s) 926,615 bytes
The bat file used above is:
g:bwts %1. %1.1
g:mtfq %1.1 %1.2
g:rleq %1.2 %1.3
g:arb255 %1.3 %1.4
g:unarb255 %1.4 %1.5
g:unrleq %1.5 %1.6
g:unmtfq %1.6 %1.7
g:unbwts %1.7 %1.8
fc /b %1. %1.8 >> arb.log
the code for MTFQ and etc...is HERE
taking the simple 'hello word" test file and uncompressing uing
unarb255
unrleq
unmtf
unbwts
0000 48 45 4C 4C 4F 20 57 4F 52 4C 44 21 . . . . *HELLO WORLD!*
number of bytes is 12
the result is this
0000 E2 FB C3 81 26 F6 E1 B8 F7 C0 FA FA A8 . . . *....&........*
number of bytes is 13
compressing gets original file back
here is CALGORY with simple bijective stages
I compare it with the raw length R
and Daneil A. Nagy's 2 state binary BWT coder N
I only have his results so it can not be
verifed It was done for a Phd at Queen's
University Ontario Canda. January 2006
The thesis was found on the net.
My bijective coder B
R N B
BIB 111,261 32,022 31,197
BOOK1 768,771 242,857 235,913
BOOK2 610,856 170,783 166,881
GEO 102,400 66,370 66,932
NEWS 377,109 135,444 131,944
OBJ1 21,504 12,727 12,640
OBJ2 246,814 98,395 94,565
PAPER1 53,161 19,816 18,931
PAPER2 82,199 28,084 27,242
PAPER3 46,526 18,124 17,511
PAPER4 13,286 6,047 5,920
PAPER5 11,954 5,815 5,670
PAPER6 38,105 14,786 14,282
PIC 513,216 59,131 52,406
PROGC 39,611 15,320 14,774
PROGL 71,646 18,101 17,916
PROGP 49,379 13,336 13,010
TRANS 93,695 22,864 22,356
TOTALS 3,251,493 980,022 950,090
For a comparision Marks old reference code for
the Calgary got 978,122 bytes.
I do not have the code for Nagy's BWT but he
does describe it. It seems a clasical nonbijective
BWT on a binary string. With Classical BWT on a
string you have to add information which prevents
it being bijective. The BWTS transform is totally
bijective no extra data so length of string does
not change.
This would make a good compressor to use
before encryption since it makes total use
of the file space.
If BWTS really is a bijective indexless BWT
then the compression should be close to a classical
BWT compression. Both he and I sought to try to
compress the Calgary Corpus as strings.
I think I have a poor mans RLE coder it was a
quick and dirty mode to ARB255 the classical
256 state bijective arithmetical coder. I think
there is lots of room for improvement in this
code to make a better RLE encoder.
the following is code from computer run
to test the binary 2 state BWTS to see
how close it matches a old style BWT 2 state
transformer
06/11/2008 11:24 AM 31,197 BIB.4
06/11/2008 11:24 AM 235,913 BOOK1.4
06/11/2008 11:25 AM 166,881 BOOK2.4
06/11/2008 11:25 AM 66,932 GEO.4
06/11/2008 11:25 AM 131,944 NEWS.4
06/11/2008 11:25 AM 12,640 OBJ1.4
06/11/2008 11:26 AM 94,565 OBJ2.4
06/11/2008 11:26 AM 18,931 PAPER1.4
06/11/2008 11:26 AM 27,242 PAPER2.4
06/11/2008 11:26 AM 17,511 PAPER3.4
06/11/2008 11:26 AM 5,920 PAPER4.4
06/11/2008 11:26 AM 5,670 PAPER5.4
06/11/2008 11:26 AM 14,282 PAPER6.4
06/11/2008 11:26 AM 52,406 PIC.4
06/11/2008 11:26 AM 14,774 PROGC.4
06/11/2008 11:26 AM 17,916 PROGL.4
06/11/2008 11:26 AM 13,010 PROGP.4
06/11/2008 11:26 AM 22,356 TRANS.4
18 File(s) 950,090 bytes
here is the main bat file to do the run
using a series of simple bijective stages
to compress and then uncompress
b012a01 %1. %1.0
bwts -B %1.0 %1.1
mtfq %1.1 %1.2
a012b01 %1.2 %1.3
arb2xx %1.3 %1.4
unarb2xx %1.4 %1.5
b012a01 %1.5 %1.6
unmtfq %1.6 %1.7
unbwts -B %1.7 %1.8
a012b01 %1.8 %1.9
fc /b %1. %1.9 >> arb.log
b012a01 converts any file into a string of
asci characters of ones and zeroes Note this
is bijective so string often not a mulitple of 8.
the code for a012b01 and b012a01 is HERE
bwts this famous BIJECTIVE BWT SCOTTIFED
transform on average will not create a file as
good as old BWT if you dropped its index. But
if one use old BWT you need the index that added
space makes the old BWT not quite as good as
the bijective BWTS for most files.
mtfq its my old MTF very simple most likely not the
best.
a012b01 Converts the string of Ascii zeroes and ones
back to binary file bijectively.
arb2xx the only new code a quick and dirty mod to
arb255 to make the bijective rle for the test of
seeing if BWTS may truely be good thing to use
instead of a nonbijective classical BWT
rest is just reverse of above.
Hey it works on my machine it may not yours
use in a directory by itself.
the code for ARB2XX and etc...is HERE
David A. Scott, biject.bwts@gmail.com