David's View of BWTS (Burrows Wheeler Transform Scottifed)


last update 2008 jan 07

This page looks at a modified BWT compression package and suggests that the mods make for a better transform than the standard BWT. I called this bijective version a BWT that has been Scottified. It is what the BWT should have been. If one looks at only the transformed data without the index the old original BWT most likely leads to a better compression. But when you consider the need for an index to be passed along the BWTS transform will lead on the average to better compression.

the code for BWTS is HERE

Table of Contents


How does it work

One way to explain it. Is to contrast it with the old BWT that uses an index.
BANANAS is a clasic example in old way
you make a list of all possible rotations then sort the list smallest to largest The BWT is the last character in each row
The last column is BNNSAAA

BANANAS

ANANASB
ANASBAN
ASBANAN
BANANAS
NANASBA
NASBANA
SBANANA

Since any rotation of the word BANANAS sorts to same thing your have to pass an index along with the sorted string making it longer so the output would be

BNNSAAA[index]
so that the file would actaully have an increase in length if you tack the index on.

BWTS for one roations gets exactuly the same string however there is no index
ANANASB the first row is the string that in BWTS matched by the raw BWT except no INDEX

For the next step look at UNBWT for the string BNNSAAA
sort it


    add output	sort  add ouput	sort ..REPEAT TILL
A	BA	AN	BAN	ANA	ANANASB
A	NA	AN	NAN	ANA	ANASBAN
A	NA	AS	NAS	ASB	ASBANAN
B	SB	BA	SBA	BAN	BANANAS
N	AN	NA	ANA	NAN	NANASBA
N	AN	NA	ANA	NAS	NASBANA
S	AS	SB	ASB	SBA	SBANANA

You know have same table you storted with so with old way and index
you get BANANAS with new way no index you get ANANASB

This shows for many strings the forward and reverse transformations the same in BWT UNBWT and BWTS and UNBWTS except you save index know to show what happens in other cases.
When I do a BWTS of BANANAS I get SNNBAAA no index
I claim that is based on how how I get the UNBWTS to go to BANANAS and then BWTS is just reverse of UNBWTS so explaining UNBWTS does it all

For the next step look at UNBWT for the string SNNBAAA
sort it


    add output	sort  add ouput	sort ..REPEAT TILL
A	SA	AN	SAN	ANA	(ANANAS)(ANANAS)...
A	NA	AN	NAN	ANA	(ANASAN)(ANASAN)...
A	NA	AS	NAS	ASB	(ASANAN)(ASANAN)
B	BB	BB	BBB	BBB	(B)(B)(B)(B)(B)(B)(B)...
N	AN	NA	ANA	NAN	(NANASA)(NANASA)...
N	AN	NA	ANA	NAS	(NASANA)(NASANA)...
S	AS	SA	ASA	SAN	SANANA)(SANANA)...

Note last charaacter in each cycle maps to the output SNNBAAA the BWTS output
When you create the BWT out you sort to get exaxt same rows above. Note in this case there are 2 cycles the cycle seen last is written first (B) the next to last cycle is (ANANAS) and so on if more till finally you get
BANANAS its as if the UNBWTS is what UNBWT should have been.
Thats all folks


Why is there a problem with BWT and what is BWTS

The main problem with BWT is that it adds unwanted information to file when the BWT is done. The result is that the file is always expanded and that an index must somehow be tacked on to the file. The above fact leads to problems if one ever wants to someday challenge the PPM style compressors which can be made bijective. BTW style of compressors can't be truely bijective and that means they will never in the long run compress as well as a good PPM style compressor. The main problem with BWT is that for any random file over a hundred bytes has in 99% of cases no possible inverse regardless of what index you might want to use. I for one believe in compression before encryption. As quantum computers become more in the main stream its foolish to have compression encryption methods that carry the info in the Shanon sense to the breaking of themselves. Better to use a bijective compressor where anything can be uncompressed. There are several bijective compressors out there and the best simple bijective compression encryption program I am aware of is BICOM
Before BWTS the closest one could obtain a bijective BWT was doing a standard full BWT followed by my DSC which not adds less bits than using Universal Numbers but sometimes shortened the file. I did BWTS since DSC still fell short of the goal of making a truely bijective BWT.


Why BWTS is better

BWTS is better for several reasons:
1. It is "truely bijective" which means any file can be thought of as a sorted file or unsorted file.
2. It gives one the chance to someday beating or tieing the best PPM compressors.
3. It allows for great compression before encryption


Cut the hype compare to old BWT

I compared it the the standard Marks code of the Calagary test files.
Yes this is old code even doing a RLE before the BWT.
Following the table is the bat file that generated the table. Notice the only difference between the runs is that where Mark used old nonbijective BWT I used the truely bijective BWTS. This compression is not bijective since the other stages Mark used are not bijective. But this is a first test of BWTS

MARKS BWT versus drop in BWTS
program original marks BWTS
BIB 111,261 29,567 29,563
BOOK1 768,771 275,831 275,779
BOOK2 610,856 186,592 186,555
GEO 102,400 62,120 62,113
NEWS 377,109 134,174 134,166
OBJ1 21,504 10,857 10,845
OBJ2 246,814 81,948 81,930
PAPER1 53,161 17,724 17,710
PAPER2 82,199 26,956 26,938
PAPER3 46,526 16,995 16,979
PAPER4 13,286 5,529 5,515
PAPER5 11,954 5,136 5,122
PAPER6 38,105 13,159 13,147
PIC 513,216 50,829 50,815
PROGC 39,611 13,312 13,302
PROGL 71,646 16,688 16,682
PROGP 49,379 11,404 11,395
TRANS 93,695 19,301 19,298
total = 3,251,493 978,122 977,854

HERE is the dos run stream to create the files
..\rle %1. %1.1
..\bwt %1.1 %1.2
..\mtf %1.2 %1.3
..\rle %1.3 %1.4
..\ari %1.4 %1.m
..\unari %1.m %1.4m
..\unrle %1.4m %1.3m
..\unmtf %1.3m %1.2m
..\unbwt %1.2m %1.1m
..\unrle %1.1m %1.mm
..\bwts -b200000 %1.1 %1.5
..\mtf %1.5 %1.6
..\rle %1.6 %1.7
..\ari %1.7 %1.s
..\unari %1.s %1.4s
..\unrle %1.4s %1.3s
..\unmtf %1.3s %1.2s
..\unbwts  -b200000 %1.2s %1.1s
..\unrle %1.1s %1.ss


fc /b %1. %1.mm >> arb.log
fc /b %1. %1.ss >> arb.log



This is discusion is based on Mark' BWTCODE.ZIP the file and Mark's Page is here


Do you have a truely bijective BWT compression program

Yes I do I have several But to leave off with a simple easy to follow here is a simple one that uses old off the shelf components, The one list only scratches the surface.

here is CALGORY with simple bijective stages
            29,060 BIB
          252,700 BOOK1
          170,573 BOOK2
            62,784 GEO
           125,020 NEWS
            10,677 OBJ1
            79,508 OBJ2
            17,319 PAPER1
            26,447 PAPER2
            16,544 PAPER3
             5,236 PAPER4
             4,848 PAPER5
            12,735 PAPER6
            53,781 PIC
            13,020 PROGC
            16,515 PROGL
            11,259 PROGP
            18,589 TRANS
18 File(s)        926,615 bytes
             

The bat file used above is:

g:bwts %1. %1.1
g:mtfq %1.1 %1.2
g:rleq %1.2 %1.3
g:arb255 %1.3 %1.4
g:unarb255 %1.4 %1.5
g:unrleq %1.5 %1.6
g:unmtfq %1.6 %1.7
g:unbwts %1.7 %1.8


fc /b %1. %1.8 >> arb.log

the code for MTFQ and etc...is HERE 

taking the simple 'hello word" test file and uncompressing uing unarb255 unrleq unmtf unbwts 0000 48 45 4C 4C 4F 20 57 4F 52 4C 44 21 . . . . *HELLO WORLD!* number of bytes is 12 the result is this 0000 E2 FB C3 81 26 F6 E1 B8 F7 C0 FA FA A8 . . . *....&........* number of bytes is 13 compressing gets original file back

MORE LATER UNDER CONSTRUCTION

David A. Scott, biject.bwts@gmail.com

HOMPAGE