David's View of BWTS (Burrows Wheeler Transform Scottifed)

last update 2010 jan 17

This page looks at a modified BWT compression package and suggests that the mods make for a better transform than the standard BWT. I called this bijective version a BWT that has been Scottified. It is what the BWT should have been. If one looks at only the transformed data without the index the old original BWT most likely leads to a better compression. But when you consider the need for an index to be passed along the BWTS transform will lead on the average to better compression.

the code for BWTS is HERE

Table of Contents

How does it work

One way to explain it. Is to contrast it with the old BWT that uses an index.
BANANAS is a clasic example in old way
you make a list of all possible rotations then sort the list smallest to largest The BWT is the last character in each row
The last column is BNNSAAA



Since any rotation of the word BANANAS sorts to same thing your have to pass an index along with the sorted string making it longer so the output would be

so that the file would actaully have an increase in length if you tack the index on.

BWTS for one roations gets exactuly the same string however there is no index
ANANASB the first row is the string that in BWTS matched by the raw BWT except no INDEX

For the next step look at UNBWT for the string BNNSAAA
sort it

    add output	sort  add ouput	sort ..REPEAT TILL

You know have same table you storted with so with old way and index
you get BANANAS with new way no index you get ANANASB

This shows for many strings the forward and reverse transformations the same in BWT UNBWT and BWTS and UNBWTS except you save index know to show what happens in other cases.
When I do a BWTS of BANANAS I get SNNBAAA no index
I claim that is based on how how I get the UNBWTS to go to BANANAS and then BWTS is just reverse of UNBWTS so explaining UNBWTS does it all

For the next step look at UNBWT for the string SNNBAAA
sort it

    add output	sort  add ouput	sort ..REPEAT TILL
B	BB	BB	BBB	BBB	(B)(B)(B)(B)(B)(B)(B)...

Note last charaacter in each cycle maps to the output SNNBAAA the BWTS output
When you create the BWT out you sort to get exaxt same rows above. Note in this case there are 2 cycles the cycle seen last is written first (B) the next to last cycle is (ANANAS) and so on if more till finally you get
BANANAS its as if the UNBWTS is what UNBWT should have been.
Thats all folks

Why is there a problem with BWT and what is BWTS

The main problem with BWT is that it adds unwanted information to file when the BWT is done. The result is that the file is always expanded and that an index must somehow be tacked on to the file. The above fact leads to problems if one ever wants to someday challenge the PPM style compressors which can be made bijective. BTW style of compressors can't be truely bijective and that means they will never in the long run compress as well as a good PPM style compressor. The main problem with BWT is that for any random file over a hundred bytes has in 99% of cases no possible inverse regardless of what index you might want to use. I for one believe in compression before encryption. As quantum computers become more in the main stream its foolish to have compression encryption methods that carry the info in the Shanon sense to the breaking of themselves. Better to use a bijective compressor where anything can be uncompressed. There are several bijective compressors out there and the best simple bijective compression encryption program I am aware of is BICOM
Before BWTS the closest one could obtain a bijective BWT was doing a standard full BWT followed by my DSC which not adds less bits than using Universal Numbers but sometimes shortened the file. I did BWTS since DSC still fell short of the goal of making a truely bijective BWT.

Why BWTS is better

BWTS is better for several reasons:
1. It is "truely bijective" which means any file can be thought of as a sorted file or unsorted file.
2. It gives one the chance to someday beating or tieing the best PPM compressors.
3. It allows for great compression before encryption

Cut the hype compare to old BWT

I compared it the the standard Marks code of the Calagary test files.
Yes this is old code even doing a RLE before the BWT.
Following the table is the bat file that generated the table. Notice the only difference between the runs is that where Mark used old nonbijective BWT I used the truely bijective BWTS. This compression is not bijective since the other stages Mark used are not bijective. But this is a first test of BWTS

MARKS BWT versus drop in BWTS
program original marks BWTS
BIB 111,261 29,567 29,563
BOOK1 768,771 275,831 275,779
BOOK2 610,856 186,592 186,555
GEO 102,400 62,120 62,113
NEWS 377,109 134,174 134,166
OBJ1 21,504 10,857 10,845
OBJ2 246,814 81,948 81,930
PAPER1 53,161 17,724 17,710
PAPER2 82,199 26,956 26,938
PAPER3 46,526 16,995 16,979
PAPER4 13,286 5,529 5,515
PAPER5 11,954 5,136 5,122
PAPER6 38,105 13,159 13,147
PIC 513,216 50,829 50,815
PROGC 39,611 13,312 13,302
PROGL 71,646 16,688 16,682
PROGP 49,379 11,404 11,395
TRANS 93,695 19,301 19,298
total = 3,251,493 978,122 977,854

HERE is the dos run stream to create the files
..\rle %1. %1.1
..\bwt %1.1 %1.2
..\mtf %1.2 %1.3
..\rle %1.3 %1.4
..\ari %1.4 %1.m
..\unari %1.m %1.4m
..\unrle %1.4m %1.3m
..\unmtf %1.3m %1.2m
..\unbwt %1.2m %1.1m
..\unrle %1.1m %1.mm
..\bwts -b200000 %1.1 %1.5
..\mtf %1.5 %1.6
..\rle %1.6 %1.7
..\ari %1.7 %1.s
..\unari %1.s %1.4s
..\unrle %1.4s %1.3s
..\unmtf %1.3s %1.2s
..\unbwts  -b200000 %1.2s %1.1s
..\unrle %1.1s %1.ss

fc /b %1. %1.mm >> arb.log
fc /b %1. %1.ss >> arb.log

This is discusion is based on Mark' BWTCODE.ZIP the file and Mark's Page is here

Do you have a truely bijective BWT compression program

Yes I do I have several But to leave off with a simple easy to follow here is a simple one that uses old off the shelf components, The one list only scratches the surface.

here is CALGORY with simple bijective stages
            29,060 BIB
          252,700 BOOK1
          170,573 BOOK2
            62,784 GEO
           125,020 NEWS
            10,677 OBJ1
            79,508 OBJ2
            17,319 PAPER1
            26,447 PAPER2
            16,544 PAPER3
             5,236 PAPER4
             4,848 PAPER5
            12,735 PAPER6
            53,781 PIC
            13,020 PROGC
            16,515 PROGL
            11,259 PROGP
            18,589 TRANS
18 File(s)        926,615 bytes

The bat file used above is:

g:bwts %1. %1.1
g:mtfq %1.1 %1.2
g:rleq %1.2 %1.3
g:arb255 %1.3 %1.4
g:unarb255 %1.4 %1.5
g:unrleq %1.5 %1.6
g:unmtfq %1.6 %1.7
g:unbwts %1.7 %1.8

fc /b %1. %1.8 >> arb.log

the code for MTFQ and etc...is HERE 

taking the simple 'hello word" test file and uncompressing uing unarb255 unrleq unmtf unbwts 0000 48 45 4C 4C 4F 20 57 4F 52 4C 44 21 . . . . *HELLO WORLD!* number of bytes is 12 the result is this 0000 E2 FB C3 81 26 F6 E1 B8 F7 C0 FA FA A8 . . . *....&........* number of bytes is 13 compressing gets original file back

  • Do you have a truely bijective BWTcompression program
  • Do you have a truely bijective binary( 2 states) BWTcompression program

    Do you have a truely bijective binary( 2 states) BWTcompression program

    Yes I do I have several But to leave off with a simple easy to follow here is a simple one that uses old off the shelf components, except for arb2xx, The one list only scratches the surface.

    here is CALGORY with simple bijective stages
    I compare it with the raw length R
    and Daneil A. Nagy's 2 state binary BWT coder N
     I only have his results so it can not be
     verifed It was done for a Phd at Queen's
     University Ontario Canda. January 2006
     The thesis was found on the net.
    My bijective coder B
                R       N       B
    BIB       111,261  32,022  31,197
    BOOK1     768,771 242,857 235,913
    BOOK2     610,856 170,783 166,881
    GEO       102,400  66,370  66,932
    NEWS      377,109 135,444 131,944
    OBJ1       21,504  12,727  12,640
    OBJ2      246,814  98,395  94,565
    PAPER1     53,161  19,816  18,931
    PAPER2     82,199  28,084  27,242 
    PAPER3     46,526  18,124  17,511
    PAPER4     13,286   6,047   5,920
    PAPER5     11,954   5,815   5,670
    PAPER6     38,105  14,786  14,282
    PIC       513,216  59,131  52,406
    PROGC      39,611  15,320  14,774
    PROGL      71,646  18,101  17,916
    PROGP      49,379  13,336  13,010
    TRANS      93,695  22,864  22,356
    TOTALS  3,251,493 980,022 950,090
    For a comparision Marks old reference code for
    the Calgary got 978,122  bytes.
     I do not have the code for  Nagy's BWT but he
    does describe it. It seems a clasical nonbijective
    BWT on a binary string. With Classical BWT on a
    string you have to add information which prevents
    it being bijective. The BWTS transform is totally
    bijective no extra data so length of string does
    not change.
     This would make a good compressor to use
    before encryption since it makes total use
    of the file space. 
      If BWTS really is a bijective indexless BWT
    then the compression should be close to a classical
    BWT compression. Both he and I sought to try to
    compress the Calgary Corpus as strings.
      I think I have a poor mans RLE coder  it was a
    quick and dirty mode to ARB255 the classical
    256 state bijective arithmetical coder. I think
    there is lots of room for improvement in this 
    code to make a better RLE encoder.
     the following is code from computer run
    to test the binary 2 state BWTS to see
    how close it matches a old style BWT 2 state
    06/11/2008  11:24 AM            31,197 BIB.4
    06/11/2008  11:24 AM           235,913 BOOK1.4
    06/11/2008  11:25 AM           166,881 BOOK2.4
    06/11/2008  11:25 AM            66,932 GEO.4
    06/11/2008  11:25 AM           131,944 NEWS.4
    06/11/2008  11:25 AM            12,640 OBJ1.4
    06/11/2008  11:26 AM            94,565 OBJ2.4
    06/11/2008  11:26 AM            18,931 PAPER1.4
    06/11/2008  11:26 AM            27,242 PAPER2.4
    06/11/2008  11:26 AM            17,511 PAPER3.4
    06/11/2008  11:26 AM             5,920 PAPER4.4
    06/11/2008  11:26 AM             5,670 PAPER5.4
    06/11/2008  11:26 AM            14,282 PAPER6.4
    06/11/2008  11:26 AM            52,406 PIC.4
    06/11/2008  11:26 AM            14,774 PROGC.4
    06/11/2008  11:26 AM            17,916 PROGL.4
    06/11/2008  11:26 AM            13,010 PROGP.4
    06/11/2008  11:26 AM            22,356 TRANS.4
                  18 File(s)        950,090 bytes
    here is the main bat file to do the run
    using a series of simple bijective stages
    to compress and then uncompress
    b012a01 %1. %1.0
    bwts -B %1.0 %1.1
    mtfq %1.1 %1.2
    a012b01 %1.2 %1.3
    arb2xx %1.3 %1.4
    unarb2xx %1.4 %1.5
    b012a01 %1.5 %1.6
    unmtfq %1.6 %1.7
    unbwts -B %1.7 %1.8
    a012b01 %1.8 %1.9
    fc /b %1.  %1.9 >> arb.log
    b012a01 converts any file into a string of 
     asci characters of ones and zeroes Note this
     is bijective so string often not a mulitple of 8.
    the code for a012b01 and b012a01 is HERE 

    bwts this famous BIJECTIVE BWT SCOTTIFED transform on average will not create a file as good as old BWT if you dropped its index. But if one use old BWT you need the index that added space makes the old BWT not quite as good as the bijective BWTS for most files. mtfq its my old MTF very simple most likely not the best. a012b01 Converts the string of Ascii zeroes and ones back to binary file bijectively. arb2xx the only new code a quick and dirty mod to arb255 to make the bijective rle for the test of seeing if BWTS may truely be good thing to use instead of a nonbijective classical BWT rest is just reverse of above. Hey it works on my machine it may not yours use in a directory by itself. the code for ARB2XX and etc...is HERE


    David A. Scott, biject.bwts@gmail.com