Definition

The Burrows-Wheeler transform, usually called BWT, is a data compression technique. It was invented by Michael Burrows and David Wheeler and was published in 1994, following earlier work by Wheeler in 1983. Strictly speaking it is not a compression algorithm, because it does not reduce the size of the data; on the contrary, it adds a little (see below). It is instead a method of reorganizing the data that increases the probability that identical characters initially far apart end up side by side. The technique is not widely used on its own, but it is notably part of the bzip2 format, which is one of the formats offering the highest compression ratios.

Operation

As noted above, the Burrows-Wheeler transform does not compress the data; it only reorganizes them so that a later compression stage achieves a better compression ratio.

First, the character string to be encoded is copied into a square table, shifting the string by one character to the right on each new row. The rows are then sorted alphabetically. Because of the shifting, the last letter of each row immediately precedes the first letter of the same row in the original string, except in the row that holds the original string itself, whose position is recorded. Moreover, since the rows are sorted alphabetically, the first column of the table can be recovered from the last column simply by sorting it.

Let us take an example. Suppose that the string to be encoded is “TEXTE”. The table of shifted strings is built first.

 POSITION  STRING
 1         T E X T E
 2         E T E X T
 3         T E T E X
 4         X T E T E
 5         E X T E T
These strings are then sorted alphabetically (the number in parentheses is the position of the row in the previous table):
 POSITION  STRING
           0 1 2 3 4
 1 (2)     E T E X T
 2 (5)     E X T E T
 3 (3)     T E T E X
 4 (1)     T E X T E
 5 (4)     X T E T E
The encoded text is the last column preceded by the position of the original string in the sorted table, that is to say: “4TTXEE”. This position, here 4, must be kept for decompression.
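The construction above can be sketched in a few lines of Python. This is only an illustrative sketch that builds the full table exactly as described; the function name `bwt_encode` and the 1-based position are choices made here to match the example, not part of any particular library.

```python
def bwt_encode(text):
    """Return (position, last_column) using the rotation table described above."""
    n = len(text)
    # Row k is the text shifted k characters to the right (row 0 is the text itself).
    rotations = [text[n - k:] + text[:n - k] for k in range(n)]
    # Sort the rows alphabetically.
    rotations.sort()
    # 1-based position of the original string in the sorted table.
    position = rotations.index(text) + 1
    # The encoded text is the last character of every sorted row.
    last_column = "".join(row[-1] for row in rotations)
    return position, last_column


print(bwt_encode("TEXTE"))  # (4, 'TTXEE')
```

Building every rotation explicitly takes memory proportional to the square of the input length, so this form is only suitable for short strings such as the example.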

This transformation brings no immediate compression gain; on the contrary, extra information must be transmitted for decoding. However, Burrows and Wheeler recommend following it with a move-to-front (MTF) encoding. A string that contains many repeated characters then yields many zeros, which lets an algorithm such as Huffman coding reach a high compression ratio.
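To illustrate why repeated characters turn into zeros, here is a minimal move-to-front sketch. The function name `mtf_encode` and the choice of initial alphabet (the distinct characters of the example, in alphabetical order) are assumptions made for this illustration.

```python
def mtf_encode(data, alphabet):
    """Replace each character by its current rank, then move it to the front."""
    symbols = list(alphabet)
    output = []
    for ch in data:
        rank = symbols.index(ch)   # current position of the character in the list
        output.append(rank)
        # Move the character to the front, so an immediate repeat is coded as 0.
        symbols.insert(0, symbols.pop(rank))
    return output


print(mtf_encode("TTXEE", "ETX"))  # [1, 0, 2, 2, 0]
```

Each character that repeats the one just before it is encoded as 0, which is what makes the output favorable to a subsequent entropy coder.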

During decompression, the encoded string is sorted alphabetically (we take the preceding example again, this time in the direction of decompression):

           1 2 3 4 5
 Encoded   T T X E E
 Sorted    E E T T X
This is where the transmitted number (4) is used. At this index the two characters do not follow one another in the original string, and the character in the sorted row is the first character of the original string.

We therefore start from the “T” in position 4. This “T” is the second “T” of the sorted row, so we look for the second “T” of the encoded row, which is in position 2. In the sorted row, position 2 holds an “E”, so this “T” is followed by an “E”. This “E” is the second “E” of the sorted row, so we look for the second “E” of the encoded row and arrive at position 5. This “E” is followed by an “X”, and so on. We continue in this way until we reach the “E” in position 4 of the encoded row. Decompression is then finished, and we recover our initial data, namely the string “TEXTE”.
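The walk described above can be sketched as follows. The function name `bwt_decode` is hypothetical; the sketch relies on the fact that Python's `sorted` is stable, so the k-th occurrence of a character in the sorted row is matched with its k-th occurrence in the encoded row.

```python
def bwt_decode(position, last_column):
    """Rebuild the original string from the 1-based position and the encoded row."""
    n = len(last_column)
    # order[i] is the index, in the encoded row, of the character that ends up
    # at index i of the sorted row (ties keep their original order).
    order = sorted(range(n), key=lambda i: last_column[i])
    output = []
    pos = position - 1  # switch to 0-based indexing
    for _ in range(n):
        # Character of the sorted row at the current position.
        output.append(last_column[order[pos]])
        # Jump to the matching occurrence of that character in the encoded row.
        pos = order[pos]
    return "".join(output)


print(bwt_decode(4, "TTXEE"))  # TEXTE
```

With the example, the positions visited are 4, 2, 5, 3 and 1 (in the 1-based numbering of the table), which is the same path as in the explanation.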

See also

Move-to-front (MTF)

References

  1. Michael Burrows, D.J. Wheeler: "A block-sorting lossless data compression algorithm", 10 May 1994, Digital SRC Research Report 124.

© 2007-2008 speedlook.com; article text available under the terms of GFDL, from fr.wikipedia.org