Coding

Burrow-Wheeler coding

Video from Google describing the BWC

General idea:

Sorting a text file –> better compression.
Problem: sorting not reversible –> decoding impossible.
Solution: Reversible clustering: B-W transform.

Principle

Three steps encoding:
- B-W block-transformation of the text file
- MoveToFront reorderging of blocks
- Huffman/Arithmetic coding of blocks
Decoding in reverse order:
- H/A decoding
- Inverse MoveToFront
- Inverse B-W block-transformation

Remarks

Bzip2 implements the B-W coding with arithmetic code.
Disadvantage: input spit into blocks processed one at a time –> not continuously adaptative.

Burrows-Wheeler Transform:

Reversible transformation of a block of N symbols.
Principle: Sorts characters using context as sort key.
- chars occuring in similar context are close to each other.
Goal: make string as correlated as possible for the MTF to decorrelate the best.

Encoding

A circulant matrix of all circular shifts of the string is made.
Then the rows are lexicograpically sorted.
Output of the transform is the last column (real black magic no idea why we do this) + the index of the row containing the original input string (for decoding).

Decoding

We know the last column of the circulant matrix. We can recover the first column by sorting this last column (rows were sorted lexicographically). As the rows are the shifts of the string, the first and last columns form pairs of letters thar are consecutive in the original string. If we sort again those pairs, we’ll have the first two columns. Combine it with last column, we’ll have triples, we sort again and so on until we reach the end of the matrix. Once we have whole matrix, the index we kept during encoding enables to recover the original string.

MoveToFront

After B-W transform: clusters of chars in the string. Output is the index of the char in the alphabet. Once a char is met, it is moved at the front of the alphabet. Hence with repetitions 1 will occur very often. After the MTF, arithmetic/huffman coding can be applied for good compression results.

Run length codes

Used to describe binary input data. Encodes the first bit and then lengths of each groups of successive symbols in string. Encoding the length can be done using differents methods: Ellias codes, Gollomb-Rice, etc. To decode: odd and even runs.