Huffman coding
Definition
Huffman coding is a compression algorithm developed in 1952 by David Albert Huffman. It is a statistical compression method which, by means of a tree-based technique detailed below, encodes the most frequently occurring bytes with bit sequences much shorter than usual. The algorithm achieves the best possible compression ratios for symbol-by-symbol coding. To go further, one must turn to more complex methods that build a probabilistic model of the source and exploit this additional redundancy (Lempel-Ziv, arithmetic coding).
See also: MP3, Bzip2.
Principle
The principle of Huffman coding rests on building a tree made up of nodes. Suppose the text to be encoded is “Wikipédia”. One first counts the number of occurrences of each character (here the characters “a”, “d”, “é”, “k”, “p” and “W” each appear once and the character “i” three times). Each character is a leaf of the tree, to which one assigns a weight equal to its number of occurrences. The tree is then built according to a simple principle: at each step, the two nodes of lowest weight are joined to form a new node whose weight is the sum of the weights of its children, until only one node remains, the root. One then assigns the code 0 to the lighter branch of each node and the code 1 to the heavier one, as in the following diagram:
To obtain the binary code of each character, one walks down the tree from the root to the leaf, appending a 0 or a 1 to the code according to the branch followed. The codes must indeed be read starting from the root, because starting from the leaves would cause ambiguity at decoding time. Here, encoding “Wikipédia” gives in binary: 101 11 011 11 100 010 001 11 000, that is 24 bits.
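The construction described above can be sketched in Python as follows. This is only an illustrative sketch, not a reference implementation: the function names `build_codes`, `encode` and `decode` are assumptions made for this example, and the tree is represented with plain tuples. Nodes of equal weight may be merged in a different order than in the article's tree, so the individual codes can differ from those listed above, but the total length for “Wikipédia” is still 24 bits.

```python
import heapq
from collections import Counter
from itertools import count


def build_codes(text):
    """Return a {symbol: bitstring} prefix code built with Huffman's method."""
    freq = Counter(text)
    tie = count()  # tie-breaker so the heap never has to compare tree nodes
    # A leaf is a 1-tuple (symbol,); an internal node is a pair (light, heavy).
    heap = [(w, next(tie), (sym,)) for sym, w in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate one-symbol input: give it a 1-bit code
        return {heap[0][2][0]: "0"}
    while len(heap) > 1:
        w1, _, light = heapq.heappop(heap)  # lightest remaining node
        w2, _, heavy = heapq.heappop(heap)  # second lightest
        heapq.heappush(heap, (w1 + w2, next(tie), (light, heavy)))

    codes = {}

    def walk(node, prefix):
        if len(node) == 1:  # leaf: record the path followed from the root
            codes[node[0]] = prefix
        else:
            walk(node[0], prefix + "0")  # lighter branch gets 0
            walk(node[1], prefix + "1")  # heavier branch gets 1

    walk(heap[0][2], "")
    return codes


def encode(text, codes):
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Read bits from the root down; a prefix code makes this unambiguous."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


word = "Wikip\u00e9dia"
codes = build_codes(word)
bits = encode(word, codes)
print(len(bits), decode(bits, codes) == word)
```

The decoder works bit by bit because no code is a prefix of another, which is what reading the codes from the root guarantees.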
There are three variants of Huffman's algorithm, each defining a method for building the tree:
- static: each byte has a code predefined by the software. The tree does not need to be transmitted, but compression can only be performed on a single type of file (e.g. a French text, where the letter “e” is extremely frequent; it will therefore have a very short code, reminiscent of Morse code).
- semi-adaptive: the file is first read in order to count the occurrences of each byte, then the tree is built from the weights of each byte. This tree remains the same until the end of compression. The tree must be transmitted for decompression.
- adaptive: this is the method that in principle offers the best compression ratios, because the tree is built dynamically as the stream is compressed. It has the major drawback, however, of having to rebuild the tree each time, which implies a very long execution time.
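To illustrate the adaptive variant, here is a deliberately naive Python sketch in which the encoder and the decoder both start from the same counts and rebuild the whole code before every symbol, so that no tree needs to be transmitted. The names `codes_from_counts`, `adaptive_encode` and `adaptive_decode` are assumptions made for this example, as is the choice to start every symbol of a known alphabet with a count of 1. It is meant to show where the execution cost comes from, not to be an efficient adaptive algorithm.

```python
import heapq
from itertools import count


def codes_from_counts(counts):
    """Huffman code for a {symbol: weight} table, with deterministic ties."""
    tie = count()
    # Sorting by symbol makes encoder and decoder build the very same tree.
    heap = [(w, next(tie), (s,)) for s, w in sorted(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (a, b)))
    codes = {}
    stack = [(heap[0][2], "")]
    while stack:
        node, prefix = stack.pop()
        if len(node) == 1:  # leaf
            codes[node[0]] = prefix or "0"
        else:
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
    return codes


def adaptive_encode(text, alphabet):
    counts = {s: 1 for s in alphabet}  # shared starting point (assumption)
    bits = []
    for ch in text:
        bits.append(codes_from_counts(counts)[ch])  # code rebuilt per symbol
        counts[ch] += 1  # update the statistics after coding the symbol
    return "".join(bits)


def adaptive_decode(bits, alphabet):
    counts = {s: 1 for s in alphabet}
    out, pos = [], 0
    while pos < len(bits):
        inverse = {c: s for s, c in codes_from_counts(counts).items()}
        current = ""
        while current not in inverse:  # read bits until a code is recognised
            current += bits[pos]
            pos += 1
        out.append(inverse[current])
        counts[out[-1]] += 1  # mirror the encoder's update
    return "".join(out)


message = "aaabaaacaaad"
stream = adaptive_encode(message, "abcd")
print(adaptive_decode(stream, "abcd") == message)
```

Rebuilding the code for every symbol costs a full tree construction each time, which is the execution-time penalty mentioned for this variant.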
Limitations of Huffman coding
It can be shown that for a source X with entropy H(X), the average length L of a code word obtained by Huffman coding satisfies:
H(X) ≤ L < H(X) + 1
This relation, which shows that Huffman coding does approach the entropy of the source and hence the optimum, can in fact turn out to be of little use when the entropy of the source is low, where an overhead of 1 bit becomes significant. Moreover, Huffman coding requires a whole number of bits per source symbol, which can be inefficient.
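The following Python sketch compares the entropy with the average Huffman code length for two sources, to make the overhead of up to 1 bit concrete. The helper names `entropy`, `huffman_lengths` and `average_length` are assumptions made for this example, and the skewed two-symbol source with probabilities 0.99 and 0.01 is a hypothetical illustration, not taken from the article.

```python
import heapq
from itertools import count
from math import log2


def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * log2(p) for p in probs if p > 0)


def huffman_lengths(probs):
    """Code length of each symbol under Huffman's method (needs >= 2 symbols)."""
    tie = count()
    # Each heap entry carries the indices of the leaves below that node.
    heap = [(p, next(tie), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        for i in a + b:
            lengths[i] += 1  # each leaf under the merged node moves one level down
        heapq.heappush(heap, (p1 + p2, next(tie), a + b))
    return lengths


def average_length(probs):
    return sum(p * n for p, n in zip(probs, huffman_lengths(probs)))


# Character frequencies of "Wikipédia": one symbol seen 3 times, six seen once.
wiki = [3 / 9] + [1 / 9] * 6
# Hypothetical low-entropy source with two symbols.
skewed = [0.99, 0.01]

for name, probs in (("Wikipédia", wiki), ("skewed", skewed)):
    print(name, round(entropy(probs), 3), round(average_length(probs), 3))
```

For the “Wikipédia” frequencies the average length (24/9, about 2.67 bits) is very close to the entropy (about 2.64 bits). For the skewed source, each symbol still costs a full bit although the entropy is roughly 0.08 bit, which is the situation where the 1-bit overhead dominates.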
One solution to this problem is to work on blocks of N symbols. It can then be shown that the entropy can be approached more closely, with L now denoting the average length per source symbol:
H(X) ≤ L < H(X) + 1/N
but the process of estimating the probabilities becomes more complex and costly.
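The effect of coding blocks of N symbols can be checked numerically with the Python sketch below. It assumes a memoryless source (symbols drawn independently), so that the probability of a block is the product of the probabilities of its symbols; the function names and the example source with probabilities 0.9 and 0.1 are hypothetical choices made for this illustration.

```python
import heapq
from itertools import product
from math import log2, prod


def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)


def huffman_average_length(probs):
    """Average code word length of a Huffman code (needs >= 2 symbols).

    It equals the sum of the weights of all merged (internal) nodes.
    """
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


def bits_per_symbol(probs, n):
    """Huffman-code blocks of n independent symbols; return bits per source symbol."""
    block_probs = [prod(block) for block in product(probs, repeat=n)]
    return huffman_average_length(block_probs) / n


source = [0.9, 0.1]  # hypothetical memoryless binary source
h = entropy(source)
for n in range(1, 5):
    print(n, round(bits_per_symbol(source, n), 3), "entropy:", round(h, 3))
```

With this source the cost per symbol drops from 1 bit for N = 1 to about 0.645 bit for N = 2, while the entropy is about 0.469 bit. The number of block probabilities to estimate grows as the alphabet size to the power N, which is the added complexity and cost mentioned above.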
Moreover, Huffman coding is ill-suited to a source whose statistical properties change over time, since the symbol probabilities are then wrong. The solution of re-estimating the symbol probabilities at every iteration is impractical because of its complexity.
Anecdote
The first Macintosh from Apple used a Huffman-inspired code to represent text: the 15 most frequent characters of a language were coded on 4 bits, and the 16th 4-bit value served as a prefix for coding the other characters on a byte (so a character took sometimes 4 bits, sometimes 12 bits). This simple method proved to save 30% of space on an average text, at a time when random-access memory was still an expensive component.
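The scheme described in this anecdote can be sketched in Python as follows. This is an illustration of the idea only, not Apple's actual implementation: the table `FREQUENT` of 15 frequent characters, the choice of the value 15 as the escape prefix, and the function names are all hypothetical.

```python
ESCAPE = 0xF  # the 16th 4-bit value, used as a prefix (hypothetical choice)
FREQUENT = b" etaoinshrdlucm"  # 15 frequent characters (hypothetical table)


def encode_nibbles(data, frequent=FREQUENT):
    """Encode bytes as a list of 4-bit values (nibbles)."""
    assert len(frequent) == 15 and len(set(frequent)) == 15
    index = {b: i for i, b in enumerate(frequent)}
    nibbles = []
    for b in data:
        if b in index:
            nibbles.append(index[b])  # frequent character: 4 bits
        else:
            nibbles += [ESCAPE, b >> 4, b & 0xF]  # prefix + full byte: 12 bits
    return nibbles


def decode_nibbles(nibbles, frequent=FREQUENT):
    out = bytearray()
    it = iter(nibbles)
    for n in it:
        if n == ESCAPE:
            out.append((next(it) << 4) | next(it))  # rebuild the escaped byte
        else:
            out.append(frequent[n])
    return bytes(out)


text = b"the cat sat on a mat, twice"
packed = encode_nibbles(text)
print(len(packed) * 4, "bits instead of", len(text) * 8)
```

The saving depends entirely on how often the 15 table characters occur in the text; a rare character costs 12 bits instead of 8.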
Links
- Alternative: Shannon-Fano coding
Reference
- D.A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes", Proceedings of the I.R.E., September 1952, pp. 1098-1102