Compression methods fall into two types: lossy compression, also called non-conservative compression, and lossless compression.

Lossless compression

Compression is said to be lossless when no data from the original information is lost. There is as much information after compression as before; it is only rewritten more concisely (for example, gzip compression for any type of data, or the PNG format for synthetic images intended for the Web). Lossless compression is also called compaction.

The information to be compressed is viewed as the output of a symbol source that produces finite texts according to certain rules. The goal is to reduce the average size of the texts obtained after compression while still being able to recover the original message exactly (this is also called source coding, as opposed to channel coding, which refers to error-correcting coding).

Lossless compression file formats are recognized by the extension added to the end of the file name (“filename.zip”, for example), hence their very short names. The most common formats are:

  • 7z
  • ace
  • arc
  • arj
  • bz, bz2 (tar can be used to create files of this type)
  • CAB, used by Microsoft
  • gzip, gz (a single-entry file; tar can be used to create files of this type)
  • KGB
  • Lzh
  • rar
  • Z (mainly on Unix)
  • Zip
  • Zoo
  • FLAC (for audio streams)

The most common open standards are described in several RFCs:

  • RFC 1950 (ZLIB, compressed data stream)
  • RFC 1951 (DEFLATE, a block-based compression scheme used by zip and gz)
  • RFC 1952 (GZIP, compressed file format)
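As a small illustration of how these three RFCs relate, the following sketch uses Python's standard zlib and gzip modules to produce each kind of stream from the same input. The sample data is arbitrary, and the variable names are chosen here for illustration only.

```python
import gzip
import zlib

data = b"abracadabra " * 50  # arbitrary, repetitive sample input

# RFC 1950: a zlib stream (DEFLATE data with a small header and a checksum).
zlib_stream = zlib.compress(data)

# RFC 1951: raw DEFLATE data, selected in the zlib module by a negative wbits.
raw = zlib.compressobj(wbits=-15)
deflate_stream = raw.compress(data) + raw.flush()

# RFC 1952: the gzip file format (DEFLATE data inside a gzip header/trailer).
gzip_stream = gzip.compress(data)

# Lossless: every container gives back exactly the original bytes.
assert zlib.decompress(zlib_stream) == data
assert zlib.decompress(deflate_stream, wbits=-15) == data
assert gzip.decompress(gzip_stream) == data

print(len(data), len(zlib_stream), len(deflate_stream), len(gzip_stream))
```

All three streams carry the same DEFLATE payload; they differ only in the wrapper around it.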

For the limits of lossless compression, see Compressor paradox.

RLE coding

See also: Run-length encoding

RLE stands for run-length encoding. It is one of the simplest compression methods: any run of identical bits or characters is replaced by a pair (number of occurrences; repeated bit or character).
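The pair-based replacement described above can be sketched in a few lines of Python. This is a minimal illustration working on characters; the function names rle_encode and rle_decode are chosen here and do not come from any particular format.

```python
from itertools import groupby


def rle_encode(text):
    """Replace each run of identical characters by a (count, character) pair."""
    return [(len(list(run)), char) for char, run in groupby(text)]


def rle_decode(pairs):
    """Expand each (count, character) pair back into a run."""
    return "".join(char * count for count, char in pairs)


pairs = rle_encode("WWWWBBBWWWWWW")
print(pairs)  # [(4, 'W'), (3, 'B'), (6, 'W')]
print(rle_decode(pairs) == "WWWWBBBWWWWWW")  # True
```

Note that data without long runs gets larger under this scheme, since every single character still costs a full pair.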

CCITT compression

See also: CCITT compression

This is an image compression method used for fax. It can be RLE-based (runs of white and black pixels are coded) and bidirectional (a line is deduced from the preceding one). There are several types of compression ("groups"), depending on the algorithm used and the number of colors in the document (monochrome, grayscale, color).

Huffman coding

See also: Huffman coding

The idea behind Huffman coding is close to the one used in Morse code: encode what is frequent in little space, and encode what occurs rarely with longer sequences (entropy). In Morse code, "E", a very frequent letter, is coded by a single dot, the shortest of all the signs.

David A. Huffman's original contribution is an objective aggregation procedure that builds the code once the usage statistics of each character are known.
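The aggregation procedure can be sketched as follows: start from the character statistics and repeatedly merge the two least frequent groups, prefixing their codes with 0 and 1. This is a minimal sketch rather than a production encoder; the helper names (huffman_code, huffman_encode, huffman_decode) are chosen here, and the output is a string of "0"/"1" characters for readability.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Build a prefix code {char: bit string} from character frequencies."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:  # a single distinct symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {char: code built so far}).
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # the two rarest groups
        w2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def huffman_encode(text, code):
    return "".join(code[ch] for ch in text)


def huffman_decode(bits, code):
    inverse = {bits_: ch for ch, bits_ in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free: the first match is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)


code = huffman_code("abracadabra")
print(code)
print(len(huffman_encode("abracadabra", code)), "bits instead of", 8 * 11)
```

Frequent characters end up near the root of the merge tree and so receive the shortest codes.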

Apple's Macintosh encoded text with a Huffman-inspired scheme: the 15 most frequent letters (in the language used) were coded on 4 bits, and the 16th combination was an escape code indicating that the letter was coded in ASCII on the following 8 bits. This scheme compressed text by about 30% on average, at a time when memory was extremely expensive compared with current prices (by a factor of about 1000).
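A scheme of this kind can be sketched as below. The details are assumptions made for illustration: the frequency table is derived from the text itself rather than from a fixed per-language table, the input is assumed to be ASCII, and the output is a string of "0"/"1" characters. The names nibble_encode and nibble_decode are hypothetical.

```python
from collections import Counter

ESCAPE = 15  # the 16th 4-bit combination, used as the escape code


def nibble_encode(text, table):
    """4 bits for characters in the table, escape + 8 bits for the others."""
    index = {ch: i for i, ch in enumerate(table)}
    bits = []
    for ch in text:
        if ch in index:
            bits.append(format(index[ch], "04b"))
        else:
            bits.append(format(ESCAPE, "04b") + format(ord(ch), "08b"))
    return "".join(bits)


def nibble_decode(bits, table):
    out, pos = [], 0
    while pos < len(bits):
        value = int(bits[pos:pos + 4], 2)
        pos += 4
        if value == ESCAPE:  # the next 8 bits hold the character code
            out.append(chr(int(bits[pos:pos + 8], 2)))
            pos += 8
        else:
            out.append(table[value])
    return "".join(out)


text = "the rain in spain stays mainly in the plain"
table = [ch for ch, _ in Counter(text).most_common(15)]  # at most 15 entries
bits = nibble_encode(text, table)
print(len(bits), "bits instead of", 8 * len(text))
print(nibble_decode(bits, table) == text)
```

A character outside the table costs 12 bits instead of 8, so the gain depends on how much of the text the 15 table entries cover.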

Lempel-Ziv 1977 (LZ or LZ77)

See also: LZ77 and LZ78

Lempel-Ziv compression replaces recurring patterns with references to their first occurrence.

It gives lower compression ratios than other algorithms (PPM, CM), but has the double advantage of being fast and asymmetric (that is, the decompression algorithm differs from the compression algorithm, which can be exploited to pair a powerful compression algorithm with a fast decompression algorithm).

LZ77 is notably the basis of widespread algorithms such as Deflate (Zip, gzip) and LZMA (7-Zip).
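The following is a minimal sketch of the LZ77 idea, not the variant used by Deflate or LZMA. It assumes a naive search over a sliding window and emits (offset, length, next character) triples; the function names and the window size of 255 are choices made here. It also shows the asymmetry mentioned above: the compressor searches for matches, while the decompressor only copies.

```python
def lz77_compress(data, window=255):
    """Encode a string as (offset, length, next_char) triples."""
    i, triples = 0, []
    while i < len(data):
        best_offset, best_length = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Stop one character early so a literal always follows the match.
            while (i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = i - j, length
        triples.append((best_offset, best_length, data[i + best_length]))
        i += best_length + 1
    return triples


def lz77_decompress(triples):
    out = []
    for offset, length, char in triples:
        for _ in range(length):
            out.append(out[-offset])  # copy one by one so matches may overlap
        out.append(char)
    return "".join(out)


text = "abracadabra abracadabra abracadabra"
triples = lz77_compress(text)
print(triples)
print(lz77_decompress(triples) == text)
```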

Lempel-Ziv 1978 and Lempel-Ziv-Welch (LZ78 and LZW)

See also: Lempel-Ziv-Welch

Lempel-Ziv-Welch compression is a dictionary-based method. It relies on the fact that some patterns occur more often than others and can therefore be replaced by an index into a dictionary. The dictionary is built dynamically from the patterns encountered.
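The dynamic dictionary construction can be sketched as follows. This is a textbook-style illustration that assumes an initial dictionary of the 256 single bytes and an unbounded dictionary size; real formats also fix a code width and a reset policy, which are omitted here.

```python
def lzw_compress(data):
    """Encode bytes as a list of dictionary indices."""
    dictionary = {bytes([i]): i for i in range(256)}
    current, codes = b"", []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate  # keep extending the known pattern
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = len(dictionary)  # learn the new pattern
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes):
    """Rebuild the same dictionary while decoding; it is never transmitted."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:  # code refers to the entry currently being built
            entry = previous + previous[:1]
        out.append(entry)
        dictionary[len(dictionary)] = previous + entry[:1]
        previous = entry
    return b"".join(out)


codes = lzw_compress(b"TOBEORNOTTOBEORTOBEORNOT")
print(codes)
print(lzw_decompress(codes))
```

Indices from 256 upward stand for multi-byte patterns learned along the way.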

Burrows-Wheeler transform (BWT)

See also: Burrows-Wheeler transform

This is a method of reorganizing data, not a compression method. It is mainly intended to make natural-language text easier to compress, but it can also be used with any binary data. The transform, which is fully reversible, sorts all rotations of the source text, which tends to group identical characters together in the output, so that a simple compression method applied to the transformed data is often very effective.
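The rotation sort and its inverse can be sketched naively as below. The sketch assumes an end marker that does not occur in the text and that sorts before every other character; practical implementations use suffix arrays instead of building all rotations.

```python
def bwt(text, end="\0"):
    """Sort all rotations of text + end marker and keep the last column."""
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, end="\0"):
    """Rebuild the sorted rotation table one column at a time."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]


transformed = bwt("banana", end="$")
print(transformed)  # annb$aa
print(inverse_bwt(transformed, end="$"))  # banana
```

The output has the same length as the input plus the marker; the gain comes only from the later compression stage, which benefits from the grouped characters.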

Prediction by partial matching (PPM)

See also: Prediction by partial matching

Prediction by partial matching relies on context modeling to estimate the probability of each symbol. Knowing the content of part of a data source (file, stream…), a PPM model can guess what follows, more or less accurately. A PPM model can, for example, feed an arithmetic coder.
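The context modeling idea can be illustrated with a deliberately simplified predictor. It is not a full PPM implementation: it has no escape probabilities and simply falls back to a shorter context when the longer one has never been seen. The names train and predict, and the maximum order of 2, are assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def train(text, max_order=2):
    """Count which symbol follows each context of length 0..max_order."""
    counts = defaultdict(Counter)
    for i, symbol in enumerate(text):
        for order in range(max_order + 1):
            if i >= order:
                counts[text[i - order:i]][symbol] += 1
    return counts


def predict(counts, history, max_order=2):
    """Return {symbol: probability} using the longest context already seen."""
    for order in range(min(max_order, len(history)), -1, -1):
        context = history[len(history) - order:]
        if context in counts:
            total = sum(counts[context].values())
            return {s: c / total for s, c in counts[context].items()}
    return {}


counts = train("abracadabra")
print(predict(counts, "ab"))  # after "ab", only "r" has been seen
print(predict(counts, "a"))   # after "a": "b", "c" or "d"
```

A coder driven by these probabilities spends few bits on symbols the model predicts well.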

Prediction by partial matching generally gives better compression ratios than Lempel-Ziv-based algorithms, but is noticeably slower.

Note: PPM is also used for command autocompletion in some Unix systems.

Arithmetic coding

See also: Arithmetic coding

Arithmetic coding is fairly similar to Huffman coding in that it assigns the shortest codes to the most probable patterns (entropy). Unlike Huffman coding, which spends at least 1 bit on every symbol, arithmetic coding can spend less than one bit on a highly probable symbol and, in the limit, produce an empty code. The resulting compression ratio is therefore better.
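The principle can be sketched with exact fractions: the whole message is mapped to a sub-interval of [0, 1) whose width is the product of the symbol probabilities, so probable symbols shrink the interval only slightly. This is an illustration under simplifying assumptions (a static model taken from the message itself, a known message length, and exact rational arithmetic instead of the fixed-precision integers used in practice). The function names are chosen here.

```python
from collections import Counter
from fractions import Fraction


def build_intervals(text):
    """Give each symbol a slice of [0, 1) proportional to its frequency."""
    freq = Counter(text)
    intervals, low = {}, Fraction(0)
    for symbol in sorted(freq):
        width = Fraction(freq[symbol], len(text))
        intervals[symbol] = (low, low + width)
        low += width
    return intervals


def arithmetic_encode(text, intervals):
    """Narrow [low, high) once per symbol; any number inside identifies text."""
    low, high = Fraction(0), Fraction(1)
    for symbol in text:
        span = high - low
        sym_low, sym_high = intervals[symbol]
        low, high = low + span * sym_low, low + span * sym_high
    return low, high


def arithmetic_decode(value, length, intervals):
    out = []
    for _ in range(length):
        for symbol, (sym_low, sym_high) in intervals.items():
            if sym_low <= value < sym_high:
                out.append(symbol)
                # Rescale so the chosen slice becomes [0, 1) again.
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return "".join(out)


text = "abracadabra"
intervals = build_intervals(text)
low, high = arithmetic_encode(text, intervals)
print(high - low)  # width of the final interval
print(arithmetic_decode(low, len(text), intervals))
```

The narrower the final interval, the more bits are needed to name a number inside it, which is how improbable messages end up with longer codes.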

Context mixing (CM)

See also: Context mixing

Context mixing uses several predictors (for example, PPM models) to obtain the most reliable possible estimate of the next symbol in a data source (file, stream…). In its basic form it can be done with a weighted average, but the best results are obtained with machine learning methods such as neural networks.
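The basic weighted-average form can be sketched as follows. This covers only the simple case; the machine learning variants adjust the weights from the data, which is not shown. The function name mix_distributions and the example weights are assumptions made for illustration.

```python
def mix_distributions(distributions, weights):
    """Weighted average of several {symbol: probability} predictions."""
    total = sum(weights)
    symbols = set().union(*distributions)
    return {
        symbol: sum(w * d.get(symbol, 0.0)
                    for w, d in zip(weights, distributions)) / total
        for symbol in symbols
    }


# Two hypothetical predictors disagree about the next symbol.
predictor_a = {"a": 1.0}
predictor_b = {"a": 0.5, "b": 0.5}
print(mix_distributions([predictor_a, predictor_b], weights=[1, 1]))
```

Giving a larger weight to the predictor that has been more accurate so far pulls the mixed estimate toward it.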

Context mixing is very effective in terms of compression ratio, but becomes slower as the number of contexts grows.

Currently, the best compression ratios are obtained by algorithms that combine context mixing and arithmetic coding, such as PAQ.

Lossy compression

Lossy compression applies only to “perceptual” data, usually audio or visual, which can undergo a modification, sometimes a substantial one, without a human perceiving it. The loss of information is irreversible: the original data cannot be recovered after such compression. For this reason, lossy compression is sometimes called irreversible or non-conservative compression.

This technique rests on a simple idea: only a very small subset of all possible images (those that would be obtained, for example, by drawing the value of each pixel at random) is usable and informative to the eye. These are the images one tries to encode compactly. In practice, the eye needs to identify regions in which neighboring pixels are correlated, that is, contiguous regions of similar colors. Compression programs try to find these regions and encode them as compactly as possible. The JPEG 2000 standard, for example, generally manages to encode photographic images at 1 bit per pixel with no visible loss of quality on screen, which is a compression factor of 24 to 1 relative to 24 bits per pixel.

Since the eye does not necessarily perceive every detail of an image, the amount of data can be reduced so that the result looks very close to the original, or even identical, to the human eye. The challenge of lossy compression is to identify transformations of the image or sound that reduce the amount of data while preserving perceptual quality.

Likewise, only a very small subset of possible sounds is usable by the ear, which needs regularities that themselves create redundancy (faithfully encoding hiss would be of little interest). An encoding that removes this redundancy and restores it on arrival therefore remains acceptable, even if the restored sound is not identical in every respect to the original.

Three main families of lossy compression can be distinguished:

  • by prediction, for example ADPCM;
  • by transformation; these are the most effective and most widely used methods (JPEG, JPEG 2000, all the MPEG standards…);
  • compression based on the fractal recurrence of patterns (fractal compression).
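The prediction family can be illustrated with a plain DPCM sketch, a simpler relative of ADPCM: each sample is predicted from the previous reconstructed sample, and only the quantized prediction error is stored. The assumptions here are a fixed quantization step (ADPCM adapts it over time), integer samples, and the hypothetical names dpcm_encode and dpcm_decode.

```python
def dpcm_encode(samples, step=4):
    """Store the quantized difference between each sample and its prediction."""
    codes, predicted = [], 0
    for sample in samples:
        code = round((sample - predicted) / step)
        codes.append(code)
        predicted += code * step  # track what the decoder will reconstruct
    return codes


def dpcm_decode(codes, step=4):
    out, predicted = [], 0
    for code in codes:
        predicted += code * step
        out.append(predicted)
    return out


samples = [0, 3, 10, 18, 21, 19, 12, 5]
codes = dpcm_encode(samples)
print(codes)
print(dpcm_decode(codes))  # close to the input, but not identical
```

The codes are small numbers that are cheap to store, and the reconstruction error stays within half a quantization step; a larger step gives more compression and more loss.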

MPEG formats are lossy compression formats for video sequences. As such, they include audio coders, such as the well-known MP3 or AAC, which can perfectly well be used on their own, and of course video coders, generally referred to simply by the standard they belong to (e.g. MPEG-2, MPEG-4), as well as solutions for synchronizing audio and video streams and for transporting them over various types of networks.

Summary

  • Note 1: Some algorithms may be patented.

  • Note 2: The TIFF format encapsulates an image coding method, which may or may not be compressed, using one of the algorithms mentioned.
  • Note 3: JPEG 2000 has a lossless mode (using a reversible wavelet transform) in addition to the standard lossy mode, hence its presence in both parts of the table.
