See also: Zipf
Zipf's law is an empirical observation about the frequency of words in a text. It is named after its author, George Kingsley Zipf (1902-1950). The law was later generalized by Benoît Mandelbrot.
Zipf had set out to analyze a monumental work by James Joyce, Ulysses, to count its distinct words, and to list them in decreasing order of number of occurrences. Legend has it that
In the light of other studies that anyone can carry out in a few minutes on a computer, these results seem a little too neat to be strictly true: the tenth word, in a study of this kind, should appear around 1,000 times, because of an elbow effect observed in this kind of distribution. The fact remains that Zipf's law states that, in a given text, the frequency of occurrence $f(n)$ of a word is related to its rank $n$ in the order of frequencies by a law of the form $f(n) \times n = K$, where $K$ is a constant.
Mathematically, the classical version of Zipf's law cannot hold exactly if a language contains infinitely many words: the relative frequencies must sum to 1, yet for any proportionality constant C > 0 the sum of all the relative frequencies is proportional to the harmonic series and must therefore be infinite, since $\sum_{n=1}^{\infty} \frac{1}{n} = \infty$.
Observations quoted by Léon Brillouin in his book Science and Information Theory suggested that, in English, the frequencies of roughly the 1,000 most frequently used words were approximately proportional to $1/n^s$, with $s$ just slightly larger than 1.
As long as the exponent $s$ exceeds 1, such a law can hold with infinitely many words, since if $s > 1$ then $\sum_{n=1}^{\infty} \frac{1}{n^s} < \infty$.
The value of this sum is $\zeta(s)$, where $\zeta$ is the Riemann zeta function.
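The contrast between the two cases can be checked numerically. The following is only an illustrative sketch: the helper name `partial_sum` and the cut-off values are chosen here for the example and do not come from any source.

```python
import math

def partial_sum(s, terms):
    """Sum of 1/n**s for n = 1 .. terms (hypothetical helper for this example)."""
    return sum(1.0 / n ** s for n in range(1, terms + 1))

# s = 1 (harmonic series): the partial sums keep growing, roughly like ln(terms).
harmonic_small = partial_sum(1, 1_000)
harmonic_large = partial_sum(1, 100_000)

# s = 2: the partial sums settle near zeta(2) = pi**2 / 6.
zeta2_estimate = partial_sum(2, 100_000)
zeta2_exact = math.pi ** 2 / 6

print(harmonic_small, harmonic_large)
print(zeta2_estimate, zeta2_exact)
```

Multiplying the number of terms by 100 adds about ln(100) to the harmonic partial sum, whereas for s = 2 the extra terms barely change the total.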
However, the number of words in a language is known to be finite. The vocabulary of a 10-year-old child is around 5,000 words, that of an educated adult around 70,000, and multi-volume dictionaries may contain from 130,000 to 200,000.
Benoît Mandelbrot showed in the 1950s that a law similar to Zipf's could follow from two considerations related to Claude Shannon's information theory.
For example, 5 bits are needed to represent the numbers from 0 to 31, but 16 for the numbers from 0 to 65,535. In the same way, one can form 17,576 three-letter acronyms, but 456,976 four-letter ones, and so on.
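These figures are easy to reproduce. The sketch below assumes a 26-letter alphabet, which is what the acronym counts imply; the function names `bits_needed` and `letter_strings` are made up for the illustration.

```python
def bits_needed(max_value):
    """Number of bits required to write every integer from 0 to max_value."""
    return max(1, max_value.bit_length())

def letter_strings(length, alphabet_size=26):
    """Number of distinct strings of the given length (26 letters assumed)."""
    return alphabet_size ** length

print(bits_needed(31))      # 5
print(bits_needed(65535))   # 16
print(letter_strings(3))    # 17576
print(letter_strings(4))    # 456976
```

In both cases the length of a code grows only logarithmically with the number of items it must distinguish.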
One thus eliminates the cost between the two equations and obtains a family of equations that necessarily tie the frequency of a word to its rank if the channel is to be used optimally. This is Mandelbrot's law, of which Zipf's law is only a special case, and which is given by:
$f(n) = \frac{K}{(a + b n)^c}$
where $K$ is a constant.
The law reduces to Zipf's in the special case where $a$ equals 0 and $b$ and $c$ both equal 1, a case that does not occur in practice. In most existing languages, $c$ is close to 1.1 or 1.2, and it is close to 1.6 in children's language.
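As a rough illustration, taking Mandelbrot's law in the form f(n) = K / (a + b n)^c, the special case a = 0, b = c = 1 can be compared with Zipf's f(n) = K / n. The function names and the values K = 1000 and a = 2 are arbitrary choices for this sketch, and c = 1.1 is simply one of the orders of magnitude quoted in the text.

```python
def mandelbrot_frequency(n, K, a, b, c):
    """Frequency predicted for the word of rank n: K / (a + b*n)**c."""
    return K / (a + b * n) ** c

def zipf_frequency(n, K):
    """Zipf's law: f(n) * n = K."""
    return K / n

K = 1000.0  # arbitrary constant, for illustration only

for n in (1, 10, 100, 1000):
    zipf = zipf_frequency(n, K)
    special_case = mandelbrot_frequency(n, K, a=0.0, b=1.0, c=1.0)
    corrected = mandelbrot_frequency(n, K, a=2.0, b=1.0, c=1.1)  # a is arbitrary
    print(n, zipf, special_case, corrected)
```

With a = 0 and b = c = 1 the two columns coincide; a positive a mainly lowers the predicted frequencies of the first few ranks.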
The laws of Zipf and Mandelbrot look striking when plotted on log-log axes: Zipf's law then corresponds to a neat straight line, and Mandelbrot's to the same line with a characteristic bump. This bump is found precisely in the literary texts available on the Net, which can be analyzed in a few minutes on a home computer with languages such as Python. The curve provided here represents the base-10 logarithm of the number of occurrences of the terms of a Web forum, plotted against the base-10 logarithm of the rank of these words.
Note that the most frequent word appears there a little more than 100,000 times ($10^5$).
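A minimal sketch of such an analysis in Python might look as follows. It is not the script that produced the curve: the tokenization (lowercased runs of ASCII letters and apostrophes) is a simplifying assumption, and the function names and sample sentence are invented for the example.

```python
import math
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, count) pairs, most frequent word first."""
    # Simplified tokenization: lowercase runs of ASCII letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))

def log_log_points(pairs):
    """Map (rank, count) pairs to (log10 rank, log10 count) for plotting."""
    return [(math.log10(rank), math.log10(count)) for rank, count in pairs]

sample = "the cat and the dog and the bird"  # toy text, for illustration only
pairs = rank_frequency(sample)
points = log_log_points(pairs)
print(pairs)
print(points)
```

On a real corpus, the resulting points can be passed to any plotting tool to obtain the log-log curve described above.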
The relationship between the laws of Zipf and Mandelbrot on the one hand, and between the laws of Mariotte and van der Waals on the other, is similar: in the first case of each pair there is a law of hyperbolic type, and in the second a slight correction that accounts for the gap between what was predicted and what is observed, and that proposes a justification for it. In both cases, one element of the correction is the introduction of a constant expressing something “incompressible” (in Mandelbrot's law, the term $a$).
It must be admitted that, going by a first visual impression alone, this curve looks “very Zipfian”, whereas an entirely different model generated the data series. However, it is not practical to run a chi-squared test on Zipf's law, because the sorting of the values gets in the way of using a classical probabilistic model (let us not forget that the distribution of the occurrences is not that of the probabilities of occurrence, which can lead to many inversions in the sorting).
Mandelbrot's family of distributions has certainly been shown to be formally adequate for a human language under its initial assumptions concerning storage cost and usage cost, which themselves derive from information theory. On the other hand, it has not been proven that Zipf's law is a relevant model for the distribution of the populations of a country's urban areas, although the opposite has not been proven either.
Let us add that estimating Mandelbrot's parameters from a data set is also problematic, and is still a matter of debate today. It is out of the question, for example, to apply a least-squares method to a log-log curve, on which, moreover, the points are far from carrying comparable weight. Mandelbrot himself apparently published nothing new on the subject after the end of the 1960s.