See also: Zipf

Zipf's law is an empirical observation about the frequency of words in a text. It is named after its author, George Kingsley Zipf (1902-1950). The law was later generalized by Benoît Mandelbrot.

Genesis

Zipf had set out to analyze a monumental work by James Joyce, Ulysses, to count its distinct words, and to list them in decreasing order of number of occurrences. Legend has it that

  • the most common word occurred 8,000 times;
  • the tenth word, 800 times;
  • the hundredth, 80 times;
  • and the thousandth, 8 times.

These results seem, in the light of other studies that anyone can carry out in a few minutes on a computer, a little too neat to be strictly true: in a study of this kind the tenth word should appear about 1,000 times, because of an elbow effect observed in this kind of distribution. The fact remains that Zipf's law states that in a given text, the frequency of occurrence f(n) of a word is related to its rank n in the order of frequencies by a law of the form f(n) × n = K, where K is a constant.
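The count that Zipf is said to have made by hand can be sketched in a few lines of Python. This is only a minimal illustration: the function name `rank_frequency`, the tokenizing rule (runs of lowercase letters and apostrophes) and the sample sentence are assumptions made here, not part of Zipf's study.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count, rank * count) tuples, most frequent word first."""
    # Assumed tokenizer: lowercase the text and keep runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by decreasing count; ties are broken alphabetically for reproducibility.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count, rank * count)
            for rank, (word, count) in enumerate(ranked, start=1)]

# Hypothetical sample; a real test needs a long text such as a whole novel.
sample = "the cat and the dog and the bird saw the cat"
table = rank_frequency(sample)
for rank, word, count, product in table:
    print(rank, word, count, product)
```

On a text long enough, the last column (rank times count) is the quantity that the law expects to stay roughly constant; a sentence as short as the sample above cannot show this.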

Theoretical point of view

Mathematically, the classical version of Zipf's law cannot hold exactly if a language contains infinitely many words, since for any proportionality constant C > 0, the sum of all the relative frequencies is proportional to the harmonic series and therefore diverges instead of summing to 1:

\sum_{n=1}^{\infty} \frac{C}{n} = \infty \neq 1.

Observations cited by Léon Brillouin in his book Science and Information Theory suggested that in English, the frequencies of roughly the 1,000 most frequently used words were approximately proportional to 1/n^s, with s just slightly larger than 1.

As long as the exponent s exceeds 1, such a law can hold with infinitely many words, since if s > 1 then

\sum_{n=1}^{\infty} \frac{1}{n^s} < \infty.

The value of this sum is ζ(s), where ζ is the Riemann zeta function.
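The contrast between the two series can be checked numerically. The sketch below compares partial sums for s = 1 (the harmonic series) and for s = 2; the value s = 2 is chosen here only because its limit, ζ(2) = π²/6, is known in closed form, and is not the exponent observed for English. The helper name `partial_sum` is an assumption of this example.

```python
import math

def partial_sum(s, terms):
    """Sum of 1 / n**s for n = 1 .. terms."""
    return sum(1.0 / n**s for n in range(1, terms + 1))

for terms in (10, 1_000, 100_000):
    harmonic = partial_sum(1.0, terms)    # keeps growing, roughly like ln(terms)
    convergent = partial_sum(2.0, terms)  # settles towards zeta(2)
    print(terms, round(harmonic, 3), round(convergent, 6))

# Closed-form limit for s = 2, for comparison with the last column.
print(math.pi**2 / 6)
```

The first column of sums keeps increasing with the number of terms, while the second stabilizes, which is the behaviour the two formulas above describe.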

It is known, however, that the number of words in a language is finite. The vocabulary of a 10-year-old child is around 5,000 words, that of an educated adult around 70,000, and multi-volume dictionaries can contain from 130,000 to 200,000.

A special case of a more general law

Benoît Mandelbrot showed in the 1950s that a law similar to Zipf's could result from two considerations related to Claude Shannon's information theory.

Shannon's static law

According to the static law, the cost of representing information grows as the logarithm of the number of items to be distinguished.

For example, 5 bits are needed to represent the numbers from 0 to 31, but 16 for the numbers from 0 to 65535. In the same way, one can form 17,576 three-letter acronyms, but 456,976 four-letter ones, and so on.
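The figures in this example can be verified with a short calculation. In the sketch below, the function name `bits_needed` is an assumption of this illustration; it returns the smallest number of bits able to distinguish a given number of values, and the acronym counts assume a 26-letter alphabet.

```python
def bits_needed(distinct_values):
    """Smallest number of bits able to distinguish `distinct_values` items."""
    if distinct_values < 1:
        raise ValueError("need at least one value")
    # The largest index is distinct_values - 1; its binary length is the answer.
    return (distinct_values - 1).bit_length()

print(bits_needed(32))      # numbers 0 .. 31
print(bits_needed(65536))   # numbers 0 .. 65535
print(26**3, 26**4)         # three- and four-letter acronyms, 26-letter alphabet
```

Doubling the number of values to distinguish adds only one bit, which is the logarithmic growth the static law refers to.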

Shannon's dynamic law

The dynamic law indicates how to maximize the utility of a channel by maximizing entropy, using the least expensive symbols first: in Morse code, E, a frequent letter, is coded by a single dot (.), while X, a rarer letter, is represented by dash-dot-dot-dash (-..-). Huffman coding applies this dynamic law.
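The principle can be illustrated with a compact Huffman construction. This is a minimal sketch under stated assumptions: the function name `huffman_code_lengths` and the sample string are invented for the example, and only the code lengths are computed, not the bit patterns themselves.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built from `text`."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single symbol still needs one bit to be transmitted.
        return {symbol: 1 for symbol in counts}
    # Heap entries: (weight, tie_breaker, {symbol: depth in the tree so far}).
    heap = [(weight, i, {symbol: 0})
            for i, (symbol, weight) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol in them gets one bit deeper.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

# Hypothetical message in which "e" is frequent and "x" is rare.
lengths = huffman_code_lengths("eeeeeeeettttaaax")
print(lengths)
```

As with Morse code, the most frequent symbol receives the shortest code and the rarest symbols the longest ones.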

Mandelbrot's synthesis

Mandelbrot puts forward the bold hypothesis that the cost of use is directly proportional to the cost of storage, which he notes holds true for every device he observed, from written entries to computers.

He thus eliminates the cost between the two equations and obtains a family of equations that necessarily link the frequency of a word to its rank if the channel is to be used optimally. This is Mandelbrot's law, of which Zipf's law is only a special case, and which is given by:

f(n) \times (a + bn)^c = K, where K is a constant.

The law reduces to Zipf's in the special case where a equals 0 and b and c both equal 1, a case that is not encountered in practice. In most existing languages, c is close to 1.1 or 1.2, and close to 1.6 in the language of children.
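The relation between the two laws can be made concrete with a small numerical sketch. The function names and the value K = 8000 (borrowed from the Ulysses legend) are assumptions of this example, as are the parameter values a = 2 and c = 1.1 used in the last line.

```python
def mandelbrot_frequency(n, k, a, b, c):
    """Frequency predicted for rank n by f(n) * (a + b*n)**c = K."""
    return k / (a + b * n) ** c

def zipf_frequency(n, k):
    """Frequency predicted for rank n by f(n) * n = K."""
    return k / n

k = 8000  # assumed constant, taken from the legendary Ulysses figures
for n in (1, 10, 100, 1000):
    # With a = 0, b = 1, c = 1 the two predictions coincide.
    print(n, zipf_frequency(n, k), mandelbrot_frequency(n, k, a=0, b=1, c=1))

# Hypothetical non-zero a and c > 1: the top ranks are flattened.
print(mandelbrot_frequency(1, k, a=2, b=1, c=1.1))
```

A positive a mainly lowers the predicted frequency of the very first ranks, while its effect fades for large n, where a + bn is dominated by bn.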

Zipf's and Mandelbrot's laws look spectacular when plotted on log-log axes: Zipf's law then corresponds to a straight line, and Mandelbrot's to the same line with a characteristic bump. This bump is indeed found in literary texts available on the Net, which can be analyzed in a few minutes on a home computer with a language such as Python. The curve provided here plots the base-10 logarithm of the number of occurrences of the terms of a Web forum against the base-10 logarithm of the rank of these words.

  • The most frequent word appears a little more than 100,000 times (10^5).
  • The size of the vocabulary actually used (it would be more accurate to speak of the size of the set of inflected forms) is about 60,000 (roughly 10^4.7).
  • Zipf's linear aspect appears clearly, although the characteristic elbow explained by Mandelbrot is only slight here. Note also that the slope is not exactly −1, as Zipf's law would require.
  • The projected intersection of this curve with the x-axis would provide, from a text of limited size (a few typed A4 pages), an estimate of the extent of a writer's vocabulary.
    • We already make the same estimate subjectively when reading a few pages by a writer we do not know; this is what lets us judge, by leafing through a work, whether its vocabulary matches our own.
    • The repetition of would-be erudite words such as extemporaneously or hieratic will fool no one, since it is the repetition itself that betrays the poverty of the vocabulary, not the words used, whatever they may be.
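The coordinates of such a log-log curve are easy to compute. The sketch below is an illustration on synthetic data only: the counts are generated to follow Zipf's law exactly with an assumed K = 100,000, and the names `loglog_points` and `slope_between` are invented for the example. The slope between two chosen ranks is a crude reading of the curve, not an estimate of the law's parameters.

```python
import math

def loglog_points(counts):
    """Return (log10 rank, log10 count) pairs for counts sorted in decreasing order."""
    ordered = sorted(counts, reverse=True)
    return [(math.log10(rank), math.log10(count))
            for rank, count in enumerate(ordered, start=1) if count > 0]

def slope_between(points, i, j):
    """Slope of the straight line joining points i and j of the log-log curve."""
    (x1, y1), (x2, y2) = points[i], points[j]
    return (y2 - y1) / (x2 - x1)

# Synthetic counts following f(n) * n = K exactly, with K = 100000 (assumed).
counts = [100000 / n for n in range(1, 1001)]
points = loglog_points(counts)
print(slope_between(points, 0, 999))
```

On real word counts the same reading taken between different pairs of ranks will generally give different slopes, which is one way of seeing the departure from a pure straight line.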

Similarity

The relationship between Zipf's and Mandelbrot's laws on the one hand, and between the Boyle-Mariotte law and the van der Waals equation on the other, is similar: in each pair the first is a law of hyperbolic type, and the second a slight correction that accounts for the gap between what was predicted and what is observed, and that proposes a justification. In both cases, one element of the correction is the introduction of a constant expressing something "incompressible" (in Mandelbrot's law, the term a).

A law to be used with caution

It is tempting, whenever one sees data ranked in decreasing order, to tell oneself: "They must follow Zipf's law." Without this necessarily being false, it would be dangerous to take it for granted. If, for example, we draw 100 random integers between 1 and 10 from a uniform distribution, group them, and sort the numbers of occurrences of each, we obtain the curve shown opposite.

Admittedly, going by a first visual impression alone, this curve looks "very Zipfian", whereas a quite different model generated the data. Yet it is not practical to run a chi-squared test against Zipf's law, because the sorting of the values gets in the way of using a classical probabilistic model (let us not forget that the distribution of the occurrences is not that of the probabilities of occurrence, which can lead to many inversions in the sorting).
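The experiment described above is easy to reproduce. The sketch below is one possible way of doing so; the function name `sorted_occurrences` and the fixed seed (used only to make runs repeatable) are assumptions of this example.

```python
import random
from collections import Counter

def sorted_occurrences(draws=100, low=1, high=10, seed=0):
    """Draw uniform random integers and return their occurrence counts, largest first."""
    rng = random.Random(seed)  # seeded generator, so the run is repeatable
    values = [rng.randint(low, high) for _ in range(draws)]
    counts = Counter(values)
    return sorted(counts.values(), reverse=True)

occurrences = sorted_occurrences()
for rank, count in enumerate(occurrences, start=1):
    print(rank, count)
```

Because the counts are sorted after the draw, the resulting sequence is decreasing by construction, whatever distribution produced the values; this is the effect the paragraph above warns about.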

The family of Mandelbrot distributions is certainly shown to be formally adequate for a human language under its starting assumptions about the cost of storage and the cost of use, which themselves follow from information theory. On the other hand, it has not been proven that Zipf's law is a relevant model for the distribution of the populations of a country's urban areas, although the contrary has not been proven either.

Let us add that estimating Mandelbrot's parameters from a data set is also problematic, and is still debated today. It is quite out of the question, for example, to apply a least-squares method to a log-log curve in which, moreover, the weights of the points are far from comparable. Mandelbrot himself has apparently published nothing new on the subject since the end of the 1960s.


© 2007-2008 speedlook.com; article text available under the terms of GFDL, from fr.wikipedia.org