The lemmatization of a word is the function that maps the word to its canonical form.

General information

The words (lemmas) of a language take several forms depending on their gender (masculine or feminine), their number (singular or plural), their person (first, second, third,…) and their mood (indicative, imperative,…), so that a single lemma gives rise to several forms.

Lemmatizing a form of a word consists in taking its canonical form, which is defined as follows:

  • for a verb: the verb in the infinitive,
  • for other words: the word in the masculine singular.

Note that all the entries of a dictionary are therefore lemmatized, and that the same holds for the titles of Wikipedia articles (at least those consisting of a single lemma).

Examples

The French adjective petit ("small") exists in four forms: petit, petite, petits and petites. The canonical form of all these words is petit.

The French verb avoir ("to have") has many more forms: ai, as, a, avons, eu, avais eu, aurais eu, etc. The canonical form of avais eu ("had had") is avoir.
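The mapping from forms to lemmas described above can be sketched as a simple lookup table. This is a minimal illustration, not a real lemmatizer: the table `LEMMAS` and the function `lemmatize` are hypothetical names, and the entries are a small hand-written sample of French forms of petit ("small") and avoir ("to have").

```python
# Hypothetical, hand-written table: each inflected form maps to its lemma.
# A real lemmatizer would use a full lexicon or morphological rules.
LEMMAS = {
    "petit": "petit",
    "petite": "petit",
    "petits": "petit",
    "petites": "petit",
    "ai": "avoir",
    "as": "avoir",
    "a": "avoir",
    "avons": "avoir",
    "eu": "avoir",
}


def lemmatize(form: str) -> str:
    """Return the canonical form of `form`, or the lowercased form if unknown."""
    key = form.lower()
    return LEMMAS.get(key, key)
```

A table of this kind ignores ambiguity: a form such as "as" can also be a noun, so a practical system would also need the context of the word.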

Use in computing

In computing, it is difficult for a program to know that "had had" and "to have" are two facets of the same term. Lemmatization is thus a preliminary step in recognizing the words of a sentence.
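One way a program might apply this preliminary step is sketched below. Since a compound form such as the French avais eu ("had had") spans two tokens, the sketch looks for the longest known form starting at each position. The table `FORMS` and the function `lemmatize_tokens` are hypothetical, and the entries are a tiny hand-written sample rather than a real lexicon.

```python
# Hypothetical table keyed by tuples of tokens, so that compound forms
# such as ("avais", "eu") can be matched as a single unit.
FORMS = {
    ("avais", "eu"): "avoir",
    ("avais",): "avoir",
    ("ai",): "avoir",
    ("petite",): "petit",
    ("petites",): "petit",
}
MAX_LEN = max(len(key) for key in FORMS)


def lemmatize_tokens(tokens: list[str]) -> list[str]:
    """Greedy longest-match lemmatization of a token sequence."""
    lemmas = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + n])
            if span in FORMS:
                lemmas.append(FORMS[span])
                i += n
                break
        else:
            # No known form starts here: keep the token, lowercased.
            lemmas.append(tokens[i].lower())
            i += 1
    return lemmas
```

With this sketch, `lemmatize_tokens(["avais", "eu"])` and `lemmatize_tokens(["ai"])` both yield the single lemma "avoir", so the two forms can be compared as the same term.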

See also

External links

  • Lemmatization of verb forms
  • Lemmatization algorithm for the French language (in English)

© 2007-2008 speedlook.com; article text available under the terms of GFDL, from fr.wikipedia.org