Lemmatization is the function that maps a word to its canonical form.
General information
The words (lemmas) of a language take several forms according to their gender (masculine or feminine), their number (singular or plural), their person (first, second, third) and their mood (indicative, imperative, etc.), so that a single lemma gives rise to several forms.
Lemmatizing a form of a word means replacing it with its canonical form, which is defined as follows:
- for a verb: the verb in the infinitive,
- for other words: the word in the masculine singular.
It follows that all dictionary entries are lemmatized, as are the titles of Wikipédia articles (at least those consisting of a single lemma).
Examples
The French adjective petit ("small") exists in four forms: petit, petite, petits and petites. The canonical form of all these words is petit.
The verb to have has many more forms: have, has, had, had had, would have had, etc. The canonical form of had had is to have.
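The examples in this section can be sketched as a simple lookup table in Python. This is only an illustration under stated assumptions, not a real lemmatizer: the names `LEMMAS` and `lemmatize` are hypothetical, the adjective is assumed to be the French petit with its four forms, and the verb lemma is written as the bare infinitive `have`.

```python
# Hypothetical lookup table: each inflected form maps to its lemma.
# Only the forms used as examples in this section are listed.
LEMMAS = {
    "petit": "petit",
    "petite": "petit",
    "petits": "petit",
    "petites": "petit",
    "have": "have",
    "has": "have",
    "had": "have",
}

def lemmatize(form: str) -> str:
    """Return the canonical form of a word, or the lowercased word if unknown."""
    key = form.lower()
    return LEMMAS.get(key, key)

print(lemmatize("Petites"))  # petit
print(lemmatize("has"))      # have
```

A table like this only works for forms it already contains; unknown words are returned unchanged (in lowercase) rather than guessed.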
Use in computing
In computing, it is difficult for a program to know that had had and to have are two facets of the same term. Lemmatization is therefore a preliminary step in recognizing the words of a sentence.
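As a rough sketch of this preliminary step, the Python snippet below lemmatizes every word of a sentence before checking whether a term occurs in it. The table `FORM_TO_LEMMA` and the functions `lemmas_of` and `mentions` are hypothetical names introduced for illustration; a real system would rely on a full lexicon or a morphological analyzer instead of a hand-written table.

```python
import re

# Hypothetical form-to-lemma table covering only the words in this example.
FORM_TO_LEMMA = {
    "have": "have",
    "has": "have",
    "had": "have",
    "cat": "cat",
    "cats": "cat",
}

def lemmas_of(sentence: str) -> list:
    """Split a sentence into lowercase ASCII word tokens and map each to its lemma."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [FORM_TO_LEMMA.get(token, token) for token in tokens]

def mentions(sentence: str, term: str) -> bool:
    """Return True if some word of the sentence shares a lemma with `term`."""
    target = FORM_TO_LEMMA.get(term.lower(), term.lower())
    return target in lemmas_of(sentence)

print(lemmas_of("She had had two cats"))
# ['she', 'have', 'have', 'two', 'cat']
print(mentions("She had had two cats", "has"))  # True
```

Comparing lemmas instead of surface forms is what lets the program treat had and has as the same term. The tokenizer here handles only unaccented Latin letters, which is a simplification.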
See also
External links
- Lemmatization of verb forms
- Lemmatization algorithm for the French language (in English)