Corpus

A corpus is a collection of documents, artistic or not (texts, images, videos, etc.), gathered for a specific purpose. Corpora are used in several fields: literary studies, linguistics, the sciences, etc.

The corpus in linguistics

The branch of linguistics that deals specifically with corpora is, logically enough, called corpus linguistics. It is tied to the development of computing systems, in particular the building of textual databases.

The term corpus is also used to refer to the normative aspect of a language: its structure and its code in particular. "Corpus" is generally contrasted with "status", which corresponds to the conditions under which the language is used. This opposition is common in the study of language policy.

The corpus in literature

A corpus brings together a set of texts that share a common purpose.

The corpus in science

Corpora are essential and invaluable tools in natural language processing. They make it possible to extract a body of information that is useful for statistical processing.

From an informational point of view, they make it possible to extract trends and, in particular, to build sets of n-grams.
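As an illustration of building a set of n-grams from a corpus, here is a minimal sketch. The function names `ngrams` and `ngram_counts`, the whitespace tokenization and the two-sentence toy corpus are assumptions made for the example, not part of any particular toolkit.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(sentences, n):
    """Count n-grams over a corpus given as a list of sentence strings."""
    counts = Counter()
    for sentence in sentences:
        # Naive tokenization: lowercase, then split on whitespace.
        tokens = sentence.lower().split()
        counts.update(ngrams(tokens, n))
    return counts


# Toy corpus, invented for the example.
toy_corpus = [
    "the corpus is large",
    "the corpus is small",
]
bigrams = ngram_counts(toy_corpus, 2)
print(bigrams.most_common(2))
```

N-grams are counted sentence by sentence, so no n-gram spans a sentence boundary. A real corpus would also need a proper tokenizer.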

From a methodological point of view, they provide the objectivity required for scientific validation in natural language processing. Information no longer rests on intuition alone; it is checked against the corpus. Corpora can therefore be used to formulate and test scientific hypotheses, provided of course that they are well formed.

Well-formed corpus

Several characteristics must be taken into account when building a well-formed corpus:

  • size;
  • the language of the corpus;
  • the time covered by the texts of the corpus;
  • the register.

Size

The corpus must obviously reach a critical size to allow reliable statistical processing. It is impossible to extract reliable information from a corpus that is too small (see Statistics).

Language

A well-formed corpus must cover a single language, and a single variety of that language. There are, for example, subtle differences between the French of France and the French spoken in Belgium. Reliable conclusions about either the French of France or the French of Belgium therefore cannot be drawn from a mixed Franco-Belgian corpus.

Time covered by the texts of the corpus

Time plays a major role in the evolution of a language: the French spoken today does not resemble the French spoken 200 years ago nor, more subtly, the French spoken 10 years ago, notably because of neologisms. This phenomenon must be taken into account for all living languages. A corpus should therefore not contain texts written too far apart in time.

Register of language

Different registers should not be mixed either, and conclusions drawn from a corpus representing one register cannot be applied to another. A corpus built from scientific texts cannot be used to extract information about popular-science texts, and a corpus mixing scientific and popular-science texts will not support any conclusion about either register.
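The four criteria above (size, language, time covered, register) can be checked mechanically when each document carries metadata. The sketch below assumes hypothetical metadata fields (`language`, `year`, `register`) and hypothetical thresholds (`min_tokens`, `max_year_span`); the article gives no numeric values, so these are placeholders to be set per study.

```python
def corpus_issues(documents, min_tokens, max_year_span):
    """Return a list of reasons why a corpus may not be well formed.

    Each document is a dict with the hypothetical keys
    'text', 'language', 'year' and 'register'.
    """
    issues = []

    # Size: total number of whitespace-separated tokens.
    total_tokens = sum(len(doc["text"].split()) for doc in documents)
    if total_tokens < min_tokens:
        issues.append("corpus too small")

    # Language: a single language and a single variety of it.
    if len({doc["language"] for doc in documents}) > 1:
        issues.append("mixed language varieties")

    # Time: texts should not be written too far apart.
    years = [doc["year"] for doc in documents]
    if years and max(years) - min(years) > max_year_span:
        issues.append("time span too broad")

    # Register: a single register.
    if len({doc["register"] for doc in documents}) > 1:
        issues.append("mixed registers")

    return issues


# Toy documents, invented for the example.
docs = [
    {"text": "le corpus est petit", "language": "fr-FR",
     "year": 2001, "register": "scientific"},
    {"text": "un autre texte court", "language": "fr-BE",
     "year": 2005, "register": "scientific"},
]
issues = corpus_issues(docs, min_tokens=5, max_year_span=10)
print(issues)
```

An empty list means that none of the four checks failed. It does not guarantee that the corpus is large or homogeneous enough for a given study.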

Methodology

It would be methodologically unsound to apply statistical tests to the very corpus that was used to derive a classification or a model of the language.

When working with corpora, it is therefore advisable to split an initial corpus into two subcorpora:

  • the training corpus, which is used to derive a model or a classification from a sufficient amount of data;
  • the test corpus, which is used to check the quality of what was learned from the training corpus.

The relative sizes of the two corpora depend on the problem, but it is common to use two thirds of the initial corpus for training and the remaining third for testing.
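A two-thirds / one-third split can be sketched as follows. The function name `split_corpus`, the shuffling step and the fixed seed are assumptions made for the example; whether documents should be shuffled before splitting depends on the problem.

```python
import random


def split_corpus(documents, train_fraction=2 / 3, seed=0):
    """Split a list of documents into (training, test) subcorpora."""
    shuffled = list(documents)  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Nine placeholder documents, invented for the example.
docs = [f"doc{i}" for i in range(9)]
train, test = split_corpus(docs)
print(len(train), len(test))
```

The two returned lists never share a document, which is the property the methodology relies on.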

When the initial corpus is not large enough, the test and training corpora can be rotated over several experiments (cross-validation). For example, if the initial corpus is split into 10 subcorpora, numbered from 1 to 10:

  • Experiment 1: subcorpora 1 to 8 are used for training, and 9 and 10 for testing;
  • Experiment 2: subcorpora 1 to 6, 9 and 10 are used for training, and 7 and 8 for testing, and so on.
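The rotation described in the two experiments above can be generated programmatically. This sketch assumes that the pattern simply continues with test blocks 5 and 6, 3 and 4, then 1 and 2, which the text does not state explicitly; the name `rotating_splits` is hypothetical.

```python
def rotating_splits(subcorpora, test_size=2):
    """Yield (training, test) pairs, moving the test block from the
    end of the list toward the start, one block per experiment."""
    n = len(subcorpora)
    for end in range(n, 0, -test_size):
        start = max(end - test_size, 0)
        test = subcorpora[start:end]
        training = subcorpora[:start] + subcorpora[end:]
        yield training, test


parts = list(range(1, 11))  # subcorpora numbered from 1 to 10
experiments = list(rotating_splits(parts))
for number, (training, test) in enumerate(experiments, start=1):
    print(f"Experiment {number}: training={training} test={test}")
```

Each subcorpus is used for testing exactly once, and within any single experiment the training and test parts are disjoint.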

The measure of result quality (precision or recall) is then more reliable, but the training and test corpora must never be mixed within an experiment.
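Precision and recall can be computed per experiment and then averaged. The sketch below uses the usual set-based definitions (precision = correct answers / answers returned, recall = correct answers / answers expected); the function name and the two toy result sets are invented for the example.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for a set of predicted items
    against the set of items that were actually relevant."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical outputs of two experiments: (predicted, relevant).
runs = [
    ({"a", "b", "c", "d"}, {"a", "b", "e"}),
    ({"a", "e"}, {"a", "b", "e", "f"}),
]
scores = [precision_recall(p, r) for p, r in runs]
mean_precision = sum(p for p, _ in scores) / len(scores)
mean_recall = sum(r for _, r in scores) / len(scores)
print(mean_precision, mean_recall)
```

Averaging over several experiments smooths out the effect of any one particular choice of test subcorpus.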

Parallel corpora and comparable corpora

Parallel corpora

A parallel corpus is a set of pairs of texts in which, for each pair, one text is the translation of the other. It is useful to align such corpora, i.e. to match each unit of the source-language text with the corresponding unit of the target-language text (at the level of paragraphs, sentences and words), in order to obtain a bilingual data set, particularly in specialized fields where vocabulary and usage change quickly.

For example, as of October 26, 2006, the French and English versions of the articles "Decline of the Western Roman Empire" (French) and "Decline of the Roman Empire" (English) are parallel texts. The source text is the English version; the French version is the target, resulting from translation.

Although the texts are said to be parallel, translation creates structural differences between them. Some expressions may be translated with a different number of words. For example, “Theories about the decline and fall of the Roman Empire” is made up of 10 words, whereas its French translation “Théories sur le déclin de l'Empire romain” has only 7. Likewise, sentences in the source text may be merged in the translation or, conversely, split. Parallelism is thus never perfect, and alignment methods must take this into account.
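To show how an alignment method can cope with merged or split sentences, here is a deliberately simplified sketch. It assumes that a sentence and its translation have similar character lengths and allows only 1-1, 1-2 and 2-1 groupings, chosen by dynamic programming. This is an illustrative heuristic, not the method of any particular aligner; the function name `align` and the toy sentences are invented.

```python
def align(source, target):
    """Align two lists of sentences using 1-1, 1-2 and 2-1 groupings.

    The cost of a grouping is the absolute difference in character
    length between its two sides; the cheapest full alignment wins.
    """
    inf = float("inf")
    n, m = len(source), len(target)
    moves = [(1, 1), (1, 2), (2, 1)]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue  # this state cannot be reached
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in source[i:ni])
                tgt_len = sum(len(t) for t in target[j:nj])
                candidate = cost[i][j] + abs(src_len - tgt_len)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        raise ValueError("no alignment with the allowed groupings")
    # Walk the back-pointers from the end to recover the groupings.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((source[pi:i], target[pj:j]))
        i, j = pi, pj
    return pairs[::-1]


# Toy sentences, invented for the example: the first source
# sentence is split into two sentences in the target.
source = ["Rome declined slowly and then it fell.", "Scholars disagree."]
target = ["Rome déclina lentement.", "Puis elle tomba.",
          "Les savants sont en désaccord."]
for src_group, tgt_group in align(source, target):
    print(src_group, "<->", tgt_group)
```

Length alone is a weak signal. Practical aligners typically combine it with lexical cues, and word-level alignment requires different techniques altogether.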

Parallel corpora are, however, relatively rare. One example is the Hansard, the record of debates of the Canadian House of Commons, which is published in French and English.

Comparable corpora

Since corpus linguistics needs large data sets to work with, parallel corpora are certainly very valuable but too rare to suffice for all uses.

Comparable corpora are far more widespread. Déjean & Gaussier (2002) give the following definition of a comparable corpus:

Two corpora in two languages l_1 and l_2 are said to be comparable if there is a substantial subset of the vocabulary of the corpus in language l_1 (respectively l_2) whose translation is found in the corpus in language l_2 (respectively l_1).

A comparable corpus is thus made up of texts in different languages that share part of their vocabulary, which generally implies that the texts deal with the same subject, at the same period and in a comparable register. A selection of newspaper articles in various languages covering the same international news at the same time is a good example of a comparable corpus.
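The definition can be turned into a rough measure: the share of each corpus's vocabulary that has a translation in the other corpus. The sketch below assumes two hypothetical bilingual dictionaries mapping a word to a set of translations. The definition does not say how large the share must be to count as "substantial", so no threshold is applied.

```python
def translated_share(vocab_a, vocab_b, dictionary):
    """Fraction of vocab_a with at least one translation in vocab_b."""
    if not vocab_a:
        return 0.0
    covered = sum(
        1 for word in vocab_a if dictionary.get(word, set()) & vocab_b
    )
    return covered / len(vocab_a)


def comparability(vocab_1, vocab_2, dict_1_to_2, dict_2_to_1):
    """Return the translated share in both directions."""
    return (
        translated_share(vocab_1, vocab_2, dict_1_to_2),
        translated_share(vocab_2, vocab_1, dict_2_to_1),
    )


# Toy vocabularies and dictionaries, invented for the example.
vocab_en = {"empire", "decline", "war", "tax"}
vocab_fr = {"empire", "déclin", "guerre", "pain"}
en_fr = {"empire": {"empire"}, "decline": {"déclin", "chute"},
         "war": {"guerre"}, "tax": {"impôt"}}
fr_en = {"empire": {"empire"}, "déclin": {"decline"},
         "guerre": {"war"}, "pain": {"bread"}}
print(comparability(vocab_en, vocab_fr, en_fr, fr_en))
```

Both directions are reported because the definition is symmetric: the condition must hold for l_1 and, respectively, for l_2.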

Alignment can therefore no longer rely on the structure of the text (which need not be identical from one language to the other), and the approaches proposed instead try to take into account the context of each term to be aligned, i.e. the way it is used and the words with which it co-occurs in the text.
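One common way to represent the context of a term is a co-occurrence vector: for each word, count the words that appear near it. The sketch below builds such vectors within a fixed window and compares two of them by cosine similarity. It works on a single language only. For comparable corpora the context words would first have to be mapped across languages, for instance through a bilingual dictionary, which is not shown here. The window size and the toy sentences are assumptions.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within
    `window` positions of it, sentence by sentence."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy sentences, invented for the example.
sentences = ["the emperor ruled rome", "the king ruled france"]
vectors = context_vectors(sentences)
similarity = cosine(vectors["emperor"], vectors["king"])
print(round(similarity, 3))
```

Words used in similar contexts get similar vectors, which is the intuition behind aligning terms by the company they keep rather than by their position in the text.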

See also

  • Corpus also designates a group of insects.

Notes & references
