Information retrieval, abbreviated IH or IR, is the science of searching for information in documents (the documents themselves or the metadata that describe them) and in databases, whether relational or linked into a network by hypertext links as on the World Wide Web, the Internet, and intranets. It covers text, sound, images, and data. The Vocabulaire de la documentation (Paris, ADBS, 2004) distinguishes the search for information (recherche d'information) from the search for the information (recherche de l'information).
Information retrieval is a field historically linked to information science and library science, which have always been concerned with building representations of documents, through the construction of indexes, in order to retrieve information from them. Computer science has enabled the development of tools that process data and build the representation of documents at indexing time, as well as tools for searching. Today information retrieval can be described as a transdisciplinary field, one that can be studied from several disciplines whose approaches should help find ways to improve its effectiveness.
In the broad sense, information retrieval includes two aspects: the indexing of documents, and the search itself.
In a stricter sense, information retrieval could be reduced to the second aspect. However, the strong interdependence of the two, and the use of common techniques within economic intelligence policies, have favored the first meaning. This is attested by the existence of a major working group, SIGIR (Special Interest Group on Information Retrieval), within the international Association for Computing Machinery (ACM), and by the conference series that NIST organizes on the subject: TREC (Text REtrieval Conference, which also covers multimedia aspects).
With the appearance of the first computers came the idea of using machines to automate the search for information in libraries. The first systems were used by librarians and made it possible to carry out Boolean searches, that is, searches in which the presence or absence of a term in a document determines whether the document is selected. Such searching requires several intermediaries and, above all, considerable resources: a classification scheme must be created to describe the whole collection, and a set of keywords must be selected for each document.
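As a rough illustration of Boolean search over keyword descriptions, the sketch below builds a small inverted index and answers AND, OR, and NOT queries. The toy collection and the function names (`build_index`, `search_and`, `search_or`, `search_not`) are assumptions made for this example; they do not describe any particular historical system.

```python
from collections import defaultdict

# Toy collection (hypothetical): each document is described by a
# hand-picked set of keywords, as a librarian might have assigned them.
documents = {
    1: {"goat", "breeding", "farm"},
    2: {"goat", "cheese"},
    3: {"sheep", "breeding"},
}

def build_index(docs):
    """Map each keyword to the set of ids of the documents it describes."""
    index = defaultdict(set)
    for doc_id, keywords in docs.items():
        for keyword in keywords:
            index[keyword].add(doc_id)
    return index

def search_and(index, *terms):
    """Documents described by every one of the given terms."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Documents described by at least one of the given terms."""
    return set().union(*(index.get(term, set()) for term in terms))

def search_not(index, all_ids, term):
    """Documents that are not described by the given term."""
    return set(all_ids) - index.get(term, set())

index = build_index(documents)
print(search_and(index, "goat", "breeding"))  # {1}
```

A document is either selected or not; there is no notion of a better or worse match, which is the limitation that later models address.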
This description by keywords (indexing) requires the librarian to know enough to translate a question, which may be more or less precise, into a set of descriptors. Moreover, the set of descriptors is often neither sufficient nor precise enough to describe every document. As a result, for example because of synonymy, some documents that answer a user's question may not be found. Manual description is thus a slow process that does not guarantee good performance.
However, it is entirely possible to extract a set of descriptors directly from the text. The first experiments even showed that this approach is viable and competitive with manual indexing. The growing use of computer software, and the resulting availability of ever larger quantities of machine-readable text, then drove the rapid development of IR models. These two aspects, indexing and search, are at the heart of the problems addressed by IR. Both evolved very quickly from a Boolean model (in indexing, a term either represents the document or does not; in search, a document either answers the question or does not) to vector-space or probabilistic models.
In models based on a graded ("fuzzy") representation of documents and questions, the relevance of a document to a question is expressed as a score. Such a score no longer allows IR systems to be validated automatically. Indeed, for the question "the document must contain the words goat and breeding", a document containing the words "goat" and "breeding" is a good answer, unlike a document that does not contain them. When the question becomes "the document must be about goat breeding", a document that discusses the care of goats without using the word "breeding" is still a good answer, but it will receive a lower score than a document that speaks directly about goat breeding.
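To make the idea of a graded score concrete, here is a minimal sketch of one common vector-space scheme: TF-IDF weights compared by cosine similarity. The text does not say which weighting the early systems used, so this choice, the whitespace tokenizer, and the sample texts are all assumptions for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Naive tokenizer (assumption): lowercase, split on whitespace."""
    return text.lower().split()

def tf_idf_vectors(texts):
    """Return one sparse TF-IDF vector per text, plus the idf table."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    # Document frequency: in how many texts each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, texts):
    """Return (score, text position) pairs, best score first."""
    vectors, idf = tf_idf_vectors(texts)
    query_vec = {
        term: count * idf[term]
        for term, count in Counter(tokenize(query)).items()
        if term in idf  # terms unknown to the collection are ignored
    }
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, reverse=True)

texts = [
    "goat breeding on a mountain farm",
    "care of the goat herd",
    "sheep shearing season",
]
ranking = rank("goat breeding", texts)
```

Here the text that mentions both query terms ranks first, the text about goat care gets a lower but non-zero score, and the unrelated text scores zero: documents answer the question more or less well rather than simply matching or not.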
It is thus impossible to prove that an IR system is effective, since the score blurs the notion of a good answer: a document answers a question more or less well. The concept of the relevance of a document to a question therefore emerged at the same time as the first IR systems, along with the first measures for comparing the results returned by different systems. The first measures, still widely used today, are precision and recall. An IR system is very precise if almost all the returned documents are relevant. An IR system has high recall if it returns most of the documents in the corpus that are relevant to a question. In general, the more precise an IR system is, the lower its recall, and vice versa.
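The two measures can be computed directly from the set of returned documents and the set of relevant ones. The sketch below uses their standard definitions; the function name `precision_recall` and the document ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2, 3, 4}  # hypothetical relevance judgments for one question

# A cautious system returns few documents, all of them relevant.
narrow = precision_recall([1, 2], relevant)                    # (1.0, 0.5)

# A generous system returns everything relevant, plus noise.
broad = precision_recall([1, 2, 3, 4, 5, 6, 7, 8], relevant)   # (0.5, 1.0)
```

The two calls illustrate the usual trade-off: returning more documents tends to raise recall and lower precision.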
Related problems very quickly grew up around IR. Among the most common and most useful, interaction with the user makes it possible to obtain progressively more relevant documents. Some researchers then tried to simulate this interaction, or at least part of it, by proposing techniques for "enriching" the question, for example by adding terms that were not in the original question. This technique is known as query expansion.
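A very simple form of query expansion adds related terms taken from a lookup table. In the sketch below, the table `RELATED_TERMS` is a hand-written, hypothetical stand-in for whatever resource (a thesaurus, co-occurrence statistics, user feedback) a real system would draw on.

```python
# Hypothetical table of related terms, written by hand for this example.
RELATED_TERMS = {
    "goat": ["caprine"],
    "breeding": ["husbandry", "rearing"],
}

def expand_query(terms, related=RELATED_TERMS):
    """Return the original terms followed by related terms not already present."""
    expanded = list(terms)
    for term in terms:
        for extra in related.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded

print(expand_query(["goat", "breeding"]))
# ['goat', 'breeding', 'caprine', 'husbandry', 'rearing']
```

The expanded question can then match documents that use a synonym of the original term, at the cost of possibly retrieving less relevant documents as well.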
Beyond information retrieval proper, the field has moved toward neighboring tasks, such as clustering, which groups together documents with similar topics, and categorization, whose purpose is to assign documents to a set of predefined categories. Then, as the notions of document and unit of information became fuzzier, the tasks of automatic information extraction and summarization appeared. Today the field brings together several research topics and keeps evolving with the appearance of new types of corpora, documents, and user needs. The TREC and SIGIR conferences give an overview of the diversity of research under way in the general field of IR.
The first step in information retrieval is to set up the techniques for turning a textual document into a representation that an IR model can exploit. This transformation corresponds to the indexing of documents and is divided into two distinct stages:
A set of descriptors must be extracted from the text. Most of the time these are the terms that appear in the document (after removal of grammatical words, for example), often in transformed form (lemmatization, etc.).
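The extraction of descriptors can be sketched as follows. The stop-word list is a tiny hand-picked sample, and `crude_stem` is a deliberately rough suffix stripper standing in for real lemmatization; both are assumptions for illustration only.

```python
import re

# Small sample of English grammatical words (assumption, far from complete).
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "on", "to", "is", "are"}

def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for lemmatization."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_descriptors(text):
    """Lowercase, keep alphabetic words, drop stop words, normalize the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(word) for word in words if word not in STOP_WORDS]

print(extract_descriptors("The breeding of goats in the mountains"))
# ['breed', 'goat', 'mountain']
```

A real system would use a proper lemmatizer or stemmer for the language of the corpus; the point here is only the order of operations: tokenize, filter grammatical words, normalize.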
Finally, it is possible to use models that interact with the user in order to gradually improve the answers of the IR system during a session, with the user indicating each time which documents are relevant to the question. These indications can also be used to improve the overall operation of the IR system.
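One classical way to use such indications, not named in the text and shown here only as an assumption, is a Rocchio-style update: the question vector is moved toward the average of the documents the user marked as relevant. The weights `alpha` and `beta` and the sample vectors are illustrative values.

```python
from collections import defaultdict

def feedback_update(query_vec, relevant_vecs, alpha=1.0, beta=0.75):
    """Move the query vector toward the centroid of documents marked relevant.

    Simplified Rocchio-style update (assumption): the negative term for
    non-relevant documents is left out.
    """
    updated = defaultdict(float)
    for term, weight in query_vec.items():
        updated[term] += alpha * weight
    for vec in relevant_vecs:
        for term, weight in vec.items():
            updated[term] += beta * weight / len(relevant_vecs)
    return dict(updated)

query = {"goat": 1.0, "breeding": 1.0}
marked_relevant = [
    {"goat": 1.0, "care": 1.0},
    {"goat": 1.0, "herd": 1.0},
]
new_query = feedback_update(query, marked_relevant)
```

After the update, terms such as "care" and "herd" carry some weight in the question even though the user never typed them, so the next round of results can include documents that the original question missed.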
Historically, information retrieval in libraries was carried out with the Z39.50 protocol, which was maintained by the Library of Congress. This work continues with the SRW (Search/Retrieve Web Service) and SRU (Search/Retrieve via URL) protocols.
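As a hedged illustration of what "search via URL" means, the sketch below assembles an SRU searchRetrieve request as a plain URL. The base address `https://sru.example.org/catalog` is a placeholder, not a real server, and the query string is a simple CQL expression chosen for the example; the version number is likewise an assumption about what a given server supports.

```python
from urllib.parse import urlencode

def build_sru_url(base_url, cql_query, max_records=10):
    """Build an SRU searchRetrieve URL; base_url is a hypothetical endpoint."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",            # assumed version; servers may differ
        "query": cql_query,          # query expressed in CQL
        "maximumRecords": max_records,
    }
    return base_url + "?" + urlencode(params)

url = build_sru_url("https://sru.example.org/catalog", 'title = "goat"')
print(url)
```

The whole request fits in the URL, so it can be issued by any HTTP client; the server's response is an XML document whose structure is not covered here.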