Web crawler
See also: Spider
A web crawler (also called a web spider or indexing robot) is a program that explores the Web automatically. It is generally designed to collect resources (web pages, images, videos, Word, PDF or PostScript documents, etc.) so that a search engine can index them.
Working on the same principle, some robots are used to archive resources or to harvest email addresses for sending spam.
Indexing principles
To index new resources, a crawler recursively follows the hyperlinks found starting from a seed page. Afterwards, it is useful to store the URL of each retrieved resource and to adapt the visit frequency to the observed update frequency of the resource. However, many resources escape this recursive exploration, because they can be reached only through hyperlinks created on demand, which a robot cannot follow. This unexplored set of resources is sometimes called the deep Web.
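The recursive exploration described above can be sketched as a breadth-first traversal. This is only a minimal illustration, not how any particular crawler is implemented: the `crawl` function, the `LinkExtractor` class, and the in-memory `PAGES` site on the `example.test` host are all hypothetical names introduced here, and a real crawler would fetch pages over HTTP instead of reading a dictionary.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`.

    `fetch(url)` is an assumed callable returning the page HTML,
    or None when the page cannot be retrieved.
    """
    seen = {seed}          # URLs already queued, to avoid loops
    queue = deque([seed])  # URLs waiting to be downloaded
    visited = []           # URLs actually retrieved, in order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical site held in memory. The "results" page is not linked
# from anywhere, like a page generated on demand by a search form.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/a#top">A</a>',
    "http://example.test/results?q=x": "<p>generated on demand</p>",
}

order = crawl("http://example.test/", PAGES.get)
```

In this sketch the unlinked `results?q=x` page is never reached, which illustrates why on-demand content stays outside the explored set.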
An exclusion file (robots.txt) placed at the root of a website gives robots a list of resources to ignore. This convention reduces the load on the web server and avoids crawling resources of no interest. However, some robots disregard this file.
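As an illustration, a well-behaved robot could check the exclusion file before each download. The sketch below uses Python's standard `urllib.robotparser` module. The file content, the `example.test` host, and the `ExampleBot` user agent are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as it might be served at
# http://example.test/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
# parse() accepts the lines of the file; a live crawler would
# instead call set_url(...) followed by read().
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="ExampleBot"):
    """Return True if the exclusion file lets `user_agent` fetch `url`."""
    return parser.can_fetch(user_agent, url)


public_ok = allowed("http://example.test/index.html")
private_ok = allowed("http://example.test/private/report.html")
```

Nothing enforces this check on the server side: the file only states a request, which is why robots that do not consult it can still reach the listed resources.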
Two characteristics of the Web complicate the crawler's work: the large volume of data and bandwidth. A very large number of pages are added, modified and removed each day. While storage capacity and processor speed have increased quickly, bandwidth has not progressed at the same rate. The problem is therefore to process an ever-growing volume of information with limited throughput, so the crawler must prioritize its downloads.
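One possible way to prioritize downloads is to keep the pending URLs in a priority queue and spend the limited download budget on the highest-scored ones first. The sketch below is an assumption-laden illustration: the `Frontier` class, the scores, and the URLs are hypothetical, and how a score is computed (link popularity, freshness, and so on) is left open.

```python
import heapq
import itertools


class Frontier:
    """Pending URLs ordered by descending score (hypothetical helper)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: insertion order
        self._known = set()

    def add(self, url, score):
        """Queue `url` once; later duplicates are ignored."""
        if url in self._known:
            return
        self._known.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        """Return the pending URL with the highest score."""
        _neg_score, _order, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


frontier = Frontier()
frontier.add("http://example.test/news", 0.9)
frontier.add("http://example.test/about", 0.2)
frontier.add("http://example.test/home", 0.9)
frontier.add("http://example.test/news", 0.1)  # duplicate, ignored

# With a budget of two downloads, only the best-scored URLs are taken.
budget = 2
chosen = [frontier.pop() for _ in range(min(budget, len(frontier)))]
```

Here the two URLs scored 0.9 are selected in insertion order, and the low-scored page waits for a later pass.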
The behavior of a crawler results from the combination of the following policies:
- a selection policy, which determines which pages to download;
- a revisit policy, which determines when to check pages for changes;
- a politeness policy, which determines how to avoid overloading websites;
- a parallelization policy, which determines how to coordinate distributed crawlers.
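The revisit, politeness and parallelization policies listed above can each be sketched in a few lines. These are simplified assumptions for illustration only: the halving and doubling rule, the one-second delay, the hash-based assignment, and all names (`next_interval`, `Politeness`, `worker_for`) are hypothetical choices, not the behavior of any specific crawler.

```python
import hashlib
from urllib.parse import urlsplit


def next_interval(interval, changed, low=3600.0, high=30 * 86400.0):
    """Revisit policy: check more often when a page changed, less
    often when it did not, within [low, high] seconds."""
    interval = interval / 2 if changed else interval * 2
    return max(low, min(high, interval))


class Politeness:
    """Politeness policy: space out requests to the same host."""

    def __init__(self, delay=1.0):
        self.delay = delay        # minimum seconds between two requests
        self._next_allowed = {}   # host -> earliest time of next request

    def wait_time(self, url, now):
        """Reserve a slot for `url` and return how long to wait."""
        host = urlsplit(url).netloc
        start = max(self._next_allowed.get(host, now), now)
        self._next_allowed[host] = start + self.delay
        return start - now


def worker_for(url, n_workers):
    """Parallelization policy: send every URL of a given host to the
    same worker, using a stable hash of the host name."""
    host = urlsplit(url).netloc
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_workers


polite = Politeness(delay=1.0)
first_wait = polite.wait_time("http://a.test/page1", now=0.0)
second_wait = polite.wait_time("http://a.test/page2", now=0.0)
other_host_wait = polite.wait_time("http://b.test/", now=0.0)
```

Assigning hosts rather than individual URLs to workers keeps the politeness bookkeeping local to one worker, at the cost of uneven load when a few hosts dominate.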
Robots
Free software robots
- GNU Wget is a free command-line program, written in C, that automates transfers to an HTTP client.
- Heritrix is the archiving crawler of the Internet Archive. It is written in Java.
- HTTrack is a website copier that creates mirrors of websites for offline use. It is distributed under the GPL license.
- Nutch is a crawler written in Java and published under the Apache License. It can be used with the Lucene project of the Apache Foundation.
Proprietary robots
- Scooter, from AltaVista;
- MSNBot, from MSN;
- Slurp, from Yahoo!;
- KB Crawl, from BEA-Council;
- OmniExplorer_Bot, from OmniExplorer;
- TwengaBot, from Twenga.
See also
External links
- Introduction to natural referencing: article on Prohibited Web
- Encyclopedia of robots (annuaire-info.com): recent information on more than 100 web robots (user agent, IP addresses, origin, ...)