Robots.txt
July 19, 2006
Robots.txt is a file that resides in a website's top-level directory, and its job is to control search engine access. It implements one of the oldest and most widely accepted conventions on the internet, formally known as the Robots Exclusion Standard, which has been intended from its early days to keep web spiders, web crawlers, worms, web ants and other web robots from accessing all or part of a website.
This file is used not only to keep web pages and directories from being indexed by search engines, but also to prevent the indexing of particular types of files, such as images, or to discourage bots used to grab email addresses and other sensitive information from sites that manage members through databases, although malicious bots can simply ignore the file. Robots.txt must be placed in the website's root directory; robots do not look for it in sub-directories, so a copy placed anywhere else has no effect.
If your site has already been indexed and you want to prevent further indexing of certain areas, a robots.txt file can do the trick the next time the robots crawl it. Search engines update their indexes with the information gathered by robots and spiders, so the excluded content should drop out of the next update. Sometimes, however, it is necessary to contact the search engines and request the exclusion or removal manually, since older copies of crawled files may remain in their indexes for months before they disappear.
Robots, spiders, agents, crawlers, worms and ants all do the same sort of thing, with slightly different connotations, and any of them that adheres to the Robots Exclusion Standard will obey the instructions given in a robots.txt file. This is a simple text file with no formatting, just basic information that robots understand. Every time a compliant web crawler visits your site, the first thing it looks for is the robots.txt file.
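As a rough sketch of that first lookup, a crawler can work out where the file lives from any page address on the site. The function name robots_url and the example.com address below are hypothetical, chosen only for illustration; real crawlers may handle ports, redirects and caching differently.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt address for the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path with /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address, used only as an example.
print(robots_url("http://www.example.com/members/list.html?page=2"))
# prints http://www.example.com/robots.txt
```

Whatever page the crawler starts from, the lookup always resolves to the root of the host.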
If the spider does not find a robots.txt file, it indexes your web site, unless it finds other preventive measures such as HTML meta robots tags or .htaccess rules. The robots.txt file is basically made up of the names of the directories or files to be excluded, with the instructions given in the form of fields: “User-agent” specifies which spiders the rules that follow apply to, and “Allow/Disallow” specifies which directories may or may not be accessed, using a simple substring comparison against the beginning of the URL path, explained in full at the http://www.robotstxt.org website.
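To make the fields concrete, here is a minimal sketch that feeds a small robots.txt to Python's standard urllib.robotparser module and asks which paths a robot may fetch. The rules, the robot names BadBot and GoodBot, the example.com host and the helper allowed are all made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: one robot is shut out completely,
# every other robot is kept out of two directories.
RULES = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /members/
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Return True if the robot named agent may fetch path on the example host."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(allowed("GoodBot", "/index.html"))         # True
print(allowed("GoodBot", "/members/list.html"))  # False
print(allowed("GoodBot", "/images/logo.gif"))    # False
print(allowed("BadBot", "/index.html"))          # False
```

Each Disallow value is compared against the start of the requested path, which is why /members/list.html is blocked while /index.html is not.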


