Robots Exclusion Protocol
When a compliant Web Robot visits a site, it first checks for a
"/robots.txt" URL on the site. If this URL exists, the Robot parses its contents
for directives that instruct the robot not to visit certain parts
of the site. The site administrator can write whatever directives make sense for that site.
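The check described above can be sketched with Python's standard urllib.robotparser module. This is only an illustration, not the code of any particular robot: the robots.txt contents, the robot name "ExampleBot" and the www.example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for an example site (an assumption,
# not taken from any real server).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical robot name; the "*" record applies to it.
allowed = rules.can_fetch("ExampleBot", "http://www.example.com/index.html")
blocked = rules.can_fetch("ExampleBot", "http://www.example.com/private/data.html")

print(allowed)  # True: no Disallow line matches /index.html
print(blocked)  # False: the path starts with /private/
```

A real robot would fetch the file over HTTP first (RobotFileParser also offers set_url() and read() for that); the text is parsed from a string here so the sketch runs without network access.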
Sometimes people find that their pages have been indexed by an indexing robot,
or that a resource discovery robot has visited part of a site that
for some reason shouldn't be visited by robots. In recognition of
this problem, many Web Robots offer facilities for Web site
administrators and content providers to limit what the robot does.
This is achieved through two mechanisms:
1. The Robots Exclusion Protocol: A Web site administrator can
indicate which parts of the site should not be visited by a robot,
by providing a specially formatted file.
2. The Robots META tag: A Web author can indicate if a page may or
may not be indexed, or analyzed for links, through the use of a
special HTML META tag.
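As a rough sketch of the second mechanism, the following Python code shows how a robot might read a Robots META tag from a page. The values NOINDEX, NOFOLLOW and NONE are the commonly used content values and are not spelled out in the text above; the class name and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects indexing hints from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.index = True   # may the page be indexed?
        self.follow = True  # may the page be analyzed for links?

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated, case-insensitive list.
        for token in attr.get("content", "").lower().split(","):
            token = token.strip()
            if token in ("noindex", "none"):
                self.index = False
            if token in ("nofollow", "none"):
                self.follow = False


# Hypothetical page that asks robots neither to index it nor follow its links.
PAGE = """<html><head>
<meta name="robots" content="NOINDEX, NOFOLLOW">
<title>Draft</title>
</head><body><a href="next.html">next</a></body></html>"""

meta = RobotsMetaParser()
meta.feed(PAGE)
print(meta.index, meta.follow)  # False False
```

A page without such a tag leaves both flags at their permissive defaults, which matches the usual behaviour of treating a missing tag as permission to index and follow.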
There can be only a single "/robots.txt" on a site, at the top level of the server. Specifically,
"robots.txt" files should not be placed in user
directories, because a robot will never look at them. To use the
Robots Exclusion Protocol, one has to liaise with the
server administrator and ask for the rules to be added to "/robots.txt", using the Web Server Administrator's Guide
to the Robots Exclusion Protocol as a reference. If the administrator is
unwilling to install or modify "robots.txt" rules, and the need
is only to prevent indexing by indexing robots such as WebCrawler and
Lycos, one can add a Robots META tag to every page that should not be
indexed. Note that this functionality is not implemented by all
indexing robots.
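To illustrate why a "robots.txt" file in a user directory is never consulted, here is a minimal sketch of how a robot could derive the one URL it checks from any page address. The function name and the www.example.com address are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the single robots.txt URL that applies to page_url.

    Only the scheme and host (with port, if any) are kept; the page's own
    path, query and fragment are discarded, so a file such as
    /~alice/robots.txt is never requested.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page inside a user directory.
print(robots_txt_url("http://www.example.com/~alice/notes/page.html"))
# http://www.example.com/robots.txt
```

Whatever page the robot starts from, the result is the same top-level file, which is why the rules for a user's pages have to be added there by the server administrator.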