Blocking Robots
Which robots to block is a decision each site owner must make.
Typical targets are offline downloaders, programs designed to
spider a site and download its files on behalf of real people.
Some specialize in stealing images; others are general-purpose
spiders. Fighting robots is a never-ending battle with no winners,
only casualties. One can never stop all abusive behavior from all
automated robots and rude programs, but one can minimize its
effects and reduce the abuse to acceptable
levels. Before a robot can be blocked, either the IP address it
originates from or the user-agent string it sends must be known.
A good deal of the hits listed in your logs are likely to come
from the many programs - commonly known as robots, bots,
crawlers and spiders - that automatically trawl the web for a variety of
purposes, including:
• Indexing your site (e.g. Googlebot, Inktomi Slurp)
• Gathering statistics (e.g. WebWatch)
• Site maintenance and validation (e.g. LinkWalker,
W3CLinkChecker)
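
The IP address and user-agent string of a robot can usually be read
from the server's access log. The following is a minimal Python
sketch of that step, not a definitive tool. It assumes the log uses
the Apache "combined" format; the function name tally_clients, the
robot name ExampleGrabber and the sample entries are hypothetical.

```python
import re
from collections import Counter

# Assumption: lines follow the Apache "combined" log format, in which the
# client IP is the first field and the user-agent is the last quoted field.
# User-agent strings containing escaped quotes are not handled.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .* "(?P<agent>[^"]*)"\s*$')


def tally_clients(lines):
    """Count hits per (ip, user-agent) pair, skipping unparseable lines."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            hits[(match.group("ip"), match.group("agent"))] += 1
    return hits


# Hypothetical sample entries; the addresses come from documentation ranges.
sample = [
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /a.jpg HTTP/1.1" 200 2326 "-" "ExampleGrabber/1.0"',
    '192.0.2.10 - - [10/Oct/2023:13:55:37 +0000] "GET /b.jpg HTTP/1.1" 200 1830 "-" "ExampleGrabber/1.0"',
    '198.51.100.7 - - [10/Oct/2023:13:56:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

# The busiest (ip, user-agent) pairs are the first candidates to inspect.
for (ip, agent), count in tally_clients(sample).most_common():
    print(count, ip, agent)
```

A high hit count alone does not prove abuse; the pairs at the top of
the list are simply the ones worth checking by hand first.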
Here are four techniques for blocking unwelcome robots from
accessing the site:
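1. robots.txt: Create a text file called robots.txt in the root
directory of your site and list in it the robots and paths to be
excluded. Compliance is voluntary, so this works only against
well-behaved robots.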
2. Blocking IP addresses using <Limit> (Apache only): If the IP
address of the robot accessing the site is known, and the site
runs on the Apache web server, the <Limit> directive provides a
convenient and effective method of blocking access.
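3. Blocking user agents using SetEnvIfNoCase and <Limit> (Apache
only): This method uses the user-agent string to restrict access.
SetEnvIfNoCase sets an environment variable when the user-agent
string matches a pattern, ignoring case, and access is then denied
to requests that carry that variable.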
4. Blocking IP addresses or user agents using PHP: Here access is
blocked from within the web content itself. This method is useful
when one wants to block a bot from very specific content, or to
serve alternative content based on the user agent or IP address of
the bot.
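
The logic behind these techniques can be sketched in a few lines.
The example below is written in Python rather than Apache
configuration or PHP, so it illustrates the decisions involved, not
the server setup itself. The first part holds a sample robots.txt and
checks it with the standard library's robots.txt parser; the second
part is a hypothetical helper, is_blocked, that matches a request
against blocked IP ranges and case-insensitive user-agent fragments,
in the spirit of techniques 2 to 4. The robot names and addresses are
invented for illustration.

```python
import ipaddress
from urllib.robotparser import RobotFileParser

# Technique 1: a sample robots.txt. "ExampleGrabber" is a hypothetical
# robot name; every other robot is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleGrabber
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# What a robot that honors robots.txt would conclude for itself.
grabber_allowed = parser.can_fetch("ExampleGrabber", "/index.html")
other_allowed = parser.can_fetch("Googlebot", "/index.html")
other_private = parser.can_fetch("Googlebot", "/private/report.html")

# Techniques 2 to 4: enforced blocking by IP address or user-agent.
# Assumption: both lists are illustrative (documentation address range,
# made-up agent names), not a recommended blocklist.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
BLOCKED_AGENT_FRAGMENTS = ["examplegrabber", "imagethief"]


def is_blocked(ip, user_agent):
    """Return True if the request should be refused."""
    address = ipaddress.ip_address(ip)
    if any(address in network for network in BLOCKED_NETWORKS):
        return True
    # Case-insensitive substring match on the user-agent string.
    agent = user_agent.lower()
    return any(fragment in agent for fragment in BLOCKED_AGENT_FRAGMENTS)
```

A request handler could call is_blocked with the client address and
user-agent header and then either refuse the request or serve
alternative content. Keep in mind that a user-agent string is supplied
by the client and can be forged, so matching on it stops only robots
that identify themselves honestly.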