Working of Search Engine Robots
A search engine robot, also called a web crawler (or Web spider or
Web robot), is a program or automated script that browses the World
Wide Web in a methodical, automated manner. Other, less frequently
used names for Web crawlers are ants, automatic indexers, bots,
and worms (Kobayashi and Takeda, 2000). They are the seekers of
web pages.
Many legitimate sites, in particular search engines, use spidering
as a means of providing up-to-date data. Robots are mainly used to
create a copy of all the visited pages for later processing by a
search engine, which indexes the downloaded pages to provide
fast searches. Robots can also be used to gather specific types of
information from Web pages, such as harvesting e-mail addresses
(usually for spam).
Search engine robots have only basic functionality; there are
certain things they simply cannot do. Robots don't understand
frames, Flash movies, images, or JavaScript. They can't enter
password-protected areas, and they can't click the buttons on your
website. They can be stopped cold while indexing a dynamically
generated URL and slowed to a stop by JavaScript navigation.
On arriving at a website, an automated robot first checks whether a
robots.txt file is available. This file tells robots which areas of
the site are off-limits to them. Robots collect links from each
page visited and later follow those links through to other pages.
In this way, they move from one page to another. The entire World
Wide Web is made up of links, the original idea being that you
could follow links from one place to another. This is how robots
get around.
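
The robots.txt check and link-following loop described above can
be sketched in Python. This is only an illustration under stated
assumptions: the site, its robots.txt rules, the page contents, and
the "ExampleBot" user agent are all made up, and the pages live in
an in-memory dictionary so that nothing is fetched over the network.
A real robot would also need politeness delays, error handling, and
much more.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and pages, kept in memory for the sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""
PAGES = {
    "http://example.com/":
        '<a href="/about.html">About</a> <a href="/private/x.html">X</a>',
    "http://example.com/about.html":
        '<a href="/">Home</a> <a href="contact.html">Contact</a>',
    "http://example.com/contact.html": "<p>No links here.</p>",
    "http://example.com/private/x.html": "<p>Off-limits.</p>",
}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, user_agent="ExampleBot"):
    """Return the URLs visited, in breadth-first order."""
    # Read the robots.txt rules before requesting any page.
    rules = RobotFileParser()
    rules.parse(ROBOTS_TXT.splitlines())

    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if not rules.can_fetch(user_agent, url):
            continue  # this area is off-limits to robots
        html = PAGES.get(url)
        if html is None:
            continue  # page does not exist in the sample site
        visited.append(url)
        # Collect the links on this page and queue them for later.
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Calling crawl("http://example.com/") walks the home, about, and
contact pages and never requests anything under /private/, because
the sample robots.txt disallows that path.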
When a search engine robot visits a page, it looks at the visible
text on the page, the content of the various tags in the page's
source code (title tag, meta tags, etc.), and the hyperlinks on
the page. From the words and the links that the robot finds, the
search engine decides what the page is about. Depending on how the
search engine has configured the robot, the information is
indexed and then delivered to the search engine's database.
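
As a rough illustration of what a robot reads on a page, the
following Python sketch pulls out the title tag, named meta tags,
hyperlinks, and visible text using the standard html.parser module.
The class name PageSummary, the function summarize, and the sample
page are hypothetical; actual search engines extract far more than
this.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Gathers the title, meta tags, links, and visible text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            # Script and style contents are not visible text, so skip them.
            self.text_parts.append(data.strip())


def summarize(html):
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "meta": parser.meta,
        "links": parser.links,
        "text": " ".join(parser.text_parts),
    }


# Made-up sample page for the sketch.
SAMPLE_PAGE = """<html><head><title>Widgets</title>
<meta name="description" content="All about widgets">
<script>var x = 1;</script></head>
<body><h1>Blue widgets</h1>
<p>Buy <a href="/shop.html">here</a>.</p></body></html>"""
```

Running summarize(SAMPLE_PAGE) returns a dictionary with the title,
the description meta tag, the single link, and the visible text,
while the script content is ignored.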
The information delivered to the database then becomes part of
the search engine and directory ranking process. When a search
engine visitor submits a query, the search engine digs through the
database to produce the final listing displayed on the results
page.
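
One common way to make such lookups fast is an inverted index, which
maps each word to the pages that contain it. The sketch below is a
minimal Python illustration with made-up page texts; the function
names are hypothetical, and it deliberately leaves out ranking, which
real search engines apply on top of the lookup.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs containing every query word, sorted by URL."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


# Made-up page texts standing in for what the robot delivered.
SAMPLE_PAGES = {
    "http://example.com/a": "Blue widgets for sale",
    "http://example.com/b": "Red widgets",
    "http://example.com/c": "Blue sky",
}
```

With index = build_index(SAMPLE_PAGES), the query "blue widgets"
matches only the first page, because it is the only one containing
both words.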
You can see which pages the search engine robots have visited on
your site by looking at the server logs or the results from a log
statistics program. Some robots are readily identifiable by their
user agent names, like Google's "Googlebot"; others are a bit more
obscure, like Inktomi's "Slurp".
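
A short Python sketch of this kind of log inspection follows. It
assumes the server writes the widely used combined log format, with
the user agent as the last quoted field; the sample log lines, the
regular expression, and the function name robot_visits are made up
for illustration, and the list of robot names covers only the two
mentioned above.

```python
import re
from collections import defaultdict

# Substrings of robot user agent names to look for; extend as needed.
KNOWN_ROBOTS = ("Googlebot", "Slurp")

# Assumed combined log format:
#   ... "METHOD path PROTOCOL" status size "referer" "user agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def robot_visits(log_lines):
    """Return {robot name: [paths requested]} for recognised robots."""
    visits = defaultdict(list)
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # line is not in the assumed format
        agent = match["agent"].lower()
        for robot in KNOWN_ROBOTS:
            if robot.lower() in agent:
                visits[robot].append(match["path"])
                break
    return dict(visits)


# Made-up log lines for the sketch (documentation-range IP addresses).
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2005:13:55:36 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.7 - - [10/Oct/2005:13:56:02 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "ExampleBrowser/1.0"',
    '192.0.2.9 - - [10/Oct/2005:14:01:10 -0700] "GET /index.html HTTP/1.0" '
    '200 2326 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp)"',
    '192.0.2.1 - - [10/Oct/2005:14:02:44 -0700] "GET /about.html HTTP/1.1" '
    '200 1045 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]
```

Calling robot_visits(SAMPLE_LOG) groups the requested paths by robot
and ignores the ordinary browser request. Note that a user agent
string can be forged, so a name match alone does not prove that a
request came from the robot it claims to be.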