The robots.txt File
The robots.txt file is commonly used to keep a search engine from
listing a particular resource in its index. Most search engines
treat a resource that robots.txt disallows as one they should not
add to their index, and if it is already in the index (placed
there by earlier spidering activity), they remove it.
Search engine robots have only basic functionality, comparable to
that of early browsers, in what they can understand in a web page.
Robots don't understand frames, Flash movies, images or JavaScript.
They can't enter password-protected areas, and they can't click the
buttons on your website. They can be stopped cold while indexing a
dynamically generated URL and slowed to a stop by JavaScript
navigation.
On arriving at a website, automated robots first check for a
robots.txt file. This file tells robots which areas of your site
are off-limits to them. Robots collect links from each page they
visit and later follow those links through to other pages. The
entire World Wide Web is made up of links, the original idea being
that you could follow links from one place to another. This is how
robots get around.
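The check-then-follow behavior described above can be sketched in
Python with the standard library's urllib.robotparser. This is only
an illustration, not how any particular search engine works: the
robots.txt rules, the page HTML, the example.com URLs, the
"ExampleBot" agent name and the allowed_links helper are all
assumptions made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical HTML of a page the robot has just fetched.
PAGE_HTML = """
<a href="/about.html">About</a>
<a href="/private/report.html">Report</a>
<a href="products/list.html">Products</a>
"""


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def allowed_links(base_url, html, robots_txt, agent="ExampleBot"):
    """Return the page's links that robots.txt lets this agent fetch."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())  # parse rules without any network access

    collector = LinkCollector()
    collector.feed(html)

    # Resolve relative links against the page URL before checking them.
    absolute = [urljoin(base_url, href) for href in collector.links]
    return [url for url in absolute if rules.can_fetch(agent, url)]


to_visit = allowed_links("http://www.example.com/", PAGE_HTML, ROBOTS_TXT)
print(to_visit)
```

With these made-up inputs, the link under /private/ is dropped and
the other two are kept for later visits. A real robot would fetch
robots.txt from the site itself rather than from a string.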
The Disallow line in a robots.txt file means "disallow reading",
but that does not mean "disallow indexing". In other words, a
disallowed resource may still be listed in a search engine's index,
for example because other pages link to it, even if the search
engine follows the protocol.
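The difference between reading and indexing can be shown with a toy
model. The sketch below assumes a hypothetical index that records
every URL a robot has seen linked from other pages, and uses
robots.txt only to decide whether the page content may be read. The
index layout, the build_index helper, the URLs and the "ExampleBot"
agent name are invented for illustration; real search engines are
far more involved.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical links found on other pages: (target URL, anchor text).
DISCOVERED_LINKS = [
    ("http://www.example.com/about.html", "About us"),
    ("http://www.example.com/private/report.html", "Annual report"),
]


def build_index(links, robots_txt, agent="ExampleBot"):
    """List every discovered URL; mark whether its content may be read."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    index = {}
    for url, anchor_text in links:
        index[url] = {
            # Known from the linking page, so no fetch is needed.
            "anchor_text": anchor_text,
            # Disallow only controls whether the robot may read the page.
            "content_read": rules.can_fetch(agent, url),
        }
    return index


index = build_index(DISCOVERED_LINKS, ROBOTS_TXT)
for url, entry in index.items():
    print(url, entry)
```

In this model the disallowed report still appears in the index with
its anchor text, but its content is never read.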