Forum Moderators: Robert Charlton & goodroi
My guess is that somewhere, someone has linked to these specific robots.txt files - including yours, most likely. Certainly the top results for that query are robots.txt files that I know have been linked to - including ours here at WebmasterWorld.
Google should simply drop all robots.txt files from their index, IMO - but we're the tail trying to wag the dog on this kind of thing. How is this causing you a problem, Mike? Is your robots.txt actually ranking on a keyword query?
At any rate, this does bring up the crazy question, how can you remove a robots.txt file from Google's index? If you use robots.txt to block it, that would mean that googlebot should not even request robots.txt - an insane loop. And of course, you don't use meta tags in a robots.txt file.
Unless this is some kind of a problem on keyword queries, I'd suggest you just chalk it up to strange and move on.
If you don't want spiders to index certain URLs, your only recourse is to block them, preferably at a point in the network chain as close to the request as possible.
I wonder whether it could be removed within Webmaster Tools by using Remove URLs.
In order to use that tool, you have three options:
1. block with robots.txt
2. use a noindex in a robots meta tag
3. return a 404
None of these options are viable, unless just maybe you can afford to have no robots.txt file at all. Still, that is not sane.
How is this causing you a problem, Mike? Is your robots.txt actually ranking on a keyword query?
You never know if it is causing a problem with Google. I wondered if it was a side effect of something bad. We were the victim of a server hack (someone uploaded p*rn images into every page) two weeks ago. Then last week someone linked to 1000+ pages that did not exist on the site (yes, Google indexed them too, as duplicate content, even though they had noindex/nofollow tags), so I wondered if this was the latest ranking attack by our competitors.
...META tags give you useful control over how each webpage on your site is indexed. But it only works for HTML pages. How can you control access to other types of documents, such as Adobe PDF files, video and audio files and other types? Well, now the same flexibility for specifying per-URL tags is available for all other files type.
- Official Google Blog [googleblog.blogspot.com]
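For anyone wanting to experiment, here is a rough Python sketch of what a per-URL tag on a non-HTML file could look like in practice. It assumes the mechanism the quote refers to is the X-Robots-Tag HTTP response header (the quote itself does not name it), and it assumes a toy server built on Python's standard http.server. The rules in ROBOTS_BODY, the helper name extra_headers and the port are all made up for illustration. Whether Google actually honors a noindex header on robots.txt itself is not something this thread establishes.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical robots.txt content, for illustration only.
ROBOTS_BODY = b"User-agent: *\nDisallow: /private/\n"


def extra_headers(path: str) -> dict:
    """Return extra response headers for a request path.

    Assumption: robots.txt should carry a noindex hint via X-Robots-Tag.
    """
    if path.split("?", 1)[0] == "/robots.txt":
        return {"X-Robots-Tag": "noindex"}
    return {}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only robots.txt is served in this sketch; everything else is a 404.
        if self.path.split("?", 1)[0] != "/robots.txt":
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        for name, value in extra_headers(self.path).items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(ROBOTS_BODY)))
        self.end_headers()
        self.wfile.write(ROBOTS_BODY)


def run(port: int = 8000) -> None:
    """Start the toy server on localhost (call this manually to try it)."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling run() and then requesting /robots.txt with any HTTP client should show the extra header in the response. On a real site you would more likely set the header in the web server configuration than in application code.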
I added this to robots.txt:
Disallow: /robo
Google picked up the new robots.txt file after a few days (for use in checking what they can index).
They looked at it several times each week, but the OLD robots.txt content remained in the SERPs and in their cache and snippet for many weeks.
About a month later, the robots.txt file dropped from the SERPs.
You can use robots.txt to block the indexer from indexing its content.
That will not restrict access for the bot that retrieves the file for its real intended purpose.
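A quick way to see what Disallow: /robo actually covers is to run it through Python's standard urllib.robotparser. This is only a sketch: the User-agent: * line is an assumption (the post shows just the Disallow line), the test paths are invented, and Python's parser is not Googlebot, so treat the result as an illustration of plain prefix matching rather than of Google's behavior.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: the "User-agent: *" line is not shown in the post.
rules = [
    "User-agent: *",
    "Disallow: /robo",
]

parser = RobotFileParser()
parser.parse(rules)


def allowed(path: str) -> bool:
    """True if a generic crawler may fetch this path under the rules above."""
    return parser.can_fetch("*", path)


# Disallow values are matched as path prefixes, so every path
# beginning with /robo is covered, not only /robots.txt.
for path in ("/robots.txt", "/robotics/index.html", "/index.html"):
    print(path, "allowed" if allowed(path) else "blocked")
```

One thing to keep in mind with a short prefix like /robo: it also blocks any other URL on the site that happens to start with those characters.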
If you use robots.txt to block it, that would mean that googlebot should not even request robots.txt - an insane loop. And of course, you don't use meta tags in a robots.txt file.
We already have a very common situation where the contents of robots.txt don't prevent robots.txt itself from being fetched:
User-agent: *
Disallow: /
Clearly robots.txt is included under the root directory, but it must still be requested, and requested repeatedly, to see if it has changed.
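The way a crawler gets out of the loop can be sketched in a few lines of Python. This is an assumption about how a well-behaved bot might be written, not a description of Googlebot's code: the function name may_request, the bot name examplebot and the example.com URLs are all hypothetical.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def may_request(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Decide whether a crawler should request a URL.

    robots.txt itself is always requested, because the crawler needs it
    to learn the rules; every other URL is checked against those rules.
    """
    if urlsplit(url).path == "/robots.txt":
        return True
    return parser.can_fetch(user_agent, url)


# The blanket block discussed above.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /"])

print(may_request(parser, "examplebot", "http://example.com/robots.txt"))
print(may_request(parser, "examplebot", "http://example.com/page.html"))
```

The point of the sketch is that fetching the file to read the rules and indexing the file as a document are separate decisions, so a Disallow line can apply to the second without affecting the first.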
So just as
Disallow: /*.txt$
Disallow: /robots.txt
...., why not use a noindex directive in the robots.txt file?
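For reference, the two Disallow patterns quoted above use different matching: the first relies on the * and $ wildcards that Google supports as an extension, the second is a plain prefix. Python's standard robotparser does not handle the wildcards, so here is a small hand-rolled sketch of how such patterns could be matched. The function names are made up, and it deliberately ignores Allow lines and rule precedence.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a Disallow pattern with * and $ into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end
    of the URL. Without '$' the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path: str, patterns: list) -> bool:
    """True if any Disallow pattern matches the path."""
    return any(pattern_to_regex(p).match(path) for p in patterns)


patterns = ["/*.txt$", "/robots.txt"]
for path in ("/robots.txt", "/docs/readme.txt", "/notes.txt?v=2", "/index.html"):
    print(path, "blocked" if is_blocked(path, patterns) else "allowed")
```

Note that with the $ anchor, a URL such as /notes.txt?v=2 is not matched by /*.txt$ because it does not end in .txt.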
They have collected more than two million robots.txt files, and it is interesting to see which bots other people are blocking.
I use Disallow: /robo in my robots.txt so will not have the problem of my own data being so open to inspection. :-)