Forum Moderators: goodroi
Between my clients' sites and my own, I oversee about 50-80 websites. None are more than 200 pages. None have a robots.txt file on the server. I always thought robots.txt was for very large sites or when site owners wanted to control what the engines spidered. We have nothing to hide on any of our sites, so have never used them. As far as I am concerned, anyone (including robots) can look at the sites.
I'm now adding XML sitemaps because the major engines say they prefer that we do so. I notice that in Google Webmaster Tools, when logged in, in the Diagnostic area there is a link for robots.txt analysis. It says Last Downloaded and has today's date. Under status it says 404 (Not found). Which makes sense because there isn't one. Google then says: "We check for a new robots.txt file approximately once per day."
Is this like XML sitemaps where the search engines prefer that we have a robots.txt file? I'll put them in if the major engines really want me to, but I see no particular reason to otherwise.
as for xml sitemaps, just because the engines say they would like you to do something doesn't mean you should do it. for good webmasters i see no benefit from xml sitemaps. but we should discuss that in a separate thread ;)
I hardly use robots.txt* and I don't use sitemaps, and I don't seem to suffer at all.
*Lately I've taken to using robots.txt in very specialised cases to block indexing of some legitimate duplicate content on deprecated URLs/mirrors by SOME bots mainly to save my bandwidth and the SEs'.
Rgds
Damon
*Lately I've taken to using robots.txt in very specialised cases to block indexing of some legitimate duplicate content on deprecated URLs/mirrors by SOME bots mainly to save my bandwidth and the SEs'.
And that is exactly what robots.txt is for -- To save bandwidth and control cooperative robots' crawling of your site.
Along with that comes an improvement in the usability/validity of your log files and stats, since they won't be full of 404-Not Found errors resulting from robots trying to fetch the customary robots.txt file.
You don't *have* to have a robots.txt file, but even if you don't need the robots-control facility it provides, adding one that's either blank or that contains the following two lines will keep those 404 errors out of your logs:
User-agent: *
Disallow:
Jim
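As a quick sanity check on the two options above, here is a minimal sketch using Python's standard urllib.robotparser. The bot name "ExampleBot" and the example.com URL are made up for illustration; the point is only that a parser following the usual rules should treat both a blank file and an empty Disallow as "everything allowed".

```python
from urllib.robotparser import RobotFileParser

# The two-line "allow everything" file from the post above.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" and the URL are hypothetical placeholders.
URL = "http://example.com/any/page.html"
allowed_with_rule = can_fetch(ALLOW_ALL, "ExampleBot", URL)
allowed_when_blank = can_fetch("", "ExampleBot", URL)
print(allowed_with_rule, allowed_when_blank)
```

Either way the crawler is unrestricted; the only practical difference from having no file at all is that the request for robots.txt no longer returns a 404.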
Damon also makes a good point about keeping it simple: I have helped many people whose sites fell out of the search engines because of a badly formatted robots.txt. Those webmasters never used a validator to verify that the file was correct. This is not to say robots.txt is hard; it is more a story of not being a lazy webmaster.
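To give an idea of what a validator looks for, here is a rough sketch of a line-by-line check in Python. It is not any particular validator's logic: the function name lint_robots_txt and the list of accepted fields are assumptions for illustration, and it only catches the most common slips (misspelled fields, missing colons, rules before any User-agent line, paths that don't start with a slash).

```python
# Fields this sketch accepts; the list is an assumption, not a complete standard.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots_txt(text: str) -> list:
    """Return a list of human-readable problems found in robots.txt text."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {number}: missing ':' in {raw!r}")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field {field!r}")
        elif field == "user-agent":
            if not value:
                problems.append(f"line {number}: empty User-agent")
            seen_user_agent = True
        elif field in {"disallow", "allow"}:
            if not seen_user_agent:
                problems.append(f"line {number}: rule before any User-agent line")
            elif value and not value.startswith(("/", "*")):
                problems.append(f"line {number}: path {value!r} should start with '/'")
    return problems


good = "User-agent: *\nDisallow: /cgi-bin/\n"
bad = "Disallow: /\nUser agent *\nDisalow: /x\n"
print(lint_robots_txt(good))
for problem in lint_robots_txt(bad):
    print(problem)
```

A real validator will check more than this, so treat a clean result here as a first pass only.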
Internet Archive Wayback Machine: Takes a periodic snapshot of your site, making it available for browse/search years after pages may have been taken down. To block it, put these lines in your robots.txt file:
User-agent: ia_archiver
Disallow: /
Google Images, Yahoo Image Search, PicSearch: These crawlers look for images on your site, make a best-guess as to their content, and make it easy for everyone to view or download. Depending on whether you think this is good or bad, you may want to block them. Add these lines to your robots.txt file:
User-agent: Googlebot-Image
Disallow: /
User-agent: Yahoo-MMCrawler
Disallow: /
User-agent: psbot
Disallow: /
Even if you aren't concerned about inappropriate content being spidered by the major SEs, a robots.txt file will avoid clogging up your logs with 404 errors.
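If you want to confirm that blocks like the ones above hit only the crawlers you named, a short check with Python's standard urllib.robotparser is one way to do it. This is a sketch: the example.com image URL is made up, and Googlebot is included only as an example of a crawler that none of the records mention.

```python
from urllib.robotparser import RobotFileParser

# The per-crawler records quoted in the post above, combined into one file.
RULES = """\
User-agent: ia_archiver
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: Yahoo-MMCrawler
Disallow: /

User-agent: psbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical URL, used only to ask the parser a question.
IMAGE_URL = "http://example.com/images/photo.jpg"
AGENTS = ("ia_archiver", "Googlebot-Image", "Yahoo-MMCrawler", "psbot", "Googlebot")
verdicts = {agent: parser.can_fetch(agent, IMAGE_URL) for agent in AGENTS}

for agent, allowed in verdicts.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

The four named crawlers come back blocked, while a crawler with no matching record and no "User-agent: *" fallback is left unrestricted.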
My understanding is: Robots will go through each and every page on your website they can find whether you want them to or not. The robots.txt file simply tells the spider not to save a copy of things listed on the robots.txt file and not to add those pages to the indexes. I don't think this will ever change because the robots also gather statistics for Google (and others).
How many pages does the average website block? For a search engine company to know this they would need to spider them all.
That being said, I think you need to use them to ensure that some things, such as "member profiles", do not get indexed, unless you have a better way. A BETTER way to keep that content hidden is to make the links to profiles and the like show up only when a user is logged in.
This thread has me wondering if a webmaster needs to hide anything at all because everything is a potential link back to your site from a search engine... but then I remember that our sites get rated by a machine that can't fully comprehend the content. Oh joy!
Robots will go through each and every page on your website they can find whether you want them to or not. The robots.txt file simply tells the spider not to save a copy of things listed on the robots.txt file and not to add those pages to the indexes.
Well-behaved robots, including those from major search engines, will not fetch a page if it is Disallowed in a properly-formatted robots.txt file.
Robots.txt was originally conceived as a way for Webmasters to prevent robots from consuming excess bandwidth, and to keep them from executing cgi scripts. However, now that the Web has gone commercial, there are many other good reasons to Disallow spiders from fetching various URLs.
A second control mechanism exists in the HTML <meta name="robots" content="noindex"> tag; its function is different, and the file containing it must not be Disallowed in robots.txt, or the robots won't be able to fetch it to "read" it.
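The interaction between the two mechanisms can be sketched in Python with the standard html.parser and urllib.robotparser modules. The helper names (RobotsMetaParser, robots_meta_directives, noindex_can_be_read), the bot name and the example.com URL are all made up for illustration; the sketch only shows that a noindex tag is useless to a cooperative robot if robots.txt stops it from fetching the page.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_meta_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def noindex_can_be_read(robots_txt: str, user_agent: str, url: str, html: str) -> bool:
    """True only if the page carries noindex AND the robot is allowed to fetch it."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url) and "noindex" in robots_meta_directives(html)


PAGE = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
URL = "http://example.com/private/a.html"  # hypothetical
BLOCKING = "User-agent: *\nDisallow: /private/\n"

print(noindex_can_be_read("", "ExampleBot", URL, PAGE))
print(noindex_can_be_read(BLOCKING, "ExampleBot", URL, PAGE))
```

With the Disallow in place the second call returns False: the robot never fetches the page, so it never sees the tag.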
See www.robotstxt.org and w3c.org for authoritative information.
Jim
Well-behaved robots, including those from major search engines, will not fetch a page if it is Disallowed in a properly-formatted robots.txt file.
If they already have pages indexed or cached that have only recently been Disallowed in a revised robots.txt, then the SEs won't remove those already-indexed pages.
Quite a dilemma for webmasters.
If you accidentally allowed content to be published on two sites, you need to choose one and block the other or you will have problems with the engines.
Not sure where the dilemma is.
Many of these will actually go to robots.txt to see what you are trying to hide or protect, and go straight to the restricted content. For this reason I use a dynamic robots.txt page.
Through proper use of .htaccess and mod_rewrite, every time robots.txt is requested from my server, it invisibly serves a PHP page (although it looks the same to the viewer) that detects what bot or browser is viewing the page. For search engine spiders I serve the real robots.txt content for proper indexing, and for all others I simply disallow everything.
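The selection logic described above could look roughly like this. The poster's version is PHP behind mod_rewrite; this is only a Python sketch of the same idea, and the spider tokens, the rule text and the function name robots_txt_for are assumptions for illustration, not the poster's actual setup.

```python
# Placeholder for the site's real rules (hypothetical content).
REAL_RULES = "User-agent: *\nDisallow: /cgi-bin/\n"

# Served to everything that is not recognised as a search engine spider.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Substrings that identify the spiders you trust; this list is an assumption.
TRUSTED_SPIDERS = ("googlebot", "slurp", "msnbot")


def robots_txt_for(user_agent: str) -> str:
    """Pick the robots.txt body to serve based on the User-Agent header."""
    agent = (user_agent or "").lower()
    if any(token in agent for token in TRUSTED_SPIDERS):
        return REAL_RULES
    return BLOCK_ALL


print(robots_txt_for("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(robots_txt_for("SomeUnknownBot/1.0"))
```

Since the User-Agent header is whatever the client chooses to send, this only hides the real rules from visitors that identify themselves honestly or don't bother to imitate a known spider.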