Why should I have a robots.txt file? - Sitemaps, Meta Data, and robots.txt forum at WebmasterWorld - WebmasterWorld

Forum Moderators: goodroi


Why should I have a robots.txt file?

Isn't it OK if I don't?

beren

11:36 pm on Dec 29, 2006 (gmt 0)

10+ Year Member

Top Contributors Of The Month

This might be a dumb question, but I did not see one like it anywhere, including the forum library.

Between my clients' sites and my own, I oversee about 50-80 websites. None are more than 200 pages. None have a robots.txt file on the server. I always thought robots.txt was for very large sites or when site owners wanted to control what the engines spidered. We have nothing to hide on any of our sites, so have never used them. As far as I am concerned, anyone (including robots) can look at the sites.

I'm now adding XML sitemaps because the major engines say they prefer that we do so. I notice that in Google Webmaster Tools, when logged in, in the Diagnostic area there is a link for robots.txt analysis. It says Last Downloaded and has today's date. Under status it says 404 (Not found). Which makes sense because there isn't one. Google then says: "We check for a new robots.txt file approximately once per day."

Is this like XML sitemaps where the search engines prefer that we have a robots.txt file? I'll put them in if the major engines really want me to, but I see no particular reason to otherwise.

pleeker

7:58 am on Dec 30, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Your impression is correct that the purpose of robots.txt is just to control what obedient bots will and won't crawl.

If the bot doesn't find a robots.txt, it assumes it's okay to crawl anything and everything it finds under that domain. If you're okay with that, then no worries. :)

The Contractor

12:56 pm on Dec 30, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Under status it says 404 (Not found).

I would put up a blank one, if only to stop logging 404s in the server logs.

goodroi

1:58 pm on Dec 30, 2006 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

Robots.txt files are also helpful in blocking bots from crawling directories that contain scripts. If you have a very plain site and you would be OK with having the engines crawl everything, then you do not need a robots.txt.

As for XML sitemaps, just because the engines say they would like you to do something doesn't mean you should do it. For good webmasters I see no benefit from XML sitemaps. But we should discuss that in a separate thread ;)

DamonHD

2:01 pm on Dec 30, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Hi,

I hardly use robots.txt* and I don't use sitemaps, and I don't seem to suffer at all.

*Lately I've taken to using robots.txt in very specialised cases to block indexing of some legitimate duplicate content on deprecated URLs/mirrors by SOME bots mainly to save my bandwidth and the SEs'.

Rgds

Damon

jdMorgan

3:56 pm on Dec 30, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

*Lately I've taken to using robots.txt in very specialised cases to block indexing of some legitimate duplicate content on deprecated URLs/mirrors by SOME bots mainly to save my bandwidth and the SEs'.

And that is exactly what robots.txt is for -- To save bandwidth and control cooperative robots' crawling of your site.

Along with that comes an improvement in the usability/validity of your log files and stats, since they won't be full of 404-Not Found errors resulting from robots trying to fetch the customary robots.txt file.

You don't *have* to have a robots.txt file, but even if you don't need the robots-control facility it provides, adding one that's either blank, or that contains

User-agent: *
Disallow:

is a very good idea, if just to keep your access log and error log clean, and avoid skewing your stats with all those errors from attempted robots.txt fetches.

Jim
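One way to check that both forms described above (a blank file, or "User-agent: *" with an empty Disallow) leave a site fully crawlable is to run them through Python's standard-library urllib.robotparser. This is only a sketch: the allows_everything helper, the sample paths, and the "ExampleBot" name are made up for illustration, and real crawlers may treat edge cases differently from this module.

```python
from urllib.robotparser import RobotFileParser

# The explicit allow-everything file from the post above.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# A completely blank robots.txt.
BLANK = ""

# For contrast: a file that shuts every cooperative bot out.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def allows_everything(robots_txt, paths=("/", "/images/logo.png", "/cgi-bin/form.pl")):
    """Return True if a generic bot may fetch every sample path (paths are hypothetical)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return all(parser.can_fetch("ExampleBot", path) for path in paths)


for label, text in (("allow-all", ALLOW_ALL), ("blank", BLANK), ("block-all", BLOCK_ALL)):
    print(label, allows_everything(text))
```

The first two files should behave identically for a cooperative bot; the only practical difference from having no file at all is that the request no longer ends in a 404.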

DamonHD

5:04 pm on Dec 30, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

My feeling is, in the spirit of KISS (Keep It Simple, Stupid), that unless you NEED it, don't put it up at all, and filter out any (spurious) errors from its absence in other ways.

Rgds

Damon
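For anyone taking the "filter it out in other ways" route, a minimal sketch of dropping robots.txt 404s from an access log before running stats might look like this. It assumes Common/Combined Log Format; the regex, the drop_robots_404s name, and the sample lines are all hypothetical and would need adjusting for a real server.

```python
import re

# Matches a GET or HEAD request for /robots.txt that returned 404.
# Assumes Common/Combined Log Format: "METHOD path HTTP/x.y" status size
ROBOTS_404 = re.compile(r'"(?:GET|HEAD) /robots\.txt(?:\?[^ "]*)? HTTP/[\d.]+" 404 ')


def drop_robots_404s(lines):
    """Yield every log line except 404 responses for /robots.txt."""
    for line in lines:
        if not ROBOTS_404.search(line):
            yield line


# Made-up sample lines, for illustration only.
sample = [
    '192.0.2.1 - - [30/Dec/2006:17:04:00 +0000] "GET /robots.txt HTTP/1.1" 404 209',
    '192.0.2.1 - - [30/Dec/2006:17:04:01 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '192.0.2.7 - - [30/Dec/2006:17:05:12 +0000] "GET /missing.html HTTP/1.1" 404 209',
]

kept = list(drop_robots_404s(sample))
for line in kept:
    print(line)
```

Genuine 404s for other URLs are left alone, so broken links still show up in the stats.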

goodroi

1:11 pm on Dec 31, 2006 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

I do agree with Jim that it would be ideal to make a robots.txt and allow everything.

Damon also makes a good point in keeping it simple since I have helped many people who had their sites fall out of the search engines because they made a badly formatted robots.txt. The webmasters never used a validator to verify the robots.txt was correct. This is not to say robots.txt is hard. It is more a story of not being a lazy webmaster.
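As a rough illustration of the kind of check that catches such mistakes, the sketch below asks whether a few well-known crawlers could still fetch the home page under a given robots.txt. It is not a syntax validator, just a sanity test built on Python's standard urllib.robotparser; the blocked_agents helper, the agent list, and the /private/ path are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_txt, agents=("Googlebot", "Slurp", "msnbot"), path="/"):
    """Return the agents that may NOT fetch `path` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


# Meant to hide one directory, but the path was left as "/" by mistake.
MISTAKE = """\
User-agent: *
Disallow: /
"""

# What the webmaster actually intended (hypothetical directory name).
INTENDED = """\
User-agent: *
Disallow: /private/
"""

print("mistake blocks: ", blocked_agents(MISTAKE))
print("intended blocks:", blocked_agents(INTENDED))
```

Running something like this before uploading a new file would flag a rule that locks the engines out of the whole site.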

jwolthuis

4:36 pm on Dec 31, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Besides the "normal" search engines, there are specialty search engines you may want to allow or block. At a minimum, it's good to be aware of them.

Internet Archive Wayback Machine: Takes a periodic snapshot of your site, making it available for browse/search years after pages may have been taken down. To block it, put these lines in your robots.txt file:

User-agent: ia_archiver
Disallow: /

Google Images, Yahoo Image Search, PicSearch: These crawlers look for images on your site, make a best-guess as to their content, and make it easy for everyone to view or download. Depending on whether you think this is good or bad, you may want to block them. Add these lines to your robots.txt file:

User-agent: Googlebot-Image
Disallow: /

User-agent: Yahoo-MMCrawler
Disallow: /

User-agent: psbot
Disallow: /
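To see how per-robot records like these are interpreted, here is a small sketch using Python's standard urllib.robotparser. It combines the four records above into one file and checks a made-up image path; the path is hypothetical, and how any particular crawler really behaves is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# The four records from the post, separated by blank lines.
RULES = """\
User-agent: ia_archiver
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: Yahoo-MMCrawler
Disallow: /

User-agent: psbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

AGENTS = ("ia_archiver", "Googlebot-Image", "Yahoo-MMCrawler", "psbot", "Googlebot")

# True means the agent is allowed to fetch the (hypothetical) image URL.
results = {agent: parser.can_fetch(agent, "/photos/cat.jpg") for agent in AGENTS}

for agent, allowed in results.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Because there is no "User-agent: *" record, any robot not named in the file (the regular Googlebot here) is still allowed everywhere.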

wrgvt

4:38 pm on Dec 31, 2006 (gmt 0)

10+ Year Member

The only thing I do with my robots.txt file is to disallow crawling of my images directory. I have enough problems with hotlinked images as it is.

netmeg

6:45 pm on Dec 31, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Some of my sites get hit really hard by spiders (especially overseas ones) that have no reason to be spidering them - sucking up a ton of bandwidth - and we block for that reason.

AhmedF

7:11 pm on Dec 31, 2006 (gmt 0)

10+ Year Member

I block out print pages and also AJAX-driven pages (Google ripped through my JS hidden links and had me very confused for a while).

Bones

12:44 pm on Jan 1, 2007 (gmt 0)

10+ Year Member

Not sure how relevant this is today, but still worth a read perhaps:
[webmasterworld.com...]

rogerd

3:53 am on Jan 2, 2007 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

It's a handy place for a blog [webmasterworld.com], too. Of course, that approach may not be for everybody. ;)

Even if you aren't concerned about inappropriate content being spidered by the major SEs, a robots.txt file will avoid clogging up your logs with 404 errors.

Receptional

4:48 pm on Jan 2, 2007 (gmt 0)

We use it to help eliminate bad traffic. Without it we rank at the top for odd things; "terms of trade" was the one that eventually made us realize that, by not using robots.txt to exclude non-(SEO)-marketing pages, we were in danger of losing focus.

Kurgano

2:55 am on Jan 3, 2007 (gmt 0)

10+ Year Member

I've read countless hundreds of documents about robots.txt files and am still not completely clear on some issues regarding them.

My understanding is: Robots will go through each and every page on your website they can find, whether you want them to or not. The robots.txt file simply tells the spider not to save a copy of things listed on the robots.txt file and not to add those pages to the indexes. I don't think this will ever change because the robots also gather statistics for G (and others).

How many pages does the average website block? For a search engine company to know this they would need to spider them all.

That being said I think you need to use them to ensure that some things do not get indexed like "member profiles" etc unless you have a better way. A BETTER way to keep that content hidden is to make the links to profiles etc show up only when a user is logged in.

This thread has me wondering if a webmaster needs to hide anything at all because everything is a potential link back to your site from a search engine... but then I remember that our sites get rated by a machine that can't fully comprehend the content. Oh joy!

jdMorgan

3:52 am on Jan 3, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Robots will go through each and every page on your website they can find, whether you want them to or not. The robots.txt file simply tells the spider not to save a copy of things listed on the robots.txt file and not to add those pages to the indexes.

Well-behaved robots, including those from major search engines, will not fetch a page if it is Disallowed in a properly-formatted robots.txt file.

Robots.txt was originally conceived as a way for Webmasters to prevent robots from consuming excess bandwidth, and to keep them from executing cgi scripts. However, now that the Web has gone commercial, there are many other good reasons to Disallow spiders from fetching various URLs.

A second control mechanism exists in the HTML <meta name="robots" content="noindex"> tag; its function is different, and the file containing it must not be Disallowed in robots.txt, or the robots won't be able to fetch it to "read" it.

See www.robotstxt.org and w3c.org for authoritative information.

Jim
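The interaction between the two mechanisms can be sketched in a few lines of Python. This is a simplified model, not how any search engine is actually implemented: the sees_noindex helper, the "ExampleBot" name, and the /profile.html page are invented for the example. The point is only that a cooperative bot checks robots.txt first and never gets to read a meta tag on a page it is not allowed to fetch.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collect the content attribute of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives.append((attrs.get("content") or "").lower())


def sees_noindex(robots_txt, path, html, agent="ExampleBot"):
    """True only if the bot may fetch the page AND finds noindex in it."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(agent, path):
        return False  # page is never fetched, so the meta tag is never read
    finder = RobotsMetaFinder()
    finder.feed(html)
    return any("noindex" in directive for directive in finder.directives)


PAGE = '<html><head><meta name="robots" content="noindex"></head><body>Profile</body></html>'

OPEN = "User-agent: *\nDisallow:\n"
CLOSED = "User-agent: *\nDisallow: /profile.html\n"

print(sees_noindex(OPEN, "/profile.html", PAGE))
print(sees_noindex(CLOSED, "/profile.html", PAGE))
```

In the second case the bot stays away from the page, so the noindex instruction inside it has no chance to take effect.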

Shawna

11:01 pm on Jan 3, 2007 (gmt 0)

10+ Year Member

User-agent: Googlebot-Image
Disallow: /

I tried this over a month ago and the Google image bot is still eating up my bandwidth like mad. I am spending an extra 10 dollars a month on bandwidth because of Google. What am I doing wrong?

Glitzer

4:36 pm on Jan 4, 2007 (gmt 0)

10+ Year Member

I'm uncertain if this will work, but can you stop it in .htaccess?

Glitzer

4:41 pm on Jan 4, 2007 (gmt 0)

10+ Year Member

Well-behaved robots, including those from major search engines, will not fetch a page if it is Disallowed in a properly-formatted robots.txt file.

But if they already have pages indexed or cached, and those pages are later Disallowed in a revised robots.txt, the SEs won't remove them from the index.

Quite a dilemma for webmasters.

goodroi

9:17 pm on Jan 4, 2007 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

If a search engine crawls a page and then you block it with robots.txt, that page will eventually fall out of the search index.

If you accidentally allowed content to be published on two sites, you need to choose one and block the other or you will have problems with the engines.

Not sure where the dilemma is.

piplio

6:50 am on Jan 7, 2007 (gmt 0)

10+ Year Member

Set up robots.txt to prevent search engines from over-crawling.

It happened to me once and I paid a hefty price for it...

hybrid6studios

8:45 am on Jan 11, 2007 (gmt 0)

10+ Year Member

Be aware that not every spider obeys Robots.txt. There are some nasty bots out there, including but not limited to: 1) spambots that harvest email addresses from your contact forms or guestbook pages; 2) scrapers that scrape your site for free content to be used in their spammy doorway pages; 3) downloader programs that suck your bandwidth by downloading your entire site; 4) programs that are out on the web looking for copyright infringements so they can sue people; 5) viruses & worms; 6) data mining programs; 7) hackers; 8) DDOS attacks, etc.

Many of these will actually go to Robots.txt to see what you are trying to hide or protect, and go straight to the restricted content. For this reason I use a dynamic robots.txt page.

Through proper use of .htaccess and mod_rewrite, every time robots.txt is requested, my server invisibly serves a PHP page (although it looks the same to the viewer) that detects which bot or browser is requesting it. For search engine spiders I serve the real Robots.txt content for proper indexing, and for all others I simply disallow everything.
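The selection logic behind a setup like this can be sketched as follows. The post describes a PHP page behind mod_rewrite; this is a Python illustration of the idea only, not that code. The TRUSTED_BOTS list, the placeholder rules, and the robots_for name are all assumptions made for the example.

```python
# Hypothetical allow-list of crawler name fragments; not taken from the post.
TRUSTED_BOTS = ("googlebot", "slurp", "msnbot")

# Placeholder for the site's real rules (the /cgi-bin/ path is an assumption).
REAL_RULES = "User-agent: *\nDisallow: /cgi-bin/\n"

# What everyone else gets: a blanket disallow that reveals no paths.
LOCKED_DOWN = "User-agent: *\nDisallow: /\n"


def robots_for(user_agent):
    """Pick the robots.txt body to serve, based on the User-Agent header."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in TRUSTED_BOTS):
        return REAL_RULES
    return LOCKED_DOWN


print(robots_for("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(robots_for("SomeUnknownHarvester/1.0"))
```

Keep in mind that the User-Agent header is supplied by the client and can be forged, so a check like this only sorts out visitors that identify themselves honestly.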