I have modified my robots.txt file recently, but not in a way that should cause this, I think.
BUT I did put in a rule that mentioned Slurp specifically; would that mean it won't read the
User-agent: * directives that come later in the file?
User-agent: Googlebot-Image
Disallow: /
#
# User-agent: Mediapartners-Google
# Disallow: /
#
User-agent: msnbot
Disallow: /bloop/
Disallow: /blop/
#
User-agent: googlebot
Disallow: /bloop/
Disallow: /blop/
#
User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/
#
User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
74.6.86.210 went for a page in a disallowed directory but which is linked to from outside that directory
74.6.86.70 went for a page in the same disallowed directory which is not linked to from anywhere outside that directory
74.6.87.71 went for the default of the same disallowed directory - which does not exist - and got a 403.
The site has one User-agent: * in the robots.txt disallowing that directory to all.
A blank line is required after each record, whether or not you have comment lines starting with "#". There's even one 'bot that used to consider a robots.txt file to be invalid without a blank line at the very end.
Just a thought.
Jim
User Agent : Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
Server time : 21:25:30 1/09/2007
Apparent IP : 74.6.86.220
Remote Host : ct501087.inktomisearch.com
That is correct. When there is an agent-specific rule, the crawler applies that rule, not the generic rule. The "User-Agent: *" rule is used only if no other rule matches, not applied in addition to other matching rules.
His listed /robots.txt file says:
User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/
#
User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
So Slurp is disallowed from /bloop/ and /blop/, but not disallowed from
/shop/, /forum/ and /cgi-bin/.
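For anyone who wants to check this without waiting for a crawl, here is a minimal sketch using Python's standard urllib.robotparser module. It shows how that parser resolves the file quoted above, which happens to match the behaviour described for Slurp; it is not Slurp's own code, and the agent name "ExampleBot" and the page paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt quoted above, with a blank line between the two records.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/

User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
"""


def build_parser(text):
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical crawler with no record of its own,
# so only the wildcard record applies to it.
for agent in ("Slurp", "ExampleBot"):
    for path in ("/bloop/page.html", "/shop/item.html"):
        verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
        print(f"{agent:10} {path:18} {verdict}")
```

With this parser, Slurp is blocked from /bloop/ but allowed into /shop/, while the unnamed bot gets the opposite result: the named record replaces the wildcard record rather than adding to it.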
User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/
i.e. will all bots avoid the bottrap directory, and Slurp also avoid the specified ones, if they're obedient little bots?
Now, big question and possibly big can of worms: Do other 'bots process specific and generic directives the same as Slurp? Or, do some obey both specific and generic? I've never happened to run across anything on this.
To make sure, just add the extra Disallow directives to the Slurp block (and the other Bots'). Problem solved.
User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: msnbot
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/
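Copying the shared Disallow lines into every named block by hand is easy to get wrong, so one option is to generate the file. A rough sketch in Python follows; the names SHARED_RULES, EXTRA_RULES, render_record and render_robots_txt are invented for this example, and the directory names are simply the ones used in this thread.

```python
# Directories every crawler should stay out of.
SHARED_RULES = ["/shop/", "/forum/", "/cgi-bin/", "/badbottrap/"]

# Extra directories for the crawlers that get their own record.
EXTRA_RULES = {
    "Slurp": ["/bloop/", "/blop/"],
    "msnbot": ["/bloop/", "/blop/"],
}


def render_record(agent, paths):
    """Return one robots.txt record: a User-agent line plus its Disallow lines."""
    lines = [f"User-agent: {agent}"]
    lines += [f"Disallow: {path}" for path in paths]
    return "\n".join(lines)


def render_robots_txt(shared, extras):
    """Build the whole file, repeating the shared rules in every named record."""
    records = [render_record("*", shared)]
    for agent, paths in extras.items():
        records.append(render_record(agent, paths + shared))
    # Records are separated by a blank line, and the file ends with a newline.
    return "\n\n".join(records) + "\n"


robots_txt = render_robots_txt(SHARED_RULES, EXTRA_RULES)
print(robots_txt)
```

Adding another crawler is then a one-line change to EXTRA_RULES, and a new shared directory only has to be listed once.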
I think: Put the wildcard ones first, for all the other, less important bots. Then you put the bot-specific ones.
If a bot comes across its name, it'll stop there.
I _think_ that's right :)
My objective here: To stop the most important SE bots indexing the insubstantial CMS pages in /bloop/ and /blop/, except the Adsense bot :).
Thanks for the help.
Tech support found the problem and I was just notified that there was, indeed, a problem with the A-name setup and the error has been corrected.
It looked like an infinite loop was happening, the way Slurp was hammering away, and I believe that there's still something Yahoo needs to check into, since theirs is the only crawler that ran into this mess.
I filled out the support form for Search yesterday with as much detail as I could at the time, and wish there had at least been an auto-response so that I could get some more details from the host, in addition to what I've already found out, and give them a follow-up because if it's happened to some now it could well happen to others in the future.
User-agent: Slurp
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: msnbot
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: googlebot
Disallow: /bloop/
Disallow: /blop/
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/

User-agent: *
Disallow: /shop/
Disallow: /forum/
Disallow: /cgi-bin/
Disallow: /badbottrap/
The idea is the main bots get to their directives, and stop. The rest carry on to the wildcard.
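On whether the wildcard record should come first or last: a quick way to test one parser is to feed it both orderings and compare the answers. The sketch below does that with Python's standard urllib.robotparser, which gives the same verdicts either way. That only says something about this one parser; individual crawlers may behave differently. The shortened records, the helper name verdicts and the agent "ExampleBot" are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Shortened versions of the records discussed in this thread.
SLURP_RECORD = (
    "User-agent: Slurp\n"
    "Disallow: /bloop/\n"
    "Disallow: /blop/\n"
    "Disallow: /shop/\n"
)
WILDCARD_RECORD = (
    "User-agent: *\n"
    "Disallow: /shop/\n"
    "Disallow: /badbottrap/\n"
)

# (agent, path) pairs to check; "ExampleBot" is a hypothetical unnamed crawler.
CHECKS = [
    ("Slurp", "/bloop/a.html"),
    ("Slurp", "/shop/a.html"),
    ("Slurp", "/badbottrap/a.html"),
    ("ExampleBot", "/bloop/a.html"),
    ("ExampleBot", "/badbottrap/a.html"),
]


def verdicts(robots_txt):
    """Map each (agent, path) pair to True (allowed) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {(agent, path): parser.can_fetch(agent, path) for agent, path in CHECKS}


# Join the records with a blank line between them, in both orders.
specific_first = verdicts(SLURP_RECORD + "\n" + WILDCARD_RECORD)
wildcard_first = verdicts(WILDCARD_RECORD + "\n" + SLURP_RECORD)

print("Same verdicts in both orders:", specific_first == wildcard_first)
```

Note that in this shortened example Slurp is still allowed into /badbottrap/, because its own record does not list that directory. That is the same reason the full blocks above repeat every shared rule.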
That is EXACTLY what happened in my case, and once my host identified and corrected the issue, it stopped completely and has been 100% back to normal.
I've received an exceptionally nice response from the support team, including a reference to published information on Yahoo's site about what to do about crawl issues
How can I reduce the number of requests you make on my web site? [help.yahoo.com]
That would apply under normal circumstances, but in my case it was a situation of being caught in an endless loop - now fixed, but it took some digging to find the cause. I also got a ton of referrers from other sites on the same server, which showed up in Webalizer, of all places.
More information on the YSearch Blog
[ysearchblog.com...]
[edited by: Marcia at 9:09 pm (utc) on Jan. 18, 2007]