

Media bot vs. Googlebot and robots.txt

         

ionchannels

5:50 pm on Nov 6, 2006 (gmt 0)

10+ Year Member



I just had an interesting exchange with AdSense support about site targeting problems. A couple of my portals were experiencing very poor ad targeting, so I emailed AdSense support. They said the problem was due to my robots.txt, which explicitly gave Googlebot permission to access the entire site but disallowed all other bots. They said that this was preventing Mediabot from accessing the site and delivering relevant ads. For the past 2 years, I have named only Googlebot in my robots.txt, and Mediabot has followed the same rules. Now I am being told that this has changed. Does anyone have any thoughts on this? Here is my new robots.txt. We'll see if this improves targeting.

User-agent: Mediapartners-Google*
User-agent: Slurp
User-agent: Googlebot
User-agent: Msnbot
Disallow: /honeypot.php

User-agent: *
Disallow: /

jomaxx

6:07 pm on Nov 6, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm not sure this is a change. IIRC the spiders have always had separate designations.

jdMorgan

6:32 pm on Nov 6, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> User-agent: Mediapartners-Google*

What is the asterisk --the star-- on the end for? I believe I'd delete that, because it may be treated literally.

Jim

ronburk

6:40 pm on Nov 6, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That is interesting, since all the bots are supposed to share a common cache now. I guess they're saying that even though Googlebot has cached your page, MediaPartners will observe the robot restriction and not read that page from the cache, as it would otherwise do.

ionchannels

6:41 pm on Nov 6, 2006 (gmt 0)

10+ Year Member



jdmorgan:
The directive with the star was given to me by adsense support. I suppose the user agent might have several different suffixes?

jdMorgan

6:58 pm on Nov 6, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Nevertheless, you shouldn't have to put anything in your robots.txt to allow for suffixes. The robots.txt Standard recommends that robots use a case-insensitive substring (prefix) match, without version number.

Jim
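To make the matching rule above concrete, here is a minimal Python sketch of the check the original robots.txt convention recommends (a case-insensitive substring match on the robot's name, ignoring version information). The function name `record_applies` and the sample name `Mediapartners-Google/2.1` are illustrative assumptions; this is not how any Google crawler is known to be implemented.

```python
def record_applies(record_token: str, robot_name: str) -> bool:
    """Decide whether a robots.txt 'User-agent:' value applies to a robot.

    Hypothetical helper: case-insensitive substring match of the record's
    token against the robot's name, with version information dropped.
    """
    token = record_token.strip().lower()
    if token == "*":
        # A lone asterisk is the catch-all record for every robot.
        return True
    # Drop version information such as "/2.1" from the robot's own name.
    name = robot_name.split("/")[0].strip().lower()
    return token in name


# Assumed robot name, used only for illustration.
robot = "Mediapartners-Google/2.1"
for token in ("Mediapartners-Google", "mediapartners-google",
              "Mediapartners-Google*", "Googlebot", "*"):
    print(f"{token!r}: {record_applies(token, robot)}")
```

Under this rule the plain token matches in any letter case, while `Mediapartners-Google*` does not match, because the trailing asterisk is compared literally. A crawler that gives the asterisk special meaning inside a User-agent value would behave differently.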

ionchannels

9:39 pm on Nov 6, 2006 (gmt 0)

10+ Year Member



You are all absolutely right, the asterisk should not be there. This must have been a typo from the adsense rep. The correct form is:

User-agent: Mediapartners-Google

I confirmed this with the webmaster tools from google. It is true that you need to include both googlebot and mediapartners-google if you want to explicitly allow both while denying other bots access.
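For anyone who wants to check this locally, here is a rough sketch using Python's standard `urllib.robotparser`. The `ORIGINAL` file is my assumed reconstruction of a robots.txt that names only Googlebot, and `CORRECTED` combines the asterisk-free line with the other records posted earlier in the thread. `SomeOtherBot` and the helper `can_fetch` are made-up names. This only shows how Python's parser reads the files; it says nothing definite about how Google's own crawlers interpret them.

```python
from urllib.robotparser import RobotFileParser

# Assumed reconstruction of the earlier file: only Googlebot is named.
ORIGINAL = """\
User-agent: Googlebot
Disallow: /honeypot.php

User-agent: *
Disallow: /
"""

# The records from this thread, with the asterisk removed.
CORRECTED = """\
User-agent: Mediapartners-Google
User-agent: Slurp
User-agent: Googlebot
User-agent: Msnbot
Disallow: /honeypot.php

User-agent: *
Disallow: /
"""


def can_fetch(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Parse robots.txt text and ask whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


for agent in ("Googlebot", "Mediapartners-Google", "SomeOtherBot"):
    print(agent, can_fetch(ORIGINAL, agent), can_fetch(CORRECTED, agent))
```

With Python's parser, Mediapartners-Google is refused by the Googlebot-only file and admitted by the corrected one, while the made-up bot is refused by both.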

ionchannels

10:01 pm on Nov 6, 2006 (gmt 0)

10+ Year Member



OK, this is strange. The Google AdSense help page:
[google.com...]
also includes the asterisk at the end of the mediabot user agent. However, the google robots.txt tool claims this is invalid. Very strange...

jdMorgan

11:03 pm on Nov 6, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You might want to e-mail both groups -- the 'regular search' group and the Mediabot group -- point out the discrepancy, and ask them to resolve it. Hopefully, that will get the two groups talking to each other and result in correct and consistent documentation and recommendations.

Jim

jimbeetle

11:17 pm on Nov 6, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The problem we have lately is that Google and Yahoo are starting to extend robots.txt with wildcards and other directives.

Using the Google robots.txt validation tool is very, very dangerous, as it only validates for what Google respects (with some discrepancies, as noted above). If you use wildcards and things like the "Allow" directive and validate through Google, don't be surprised if no other robots respect your robots.txt.

It's safer to do whatever it takes to stick with the current standard.

maxgoldie

11:48 pm on Nov 6, 2006 (gmt 0)

10+ Year Member



This might be something -- but this past weekend, my four-year-old, 100% clean site was banned from Google, completely overnight. When I looked at the diagnostics tab in Google Sitemaps, it told me that I have 38 HTTP header errors for every directory of mine.

For the past two years, I have aggressively blocked empty referrers and most bots with my htaccess file, except for Google. Now I have emptied most of the directives in my htaccess file, except for the essentials, to see if things change soon.

Maybe something changed recently with the user agent string for G's bots?

jomaxx

11:58 pm on Nov 6, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It seems odd to use an asterisk in the user agent, but that is the way that Google's documentation says to do it.

Ultimately that's just the arbitrary name that the Mediapartners spider chooses to answer to and obey. It could be anything; it doesn't need to correspond in any particular way to the actual user agent that the spider passes. In other words, this is different from using an asterisk in a "disallow" statement in the robots.txt file.

ionchannels

1:22 am on Nov 7, 2006 (gmt 0)

10+ Year Member



FWIW, I just got confirmation from AdSense support that the asterisk is the correct way to identify the Mediabot. I've asked that this be communicated to the search side so that their documentation, specifically the robots.txt tool in Google Webmaster Tools, can be corrected.