However, there's also a handful of hits from that same specific ip address with bad user agents like "" and "\x90\xdf\xd6\x13"
We've been (probably justifiably) harping on Microsoft for all those fake referers, but you know, Google sends out a mad jumble of transcoders, translators, accelerators, sitemaps and keyword tools, and sometimes blank user agents, some or all of which have no reverse DNS. Who knows what would happen if we blocked one or more of those IP ranges? And now we have one confused IP address which can't decide if it's a transcoder or a translator, and which occasionally allows itself to get hijacked by blank or malformed user agents.
[edited by: incrediBILL at 3:18 pm (utc) on May 3, 2009]
[edit reason] Obscured IPs [/edit]
But Umbra's point is well made.
What I would like to see from Google is more attention to these matters, with proper documentation (on a public webpage) and rDNS implemented for all their bots.
People should not have to come to WebmasterWorld to find out what they are.
...
My own approach is to block google accesses for bad UAs / Headers but not block the IP itself, which I would do for similar activity from elsewhere.
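A rough sketch of that policy in Python, for anyone who wants to try it. The network list, the function names and the three return labels are all made up for illustration, and the single range shown is only a placeholder, not a verified list of Google addresses.

```python
import ipaddress

# Assumption: placeholder range only; substitute ranges you have verified yourself.
TRUSTED_NETWORKS = [ipaddress.ip_network("66.249.0.0/16")]


def is_bad_user_agent(ua):
    """True for blank agents or agents containing control / non-ASCII characters."""
    if not ua or not ua.strip():
        return True
    return any(ord(ch) < 32 or ord(ch) > 126 for ch in ua)


def decide(ip, ua):
    """Return 'allow', 'deny_request' or 'deny_ip' (hypothetical labels)."""
    if not is_bad_user_agent(ua):
        return "allow"
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in TRUSTED_NETWORKS):
        # Refuse this one hit but leave the address itself unblocked.
        return "deny_request"
    # The same behaviour from anywhere else gets the address blocked.
    return "deny_ip"
```

The point of the split is that a bad hit from a range you still want crawled costs only that one request, while the same hit from an unknown address is treated as hostile.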
It really is time google defined blocks for legit bots, test bots, test browsers, proxies and tools such as translator. It's not as if they are short of IPs.
Microsoft and Yahoo are in a similar mess and should consider the same.
Of course, once you know about specific ranges of IP there is the option to block them. As I do with a certain range of MS IPs.
I would humbly suggest that Google, Microsoft, and Yahoo enforce a protocol for all spider and webtool development.
1) User agent must be descriptive
2) User agent must include a link to a webpage with more info (and webpage must actually work!)
3) Must set up a reverse dns that is meaningful (i.e., not x234.abc.30a8x12.google.com)
4) Must include proper headers
5) Only 1 user agent per ip address at a time (and at least try to keep similar services in the same ip block)
6) If the bot or tool cannot comply with any of the above (because it's a stealth/cloaking checker or because it's just a one-time test), it must check the response code, and a 40x or 50x response must not result in any direct penalty or indirect opportunity cost to the website
If any web developer at Google/Yahoo/Microsoft fails to go through their protocol checklist, he/she should not be given the green light from the management.
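From the webmaster's side, points 2 and 3 of that checklist can be tested mechanically. The following is only a sketch: the function names are invented here, the hostname suffixes are whatever you choose to trust, and the resolver arguments exist so the check can be tried without touching real DNS.

```python
import re
import socket


def user_agent_has_info_url(ua):
    """Point 2: does the agent string carry an http(s) link to more info?"""
    return re.search(r"\+?https?://[^\s;)]+", ua) is not None


def forward_confirmed_rdns(ip, allowed_suffixes,
                           reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Point 3: reverse lookup, suffix check, then forward lookup back to the IP."""
    try:
        host = reverse(ip)[0]
    except OSError:
        return False  # no reverse DNS at all
    if not host.endswith(tuple(allowed_suffixes)):
        return False  # hostname is not under a suffix we trust
    try:
        addresses = forward(host)[2]
    except OSError:
        return False
    # Only trust the hostname if it resolves back to the original address.
    return ip in addresses
```

Suffixes should start with a dot (for example ".bot.example.com") so that a lookalike domain ending in the same letters does not pass. Whether the link in the agent string actually works still has to be checked by hand.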
Plus, proxies and public tools should say so in rDNS - I don't mind legit translators, providing they don't scrape, but I'm quite prepared to dump stealth proxies.
And for the bots to OBEY robots.txt - including the stealth junk that's around. I'm currently getting msnbot falling into traps and suiciding because their bot disobeys robots.txt, and yahoo and google stealth junk ignores it as well. Not helped by invalid headers, either, but that's common for google and yahoo as well.
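For reference, obeying robots.txt takes very little code on the bot's side. Here is a minimal sketch using Python's standard urllib.robotparser; the /trap/ path, the ExampleBot name and the example.com URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that fences off a bot trap directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A bot that runs a check like this before every request never reaches the trap, which is why hits on a disallowed trap URL are a fair signal that the bot is ignoring the file.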
It would also be nice if someone could come up with a better robots.txt. The current one is the same poor quality as most of the web: html, css, dns etc are all badly designed with second thoughts added later.
So: bets on when this will be implemented? I'm reckoning on about December 2030, by when google, msn and yahoo will have been overtaken by something better... Yeah, right! :(
I am blocking agents containing "Transcoder" for sanity, but very interested to hear how others treat them.
At Project Honeypot, the IP ranges for Google Wireless Transcoder (74.125.* and 64.233.* and 66.249.*) indicate some email harvesting and spamming. Project Honeypot doesn't try to measure scraping, so that's an unknown.
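The wildcard ranges quoted above can be written as /16 CIDR blocks and tested with Python's ipaddress module. This is a sketch only: the names are invented here, and the blocks simply restate the three wildcards from this post, which are far broader than the transcoder alone.

```python
import ipaddress

# The three wildcard ranges from the post, restated as /16 networks.
QUOTED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("74.125.0.0/16", "64.233.0.0/16", "66.249.0.0/16")
]


def in_quoted_ranges(ip):
    """True if ip falls inside any of the ranges quoted above."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in QUOTED_NETWORKS)
```

Because those blocks carry other Google services as well, a match is better used alongside a user agent check than as a reason to deny on its own.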
I took a sampling of IP addresses from the HTTP_X_FORWARDED_FOR header. I don't know if it's an indicative sample, but 40% of users were from India, 14% from USA, 3% from Canada, and rest from Indonesia, Thailand, Pakistan, etc.
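A sampling like that could be tallied along these lines. This is a sketch rather than the method actually used: country_of stands in for whatever geolocation lookup you have, the function names are invented, and the left-most address in X-Forwarded-For is assumed to be the end user, which a client can spoof.

```python
from collections import Counter


def client_ip(xff):
    """Left-most address in an X-Forwarded-For value, or None if empty."""
    first = xff.split(",")[0].strip()
    return first or None


def country_shares(header_values, country_of):
    """Percentage of sampled requests per country.

    country_of is a hypothetical callable mapping an IP string to a
    country code; plug in your own geolocation lookup.
    """
    counts = Counter()
    for value in header_values:
        ip = client_ip(value)
        if ip:
            counts[country_of(ip)] += 1
    total = sum(counts.values())
    if not total:
        return {}
    return {country: round(100 * n / total, 1) for country, n in counts.items()}
```

Empty header values are skipped rather than counted, so the percentages are relative to requests that actually carried a forwarded address.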
I'm very tempted to deny all GWT IP ranges, just because it's so easy to exploit it as a proxy scraper. I'd block a tiny number of North American users, but I bet that sales conversions are almost zero on mobile browsers that need GWT, because iPhones and Blackberries have their own transcoder, don't they?
Plus, as far as I know, nobody at Google has ever officially or unofficially addressed webmaster concerns about GWT (just like they were quiet about Google Web Accelerator's flaws), so it's not like I can get any counterarguments from the other side.
Of course this doesn't help much, as lots of scraping attempts pass all checks, and they could be initiated from compromised systems anywhere in the world that act as elite proxies, among other things.