Google Wireless Transcoder

74.125.74.nnn acting badly

         

Umbra

1:35 pm on May 3, 2009 (gmt 0)

10+ Year Member



Since March, logs show activity from 74.125.74.nnn. Sometimes the user agent is "Google Wireless Transcoder" and sometimes it's a browser user agent plus "(via translate.google.com)".

However, there's also a handful of hits from that same specific IP address with bad user agents like "" and "\x90\xdf\xd6\x13".
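
A minimal sketch of how blank or malformed user agents like the two above might be flagged. The function name is hypothetical, and two things are assumptions: that a log may show a missing header as "-", and that real user agents consist only of printable ASCII.

```python
import re

# Matches the escaped form a server log typically shows, e.g. \x90
ESCAPED_BYTE = re.compile(r"\\x[0-9a-fA-F]{2}")


def is_suspicious_user_agent(ua):
    """Return True for blank user agents or ones carrying binary junk."""
    # Blank, missing, or "-" (assumed log placeholder for a missing header).
    if ua is None or ua.strip() in ("", "-"):
        return True
    # Escaped bytes as they appear in a log line.
    if ESCAPED_BYTE.search(ua):
        return True
    # Raw control characters or non-ASCII characters in the header itself.
    return any(ord(ch) < 0x20 or ord(ch) > 0x7E for ch in ua)
```

This only classifies the string. What to do with a flagged request is a separate policy decision.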

We've been (probably justifiably) harping on Microsoft for all those fake referers. But you know, Google sends out a mad jumble of transcoders, translators, accelerators, sitemaps and keyword tools, and sometimes blank user agents, some or all of which have no reverse DNS, and who knows what would happen if we blocked one or more of those IP ranges? And now we have one confused IP address that can't decide whether it's a transcoder or a translator, and that occasionally allows itself to get hijacked by blank or malformed user agents.

[edited by: incrediBILL at 3:18 pm (utc) on May 3, 2009]
[edit reason] Obscured IPs [/edit]

Hobbs

3:53 pm on May 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I am blocking agents containing "Transcoder" for sanity, but very interested to hear how others treat them.
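
A rule like this can be sketched as a substring test. The function name is hypothetical and the case-insensitive match is an assumption, since the post does not say how the match is done.

```python
def should_block(user_agent, needle="transcoder"):
    """True when the user agent contains the needle, ignoring case."""
    # A missing header is treated as an empty string, which never matches.
    return needle.lower() in (user_agent or "").lower()
```

A request handler would return a 403 when `should_block(...)` is true and serve the page otherwise.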

Samizdata

5:16 pm on May 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I allow Google Wireless Transcoder to access my mobile-specific content.

But Umbra's point is well made.

What I would like to see from Google is more attention to these matters, with proper documentation (on a public webpage) and rDNS implemented for all their bots.

People should not have to come to WebmasterWorld to find out what they are.

...

dstiles

5:38 pm on May 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google's IP ranges are a mess in many instances. Some have rDNS, others don't and what comes from which often seems interchangeable.

My own approach is to block Google requests that have bad UAs or headers but not to block the IP itself, as I would for similar activity from elsewhere.
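
A minimal sketch of that two-tier policy, under stated assumptions: the names are hypothetical, and the single /16 below is only a stand-in taken from the 74.125.74.nnn address in this thread, not a verified list of Google ranges.

```python
import ipaddress

# Illustrative stand-in only; not a verified list of Google networks.
SEARCH_ENGINE_NETWORKS = [ipaddress.ip_network("74.125.0.0/16")]


def decide(ip, bad_request):
    """Return 'allow', 'deny_request', or 'ban_ip' for one request."""
    if not bad_request:
        return "allow"
    addr = ipaddress.ip_address(ip)
    # Bad request from a search-engine range: refuse it, keep the IP usable.
    if any(addr in net for net in SEARCH_ENGINE_NETWORKS):
        return "deny_request"
    # Same behaviour from anywhere else: block the address itself.
    return "ban_ip"
```

How `bad_request` is determined (bad UA, invalid headers) is left to whatever checks are already in place.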

It really is time Google defined blocks for legit bots, test bots, test browsers, proxies and tools such as translator. It's not as if they are short of IPs.

Microsoft and Yahoo are in a similar mess and should consider the same.

Of course, once you know about specific IP ranges there is the option to block them, as I do with a certain range of MS IPs.

Umbra

6:25 pm on May 3, 2009 (gmt 0)

10+ Year Member



"It really is time Google defined blocks for legit bots, test bots, test browsers, proxies and tools such as translator. It's not as if they are short of IPs.

Microsoft and Yahoo are in a similar mess and should consider the same."

I would humbly suggest that Google, Microsoft, and Yahoo enforce a protocol for all spider and webtool development.
1) User agent must be descriptive
2) User agent must include a link to a webpage with more info (and webpage must actually work!)
3) Must set up a reverse DNS that is meaningful (i.e., not x234.abc.30a8x12.google.com)
4) Must include proper headers
5) Only 1 user agent per IP address at a time (and at least try to keep similar services in the same IP block)
6) If the bot or tool cannot comply with any of the above (because it's a stealth/cloaking checker or because it's just a one-time test), it must check the response code, and a 4xx or 5xx response must not result in any direct penalty or indirect opportunity cost to the website
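
From the webmaster's side, only points 1 and 2 can be checked from the user agent string alone. A crude sketch, with hypothetical names, that treats "descriptive" as merely "not blank" (an assumption, since the list does not define it) and does not test whether the linked page actually works:

```python
import re

INFO_URL = re.compile(r"https?://\S+")


def ua_checklist(user_agent):
    """Check a user agent against points 1 and 2 of the list above."""
    ua = (user_agent or "").strip()
    return {
        # Point 1, reduced to the weakest possible test: something was sent.
        "non_blank": bool(ua),
        # Point 2: the string carries a URL for more information.
        "has_info_url": bool(INFO_URL.search(ua)),
    }
```

Points 3 to 6 depend on DNS, headers and the crawler's own behaviour, so they fall outside what a string check can cover.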

If any web developer at Google/Yahoo/Microsoft fails to go through the protocol checklist, he/she should not be given the green light by management.

dstiles

7:33 pm on May 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'll go with that. :)

Plus, proxies and public tools should say so in rDNS. I don't mind legit translators, provided they don't scrape, but I'm quite prepared to dump stealth proxies.

And the bots should OBEY robots.txt, including the stealth junk that's around. I'm currently getting msnbot falling into traps and suiciding because it disobeys robots.txt, and Yahoo and Google stealth junk ignores it as well. Invalid headers don't help either, and those are common from Google and Yahoo as well.

It would also be nice if someone could come up with a better robots.txt. The current one is of the same poor quality as most of the web: HTML, CSS, DNS etc. are all badly designed, with second thoughts added later.

So: bets on when this will be implemented? I'm reckoning on about December 2030, by when Google, MSN and Yahoo will have been overtaken by something better... Yeah, right! :(

Umbra

3:42 pm on May 6, 2009 (gmt 0)

10+ Year Member



"I am blocking agents containing "Transcoder" for sanity, but very interested to hear how others treat them."

At Project Honeypot, the IP ranges for Google Wireless Transcoder (74.125.* and 64.233.* and 66.249.*) indicate some email harvesting and spamming. Project Honeypot doesn't try to measure scraping, so that's an unknown.

I took a sampling of IP addresses from the HTTP_X_FORWARDED_FOR header. I don't know if it's a representative sample, but 40% of users were from India, 14% from the USA, 3% from Canada, and the rest from Indonesia, Thailand, Pakistan, etc.
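
A sampling like this could be tallied roughly as follows. This is a sketch, not the method used for the figures above: the function names are hypothetical, `country_of` stands for whatever IP-to-country lookup is available, and the leftmost address in the header is assumed to be the end user (the header is client-supplied and can be forged).

```python
from collections import Counter
import ipaddress


def client_ip(xff_value):
    """Return the leftmost valid IP in an X-Forwarded-For value, or None."""
    for part in xff_value.split(","):
        candidate = part.strip()
        try:
            return str(ipaddress.ip_address(candidate))
        except ValueError:
            continue  # skip entries such as "unknown"
    return None


def country_shares(xff_values, country_of):
    """Percentage of sampled requests per country.

    country_of is a caller-supplied lookup (IP string -> country label).
    """
    counts = Counter()
    for value in xff_values:
        ip = client_ip(value)
        if ip is not None:
            counts[country_of(ip)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {country: 100 * n / total for country, n in counts.items()}
```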

I'm very tempted to deny all GWT IP ranges, just because it's so easy to exploit GWT as a proxy scraper. I'd block a tiny number of North American users, but I bet that sales conversions are almost zero on mobile browsers that need GWT, because iPhones and Blackberries have their own transcoder, don't they?

Plus, as far as I know, nobody at Google has ever officially or unofficially addressed webmaster concerns about GWT (just like they were quiet about Google Web Accelerator's flaws), so it's not like I can get any counterarguments from the other side.

enigma1

4:31 pm on May 6, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you detect any of the proxy headers (many are published, and even more are unknown), you could block access. The thing is, this particular IP (Google translator) doesn't resolve properly if you do an rDNS lookup and then resolve the result back to the IP, so that alone should be an indicator. I see the same thing in my server logs, and I block this type of access.
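
The rDNS-and-back check described here can be sketched with Python's standard `socket` module. The function name is hypothetical, and the resolver arguments exist only so the lookups can be swapped out for testing. As written it handles IPv4 only.

```python
import socket


def forward_confirmed(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """True when the IP's PTR hostname resolves back to the same IP."""
    try:
        hostname = reverse(ip)[0]        # reverse lookup: IP -> hostname
        addresses = forward(hostname)[2]  # forward lookup: hostname -> IPs
    except OSError:
        # No PTR record, or the hostname does not resolve at all.
        return False
    return ip in addresses
```

Passing this check only shows that the reverse and forward records agree. Whether the hostname belongs to a domain you trust still has to be checked separately.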

Of course this doesn't help much, as lots of scraping attempts pass all checks, and they could be initiated from compromised systems anywhere in the world that act as elite proxies, among other things.