- robots.txt? NO
- Uneven apostrophes in UA (only closing)
- site in UA yields this oh-so-descriptive info:
<html>
<head>
</head>
<body>
</body>
</html>
----- ----- ----- ----- -----
FWIW, bona fide amazonaws.com hosts spewed at least 33 bots on two of my sites in recent months. (Does someone get paid per bot or something?) Some bots may be new to some of you, or newly renamed. Here are the actual UA strings, in no particular order:
NetSeer/Nutch-0.9 (NetSeer Crawler; [netseer.com;...] crawler@netseer.com)
robots.txt? YES
Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
[Note ru.]
robots.txt? NO
feedfinder/1.371 Python-urllib/1.16 +http://www.aaronsw.com/2002/feedfinder/
robots.txt? NO
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9b4pre) Gecko/2008022910 Viewzi/0.1
robots.txt? NO
Twitturly / v0.5
robots.txt? NO
YebolBot (compatible; Mozilla/5.0; MSIE 7.0; Windows NT 6.0; rv:1.8.1.11; mailTo:thunder.chang@gmail.com)
robots.txt? NO
YebolBot (Email: yebolbot@gmail.com; If the web crawling affects your web service, or you don't like to be crawled by us, please email us. We'll stop crawling immediately.)
[Whattaya think robots.txt is for, huh?]
robots.txt? YES ... Four times in 45 minutes
Attributor/Dejan-1.0-dev (Test crawler; [attributor.com;...] info at attributor com)
robots.txt? NO
PRCrawler/Nutch-0.9 (data mining development project)
robots.txt? YES
EnaBot/1.2 (http://www.enaball.com/crawler.html)
robots.txt? YES
Nokia6680/1.0 ((4.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1 (botmobi find.mobi/bot.html) )
[Note spaced-out closing parens]
robots.txt? YES
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461) Java/1.5.0_09
robots.txt? NO
TheRarestParser/0.2a (http://therarestwords.com/)
robots.txt? NO
Mozilla/5.0 (compatible; D1GArabicEngine/1.0; crawlmaster@d1g.com)
robots.txt? NO
Clustera Crawler/Nutch-1.0-dev (Clustera Crawler; [crawler.clustera.com;...] cluster@clustera.com)
robots.txt? YES
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7
robots.txt? YES
yacybot (i386 Linux 2.6.16-xenU; java 1.6.0_02; America/en) [yacy.net...]
robots.txt? NO
Mozilla/5.0
robots.txt? NO
Spock Crawler (http://www.spock.com/crawler)
robots.txt? YES
TinEye
robots.txt? NO
Teemer (NetSeer, Inc. is a Los Angeles based Internet startup company.; [netseer.com...] crawler@netseer.com)
robots.txt? YES
nnn/ttt (n)
robots.txt? YES
AideRSS/1.0 (aiderss.com)
robots.txt? NO
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
robots.txt? NO
----- ----- ----- ----- -----
These two UAs alternated multiple times one afternoon:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
robots.txt? NO
WebClient
robots.txt? YES
----- ----- ----- ----- -----
And finally, way too many offerings from "Paul," who's apparently unable to make up his mind, UA name-wise:
Mozilla/5.0 (compatible; page-store) [email:paul at page-store.com
robots.txt? NO
Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com)
robots.txt? YES
Mozilla/5.0 (compatible; heritrix/1.12.1 +http://www.page-store.com) [email:paul@page-store.com]
robots.txt? YES
Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO
zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES
zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]
robots.txt? YES
Mozilla/5.0 (compatible; zermelo; +http://www.powerset.com) [email:paul@page-store.com,crawl@powerset.com]
robots.txt? NO
-----
Slippery little suckers indeed. Thank goodness I block amazonaws.com no matter what.
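For anyone who wants to reproduce the "robots.txt? YES/NO" notes above from their own logs, here is a rough sketch. It assumes Apache combined log format; the function name robots_txt_by_ua and the regex are made up for illustration, and UA strings containing escaped quotes will not parse.

```python
import re

# Assumed layout (combined log format):
# ip ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"$'
)

def robots_txt_by_ua(lines):
    """Map each UA string to True if it requested /robots.txt at least once."""
    fetched = {}
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # not in the assumed format; skip it
        ua = match.group("ua")
        asked = match.group("path") == "/robots.txt"
        # Keep an earlier True; otherwise record what this request did.
        fetched[ua] = fetched.get(ua, False) or asked
    return fetched
```

Feed it the lines of an access log (for example, an open file object) and print the resulting dict as YES/NO per UA.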
"Requesting my robots.txt leads to a site-wide ban." I'm curious as to how you do that, and also why?
You could do it by generating robots.txt dynamically. Hook into the 404 handler, examine the request, and synthesize robots.txt on the fly (there is no physical robots.txt file, so the request ends up in the 404 handler). You can check the UA for a URL-like string, which would imply the visitor wants to enter as a spider.
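A rough sketch of that idea, not anyone's actual setup: it assumes "URL-like" means an http(s):// link or a www. hostname somewhere in the UA, and the names ua_has_url, handle_robots_request and the seen dict are hypothetical. What to do with the recorded IPs (ban or not) is left to the caller.

```python
import re

# Assumption: a "URL-like string" is an http(s):// link or a www. hostname.
URL_LIKE = re.compile(r'(?:https?://|www\.)[\w.-]+\.[a-z]{2,}', re.IGNORECASE)

def ua_has_url(user_agent):
    """True if the UA advertises something that looks like a URL."""
    return URL_LIKE.search(user_agent) is not None

def handle_robots_request(ip, user_agent, seen):
    """Called from the 404 handler when the request path is /robots.txt.

    Records whether the requester identified itself with a URL, then
    returns a robots.txt body generated on the fly.
    """
    seen[ip] = ua_has_url(user_agent)
    return "User-agent: *\nDisallow: /\n"
```

Note that a bare domain such as "(aiderss.com)" does not count as URL-like under this assumption; loosen the pattern if you want it to.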
Is it good to ban IPs based on it? No, it's not. Or maybe it is, for the competitors. Once they figure it out, countermeasures include exposing URLs, via links or images inside their own page HTML, that point at the robots.txt file or disallowed directories on the other server, triggering the latter to ban real visitors and popular spiders. I wouldn't want to go down that path.
As for the original issue with the amazonaws bots: their UAs pretend they belong to a search engine, but they don't. At least, the URLs they advertise lead to no public search page, so I block them by host and redirect them to a blackhole. I don't even want to waste b/w on 403 or 404 content.
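Blocking by host as described above could look something like this. It is only a sketch: BLOCKED_SUFFIXES, host_is_blocked and ip_is_blocked are names invented here, and treating an IP with no reverse DNS as "not blocked" is an assumption you may want to flip.

```python
import socket

# Hypothetical block list; amazonaws.com is the host discussed in this thread.
BLOCKED_SUFFIXES = ("amazonaws.com",)

def host_is_blocked(hostname, suffixes=BLOCKED_SUFFIXES):
    """True if hostname is a blocked domain or any subdomain of one."""
    host = hostname.rstrip(".").lower()
    return any(host == s or host.endswith("." + s) for s in suffixes)

def ip_is_blocked(ip, suffixes=BLOCKED_SUFFIXES):
    """Reverse-resolve the IP and test the resulting host name."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False  # no PTR record; this sketch lets it through
    return host_is_blocked(hostname, suffixes)
```

The suffix test requires a dot boundary, so a look-alike such as notamazonaws.com is not caught by accident.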
AmazonAWS [aws.amazon.com] is a server cloud, née farm, not a public search engine per se. (A9.com [a9.com], another Amazon-owned site, is search-related.) It's anybody's guess who/what the amazonaws.com-based bots are crawling for, literally, because the AWS cloud is a cloak.
robots.txt? YES
Ironically, given another current thread, the domain registration suggests contacting a certain .org site for more info. Good ole netsol! :)
174.129.111.#*$! - - [10/Apr/2009:19:49:26 -0400] "GET /robots.txt HTTP/1.0" 200 0 "-" "linkdexbot/Nutch-1.0-dev (http://www.example.com/; crawl at bla dot com)"
IP resolves to ec2-#*$!.compute-1.amazonaws.com and states "crawl"
But none of the recorded "crawl" links offers any public search facility. Also, since I don't want to go around in circles chasing UAs, the various host names, and the way AWS changes the IPs every time, I block ultradns based on the DNS records, along with everyone else who comes from it.
PS: I altered the URLs in the entry.
robots.txt? YES
-----
Related, from 11/08:
ec2-[yada-yada].compute-1.amazonaws.com
rdfbot/Nutch-1.0-dev
robots.txt? YES
-----
P.S. Bot bits:
host-202-137-236-nn.rediffdns.com
rdfbot/Nutch-1.0-dev
robots.txt? YES
host-202-137-237-nnn.rediffdns.com
IIITBOT/1.1 (Indian Language Web Search Engine; [webkhoj.iiit.net;...] pvvpr at iiit dot ac dot in)
robots.txt? YES
67.202.42.--- - - [--/May/2009:--:--:-- -0400] "GET / HTTP/1.1" 301 20 "-" "AISearchBot (Email: aisearchbot@gmail.com; If your web site doesn't want to be crawled, please send us a email.)"
In the same IP range as Pfui mentioned at the beginning of the thread.
If you check the DNS records, they all point to ultradns. So instead of chasing around the various IP ranges, block based on the DNS. I found it to be more effective.
72.44.61.194 - - [19/May/2009:22:14:44 -0400] "GET / HTTP/1.1" 301 5 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6 (.NET CLR 3.5.30729) RPT-HTTPClient/0.3-3"
When I checked whether the IP responds to HTTP requests (port 80), it loaded something like the main Google search page. Does anyone have info about it?
[72.44.61.194...]
[72.44.61.194...]
Check out the 404 for this dir:
[72.44.61.194...]
Looks like all sorts of companies/hosts hide behind AWS. I wonder if your discovery might have anything to do with Bill's "safebrowsing diagnostic spidering possibly going on at Google that may not be the standard Googlebot [webmasterworld.com]" topic?
If nothing else, that "RPT-HTTPClient [webmasterworld.com]" UA appendage is uncool -- and seriously ancient.
Well, at least the Google-on-AmazonAWS IP didn't show me as logged in to Google...
And that's just one range, and we remember the discussions about strange Google visits that do not appear to be from Googlebot. So to me AWS looks tiny in comparison.
PS: Here is a related thread
[webmasterworld.com...]
ec2-[yada-yada].compute-1.amazonaws.com
Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...]
robots.txt? NO
AMAZON-EU-AWS
Amazon Web Services, Elastic Compute Cloud, EC2, EU
79.125.0.0 - 79.125.63.255
Full range of Amazon on that block:
IE-AMAZON-20070824
Amazon Data Services Ireland Ltd
79.125.0.0 - 79.125.127.255
Full range now blocked here.
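If your blocking tool wants CIDR notation rather than a start-end range, Python's standard ipaddress module can do the conversion. A small sketch using the full range quoted above; the helper name in_blocked_range is made up here.

```python
import ipaddress

# The full Amazon Data Services Ireland range quoted above.
first = ipaddress.ip_address("79.125.0.0")
last = ipaddress.ip_address("79.125.127.255")

# Collapse the start-end range into the fewest CIDR blocks that cover it.
networks = list(ipaddress.summarize_address_range(first, last))

def in_blocked_range(ip, nets=networks):
    """True if the address falls inside any of the summarized networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in nets)
```

For this range, networks comes out as the single block 79.125.0.0/17.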
Anyone know of any other ex-USA ranges?