At Home with the Robots
My htaccess was getting fatter and the Ignore list was getting longer. Time for some housekeeping. This means: go back to the beginning and look at all my robots as if I'd never met them before. Within reason, ahem.
At the beginning of January I commented-out the whole "Ignore" list and about half the IP blocks in htaccess. (Keeping the ones that are actively malign, or that come from Belarus or something. I did say "within reason".) A few places I goofed and had to restore the block, but most of them stayed open until the end of the month. After the usual log-wrangling and some extra stuff, I ended up with a bunch of robot information that I normally don't even look at.
Disclaimer: My robots.txt doesn't call anyone by name, except the w3c link checker which has to be allowed in everywhere. (I can't ask it to check links and then refuse to let it in!) I figure bad robots will just ignore robots.txt anyway, so they go straight into htaccess. There's also nothing about crawl-delay. Same reasoning. Finally, most excluded directories are images, so robots don't have a lot of opportunity to violate robots.txt. One html file in a blocked directory is only two links away from the front page. That's the one I look for.
When I say "MSIE Generic" I mean
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1). It's especially popular with Asian robots.
No Skin off my Nose:
There's Googlebot...
Or, I guess, plural googlebots. Along with the plain googlebot I get the imagebot and the mobilebot.
Google Search:
IP: 66.249.63-95
UAs:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot-Image/1.0
Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
The google family is a bit random about robots.txt. Generally they'll pick it up once or twice a day and-- I assume-- spread the information around the rest of the family.
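The IP shorthand used in these listings (66.249.63-95 meaning a third octet from 63 through 95, and 74.125 meaning anything that starts that way) can be checked mechanically. A minimal Python sketch, assuming plain dotted IPv4 addresses; the function name is made up for illustration, and the "..106" style of abbreviation is not handled:

```python
def in_shorthand_range(ip: str, shorthand: str) -> bool:
    """Check a dotted IPv4 address against shorthand like '66.249.63-95'.

    Each dot-separated piece of the shorthand is either a single number
    or a low-high pair. Octets the shorthand does not mention match anything.
    """
    ip_octets = [int(part) for part in ip.split(".")]
    for octet, spec in zip(ip_octets, shorthand.split(".")):
        low, _, high = spec.partition("-")
        if not int(low) <= octet <= int(high or low):
            return False
    return True


print(in_shorthand_range("66.249.71.10", "66.249.63-95"))
print(in_shorthand_range("66.249.96.1", "66.249.63-95"))
```

The first call falls inside the listed range and the second falls just outside it.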
Faviconbot:
IP: 74.125
UA: blank
If I didn't have a <files> exemption for favicon.ico, the faviconbot would never get in at all.
Google Preview:
IP: 66.102.0-15; 74.125; 209.85
UA: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/534.51
Google Preview is not a robot. At least not in its own eyes. It goes everywhere, including roboted-out directories; I've had to htaccess it to keep it from grubbing around in piwik.
Google Translate:
UA:
human user's UA, plus ,gzip(gfe) (via translate.google.com)
OR
UA:
human user's UA only, plus referer translate.googleusercontent.com/
Translate gets two lines of its own in the anti-hotlinking routine, because there are actual humans at the far end of those g### referers.
and bingbot too...
The single most common robot if you count all requests. Leave out robots.txt and it drops to #2, behind-- nope, guess again.
I always assumed everything coming from the assorted bing/MSN ranges was much of a muchness. WRONG. There are three entirely distinct critters.
IP ranges: 65.52, 157.55, 207.46
plus 65.54.247.145 for BingSiteAuth (BWT)
bingbot
UA: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Harmless apart from its unhealthy appetite for robots.txt files. In the course of the month, something like 80% of all robots.txt requests came from the bingbot. Its record was 102 consecutive hits over a period of about 24 hours.
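Figures like these can be pulled out of a log once it has been reduced to (user-agent, path) pairs in request order. A rough Python sketch; the helper names and the sample data are invented for illustration and are not taken from the real logs:

```python
from collections import Counter
from itertools import groupby

ROBOTS = "/robots.txt"


def robots_txt_share(requests):
    """Return each user-agent's fraction of all robots.txt requests."""
    counts = Counter(ua for ua, path in requests if path == ROBOTS)
    total = sum(counts.values())
    return {ua: n / total for ua, n in counts.items()} if total else {}


def longest_robots_streak(requests, user_agent):
    """Longest run of back-to-back robots.txt requests by one user-agent."""
    own_paths = [path for ua, path in requests if ua == user_agent]
    streaks = [
        sum(1 for _ in run)
        for is_robots, run in groupby(own_paths, key=lambda p: p == ROBOTS)
        if is_robots
    ]
    return max(streaks, default=0)


# Invented sample, in request order.
sample = [
    ("bingbot", "/robots.txt"),
    ("bingbot", "/robots.txt"),
    ("YandexBot", "/robots.txt"),
    ("bingbot", "/index.html"),
    ("bingbot", "/robots.txt"),
]
print(robots_txt_share(sample))
print(longest_robots_streak(sample, "bingbot"))
```

Here a streak is counted within one robot's own requests, so visits by other robots in between do not break it.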
MSN media
IP: 207.46
UA: msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
An impeccably behaved robot. Each visit has three stops: robots.txt, an image file, and the html file that the image belongs to. On the very last day of the month it had a kind of seizure and got twelve consecutive robots.txt files, but otherwise utterly predictable.
The plainclothes MSIE-bot
IP: 207.46
UA:
begins with Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1 or 5.2 and then varies at random
Bad robot. Apart from always claiming to be MSIE 7, the UA changes every time. robots.txt? Who, me? I'm just a human using MSIE 7. Each visit has two or three pickups: a random html file, and any non-image subsidiary files like css or js. This generally results in a lockout when it tries to get piwik.js.
At the end of the month I blocked this non-robot in htaccess.
The rising Yandex star...
IP (Russia): 77.88.30.248; 95.108.151.244; 178.154.143.83
IP (US): 199.21.99.80, ..106
UA:
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)
Busy robot, and currently very well-behaved. If you exclude robots.txt, Yandex was my single most common visitor this month. Like the Googlebot, it picks up robots.txt once or twice a day, and does what it says. The imagebot comes by less often, and always starts with robots.txt. The 95.108. range used to be used by YandexImages only, but has now been taken over by the regular YandexBot while Images use 178.154. But almost all Yandex visits this month came from their new-to-me US range. They never used it before this calendar year.
and the rest...
IA archiver
IP: 174.129.237.157
User-Agent: ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)
IP: 207.241.224.41
User-Agent: ia_archiver(OS-Wayback)
Some people hate this robot on principle, some love it. As a robot it behaves perfectly well. Each separate visit begins with robots.txt. Late in the month they tried out a new robot, ia_archiver(OS-Wayback), from a different IP, 207.241.224.41. The range belongs to TIA but I checked with them and they confirmed it's their robot. I believe this is the only time anyone has ever answered an "Is this your robot?" e-mail. Kudos.
exabot
IP: 193.47.80.81
UA: Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
robots.txt, random file; robots.txt, random file. Yawn.
Ezooms
IP: 208.115.111.72, ..113.88
UA: Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)
Same behavior as exabot. Distinguishing feature: UA that includes a gmail address. Normally blocked because they share an IP with the not-so-nice dotbot. But dotbot hasn't been around in a while. They probably have a wider IP range, but during the whole month they only used those two.
findlinks
IP: 77.20-21; 130.82-84; 139.18
UA for 77.20-21: findlinks/2.0.2 (+http://wortschatz.uni-leipzig.de/findlinks/)
UA for 130.82-84: same except 2.0.4
UA for 139.18: same except 2.1.5
I really don't know what they're up to, but you can't get much more respectable than Leipzig University. Various IPs, all belonging to Leipzig U. Info page says alarmingly "Changes in [robots.txt] are noticed at the latest after 30 days." They don't mean "noticed", they mean "take effect"; the German side says "Änderungen ... wirken sich ... aus". I doubt this is true, though, because they pick up robots.txt every single time.
gigabot
IP: 64.22.106.82
UA: Gigabot/3.0 (http://www.gigablast.com/spider.html)
The name is wishful thinking. This has to be the Most Boring Robot Ever. robots.txt, front page, robots.txt, front page, and so on through the month. I am not a front-driven site; the front page never changes.
MJ12bot
User-Agent: Mozilla/5.0 (compatible; MJ12bot/v1.4.1; [majestic12.co.uk...]
Obvious problem: they refuse to have their own IP, so you can't tell if it's the real thing or a spoofer. They also seem to have a lot of trouble getting names right: constant directory-slash redirects alternating with top-level www redirects. One time they asked for the entire contents of the /hovercraft/ directory, but they forgot to include the directory name. Result: 404, 404, 404.
orangeask
IP: 50.23.239.14
UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
note the space
Distinguishing feature: uses its name as referer. Lives in an unsavory neighborhood, but its own behavior is almost as boring as gigabot. Shrug.
picsearch
IP: 217.212.224.183 for robots.txt, ..181 for files
User-Agent: psbot/0.1 (+http://www.picsearch.com/bot.html)
I don't know why they use this name. They didn't ask for a single picture all month, just a handful of html files. (They did pick up a picture or two in early February.)
seznambot
IP: 77.75.77.11, .17
UA: SeznamBot/3.0 (+http://fulltext.sblog.cz/)
I don't read Czech so I have no idea what this recent arrival is all about. Showed up in mid-October. UA points to a blog-- in Czech-- which lives in a domain that looks like a standard ISP. But they ask for robots.txt and seem to have a literary bent. Most robots avoid the ebooks directory like the plague; seznam goes straight for it.
YahooCacheSystem
IP: 98.139.241.24n
UA: YahooCacheSystem
At some time in the past this got onto my Ignore list. At month's end I blocked them by htaccess, so they are now on the I Don't Like Your Face side. Every single visit is two stops: front page + favicon. robots.txt? What's that?
Yeti and Baidu (Japan)
IP: 61.247.204 (Yeti), 119.63.196 (Baidu)
UA:
Yeti/1.0 (NHN Corp.; [help.naver.com...]
Baiduspider+(+http://www.baidu.com/search/spider.htm)
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
For some reason I never looked up Yeti until this month. They're from Korea. All they ever do is pick up robots.txt followed by the front page. That puts them in "no skin off my nose" territory. I lumped them together with the Japanese Baidu (same UAs as China, different IP), which hardly ever asks for anything but robots.txt.
The rest of the rest:
Finally there are the visitors who must have some pretensions to robot-dom, because they asked for robots.txt. Most of them went on to ask for the front page and nothing more. Honestly, if that's all they're going to ask for, I don't even particularly care if they got robots.txt or not.
I Don't Like Your Face:
Baidu (China) and the gang:
IP:
Baidu: 123.125.71 (123.112-127); 180.76.5-6 (also 180.149.128-149); 220.181.108
Soso: MSIE Generic 124.115.1.7-8, Sosospider 124.115.6.13 for robots.txt only
Sogou: 220.181.125.68 (220.160-191)
UA:
Baiduspider+(+http://www.baidu.com/search/spider.htm)
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
(both same as Baidu-Japan)
Baiduspider-image+(+http://www.baidu.com/search/spider.htm)
Sosospider+(+http://help.soso.com/webspider.htm)
Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
Baidu-China and Baidu-Japan use the same two UAs, they just operate from different IPs. I also get the sosospider, sogou, jikespider... Aw, heck, just lock out the whole country.
The sosospider seems to have been demoted-- or purged or whatever they do to robots-- in September. Now it is only allowed to identify itself when it asks for robots.txt. The rest of the time it comes in as MSIE Generic. Conversely, Sogou is embarrassed to be seen asking for robots.txt, so it takes its clothes off.
There are a bunch of other Chinese robots but there's a dreary sameness to them. They hardly ever ask for anything but the front page, and they all without exception wear MSIE Generic. And, like MJ12, they seem to have a terrible time remembering that it's with www. to get in. Or to get locked out, as the case may be.
The Ukrainians
IP: 92.249.0-127, 109.120.128-191, 178.136-137, 193.106.136-139, 213.110.128-159
UA: random
I suppose everyone has Ukrainians. Not necessarily from the Ukraine, of course. ("A John Wayne movie doesn't have to have John Wayne in it.") They used to make me absolutely livid. But I've grown fond of them because, well, they don't do anything. They swing by with a made-up UA and obviously bogus referer (this month's favorite: vampira.ru), pick up three copies of a page that's no earthly use without the accompanying images, and leave.
They've got two basic patterns: the three-hit visit and the six-hit visit.
Three-hit: put on a random UA, make up a referer, ask for three successive copies of Know Your Lion, leave.
Six-hit: same, only this time each request is in pairs: Lion again, and front page. Change UAs between each pair, but keep the same referer.
Somewhere along the line they've picked up a new guilty thrill: spoofing the googlebot. This is so exciting that when they make a six-hit visit they don't change UAs but stick with the ersatz googlebot all the way through. They've also got a variant where the Googlebot string is preceded by \xef\xbb\xbf. That's the UTF-8 encoding of U+FEFF, the byte order mark, which is also one form of the "zero width no-break space". In other words, it's supposed to be invisible but doesn't quite work that way.
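The three bytes \xef\xbb\xbf in front of the Googlebot string can be examined directly. A small Python sketch, assuming the raw header is available as bytes; strip_bom is a name made up for this example:

```python
import unicodedata


def strip_bom(raw_ua: bytes) -> str:
    """Decode a raw User-Agent header, dropping a leading UTF-8 BOM if present."""
    return raw_ua.decode("utf-8-sig")


raw = (
    b"\xef\xbb\xbfMozilla/5.0 (compatible; Googlebot/2.1; "
    b"+http://www.google.com/bot.html)"
)

# Plain UTF-8 decoding keeps the invisible character at the front.
first_char = raw.decode("utf-8")[0]
print(hex(ord(first_char)), unicodedata.name(first_char))

# The "utf-8-sig" codec removes it, leaving the ordinary UA string.
print(strip_bom(raw))
```

With the prefix stripped, the spoofed string can be compared against the real Googlebot UA like any other.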
At one point they arrived from a new IP that I hadn't blocked (the 213.110 range). Obviously the same robot. But they didn't do anything different or ask for any additional files.
ahrefsbot
IP: 213.186.127.6
User-Agent: Mozilla/5.0 (compatible; AhrefsBot/2.0; +http://ahrefs.com/robot/)
I have no idea what, if anything, they're about. I just know that they seem to think robots.txt is non-perishable: about once a month they pick up three copies in a batch, and then carry on regardless. Don't know whether they even read it; they don't dig deeply enough for me to be sure.
facebookexternalhotlink
IP: 69.171.224-255; 66.220.144-159
UA: facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php) or 1.1
I may not have their name exactly right, because I can't pretend I've never seen these guys before. They Just Annoy Me. No robots.txt, of course. They're blocked from all images because their only reason for getting the files is to facilitate hotlinks. They do get a few html files. No idea what they do with them; they're invariably logged as 206. (206 means "Partial Content": they asked for only part of the file and got it. That's different from the "if/then" kind of request, "If the file has changed, give it to me, and otherwise just say so", which gets logged as 304. Logs don't say what form the request took.)
Odd detail I just noticed: The 1.0 UA string is used for one specific image. The 1.1 version is for everything else. It doesn't coincide with anything obvious like the first time they tried for the file.
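For the status codes that keep turning up in these logs, Python's standard library carries the official reason phrases. A small sketch (the helper name is arbitrary): 206 is "Partial Content", the reply to a request for part of a file, while the "only if it has changed" kind of request is answered with 304.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a status code together with its standard reason phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


# The codes mentioned in this write-up.
for code in (206, 301, 304, 403, 404):
    print(describe_status(code))
```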
oBot
IP: 206.253.224
UA#1: Mozilla/5.0 (compatible; oBot/2.3.1;+http://www-935.ibm.com/services/us/index.wss/detail/iss/a1029077?cntxt=a1027244)
UA#2: Mozilla/5.0 (compatible; oBot/2.3.1; +http://filterdb.iss.net/crawler/)
Remember when everyone was paranoid about IBM? This is their robot. They're apparently returning from a previous visit over a year ago, because they came in asking for a batch of image files that used to exist but no longer do. A week later they came back, picked up all the replacement images, plus one wildly random painting-- and then failed to get a few files belonging to "Three Blind Mice" because they got the casing wrong.
Oh, yes, they did start each visit by asking for robots.txt. But since one of the image files they asked for-- twice-- is in a roboted-out directory, they can't have read it very carefully.
Trendmicro
IP: 150.70 and 216.104.15
User-Agent: MSIE Generic
Some folks don't mind them. I say I don't like their face, even when they're tagging along behind a human. This happens sometimes but not consistently.
websense
IP: 208.80.192-199 (this month they only used ..194)
User-Agent: begins Mozilla/4.0 (compatible; MSIE 6.0; or 7.0; the rest varies
One of the first robots I ever blocked. I think "sense" is a euphemism for "censor". Could not reuse a UA to save its life, even if it's just been redirected and never even got to use the first UA. They have never asked for anything but my front page. That shows how much they annoy me. Generally the front page doesn't count, because there's nothing on it.
They have never visited me from their other IP, 208.87.232-239. It's much worse. During this same month they blasted my art studio's site, picking up every single file-- most of them roboted-out-- in one visit. Luckily they don't read javascript, so they didn't get the full-size images. Oh, and they forgot robots.txt over there too.
auto-referers
Is this the latest robotic trend? My site happens to be laid out so no page ever includes a link to itself. (As a user, I get annoyed and confused by sites that do this. "Uhm, wait, wasn't I on this page already?") But robots like doing it. They probably think-- correctly-- that this makes it easier to slip under the site's radar. I'm ### if I can hammer out the right wording for htaccess. Most requests are for e-books-- which are in the public domain anyway-- so it's definitely The Principle of the Thing.
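Spotting these requests in the logs, as opposed to blocking them in htaccess, is fairly mechanical. A Python sketch, assuming the Apache combined log format and a hypothetical www.example.com as the site's hostname; the pattern, the function name and the sample lines are illustrative only:

```python
import re
from urllib.parse import urlsplit

# Combined log format: ip, two ids, [time], "request", status, size,
# "referer", "user-agent".
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def is_self_referer(line, site_hosts=("www.example.com", "example.com")):
    """True when a request names its own URL as the referer."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return False
    referer = urlsplit(match.group("referer"))
    return referer.netloc in site_hosts and referer.path == match.group("path")


# Invented log lines using a documentation-only IP address.
suspect = (
    '203.0.113.5 - - [10/Jan/2012:10:00:00 +0000] '
    '"GET /ebooks/page.html HTTP/1.1" 200 1234 '
    '"http://www.example.com/ebooks/page.html" "Mozilla/5.0"'
)
normal = suspect.replace("/ebooks/page.html\" \"Mozilla", "/ebooks/\" \"Mozilla")
print(is_self_referer(suspect), is_self_referer(normal))
```

This only flags the pattern; whether to block on it is a separate decision, since it relies on the site never linking a page to itself.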
snoops
This batch makes me uneasy because I don't know which ones are legitimately collecting information and which ones are up to no good. Their referer typically has a query involving my sitename, along with something vaguely legit-sounding. About half of them are blocked by IP.
"mydomain.com" is me. I've deleted the http:// element from the referer, and replaced one word with "zzz" because apparently you're not allowed to say it.
January's batch is typical:
IP: 62.219.132.115 (twice)
Referer 1: www.we-globe.net/Weblab/SiteCommonGraveReport/mydomain.com/
Referer 2: www.we-globe.net/Weblab/SiteCommonGraveReport/www.mydomain.com/
UA: Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8
IP: 64.246.nn.nn (twice)
Referer: whois.zzz/mydomain.com
UA: Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.9.0.13) Gecko/2009073022 Firefox/3.5.2 (.NET CLR 3.5.30729) SurveyBot/2.3 (zzz)
IP: 77.78.109.76
Referer: www.pagesinventory.com/domain/www.mydomain.com.html
UA: Mozilla/5.0 (Windows; U; Windows NT 5.1; cs; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11
IP: 88.214.193.166 (twice)
Referer 1: www.zzz.com/info/mydomain.com
UA 1: Mozilla/5.0 (iPod; U; CPU iPhone OS 4_1 like Mac OS X; sv-se) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7
Referer 2: www.whorush.com/search/?q = mydomain.com
UA 2: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SRS_IT_E879027FB2765B543EA993; SRS_IT_E8790576B376555131AE90)
IP: 176.9.87.106
Referer: whois.zzz/mydomain.com
UA: Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.9.0.13) Gecko/2009073022 Firefox/3.5.2 (.NET CLR 3.5.30729) SurveyBot/2.3 (zzz)
Shoot to Kill
... and then there are the really malign robots. These are the ones that ask for php files-- which I haven't got, luckily. One of them did something so outrageous, the host blocked them at the door. This only happens about twice a year. When the requests involve
/admin/sqlpatch.php/password_forgotten.php
you are probably right to be suspicious.
Round off the month with assorted "wp-login", "phpmyadmin" and the obligatory "muieblackcat". These are all blocked in htaccess just on principle. A 403 is more satisfying than a 404.
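When going through logs, requests of this kind can be picked out by name. A minimal Python sketch using only the strings mentioned above; the pattern and function name are made up for the example, and this is a log filter, not the htaccess rule itself. A blanket ".php" test is deliberately left out because it would also catch piwik.php.

```python
import re

# Names taken from the requests described above; matching ignores case.
SUSPICIOUS = re.compile(
    r"wp-login|phpmyadmin|muieblackcat|password_forgotten",
    re.IGNORECASE,
)


def looks_malign(path: str) -> bool:
    """True when a requested path contains one of the known probe names."""
    return SUSPICIOUS.search(path) is not None


for path in (
    "/admin/sqlpatch.php/password_forgotten.php",
    "/phpMyAdmin/scripts/setup.php",
    "/ebooks/index.html",
):
    print(path, looks_malign(path))
```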
One-Offs
If we all listed our one-hit wonders, there would be no end to it. So this is just the highlights.
Stupid Robot (I couldn't think of a better label :))
IP: 159.253.145.175
UA: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0
The month started off with a bang. With 394 hits in 50 seconds I would put it at the top of the ### list ... if it weren't for its mind-boggling, over-the-top, jaw-dropping, have-to-see-it-to-believe-it stupidity.
Oh, all right. Its human programmer's stupidity. In particular, it was completely clueless about what comes after <a. So <a class = "aaa", <a name = "aaa", <a href = "#aaa" and <a rel = "aaa" were all read as <a href = "aaa". This naturally led to a lot of 404s-- and also kept them from following no-follow links.
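For contrast, here is roughly what reading only the href attribute looks like. A short Python sketch built on the standard library's HTMLParser; the class and function names are invented for this example and the markup is a made-up sample:

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags and ignore every other attribute."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html: str) -> list:
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample_html = (
    '<a class="aaa">one</a>'
    '<a name="aaa"></a>'
    '<a href="#aaa">three</a>'
    '<a rel="aaa" href="page.html">four</a>'
)
print(extract_hrefs(sample_html))
```

Only the two real links come back; class, name and rel are never mistaken for addresses.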
Besides ::cough-cough:: it drew my attention to a few overlooked links to /index.html. Also a whole set of relative links, where
../otherdirectory/Filename.html
within /directory/ got interpreted by the robot as
/directory/otherdirectoFilename.html
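The standard way to resolve such a relative link is to join it against the address of the page it appears on. A Python sketch with the standard library's urljoin; www.example.com and page.html are stand-ins, not real names from the site:

```python
from urllib.parse import urljoin

# Hypothetical page inside /directory/ that carries the relative link.
page = "http://www.example.com/directory/page.html"
link = "../otherdirectory/Filename.html"

# ".." climbs out of /directory/ before descending into /otherdirectory/.
resolved = urljoin(page, link)
print(resolved)
```

The result points at /otherdirectory/Filename.html, with no trace of /directory/ left in the path.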
Yahoo! Slurp
IP: 98.137.72.218
UA: Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
This too was on my Ignore list, but has now gone off to join the YahooCacheSystem in htaccess-land. It only stopped by once in the course of the month-- slurping up a complete page with all subsidiary files except, for some reason, piwik.js. robots.txt? Hahaha.
robot from Lightspeedsystems (for want of a better name)
IP: 69.84.207.147
UA for html: Mozilla/4.0 (compatible; MSIE 7.0;Windows NT 5.1;.NET CLR 1.1.4322;.NET CLR 2.0.50727;.NET CLR 3.0.04506.30) Lightspeedsystems
UA for images: Mozilla/4.0 (compatible; MSIE 7.0;Windows NT 5.1;.NET CLR 1.1.4322;.NET CLR 2.0.50727;.NET CLR 3.0.04506.30)
note spacing in both
OK, here I goofed. The range hasn't been around in ages so I unblocked it for the sake of experiment-- and then had to quickly reblock it after the horse was stolen. To really appreciate these guys you'd have to see a screenshot of my processed logs. It's all color-coded; that means stripes, because every single directory had to get redirected from without-slash to with-slash form. (It did not get these links from me.)
And it went through its list in straight order of name length, from the three letters of /fun/ to the ten of /hovercraft/. Only the top-level directories-- a total of six times at intervals of a few hours. Each request was logged as 301 followed by 206. At the end of each set it picked up one or two images. This time for variety's sake it asked for the wrong domain name, so it got a www. redirect instead.
This is a scary robot. I blocked it a long time ago. It comes by every now and then, makes a set of three attempts, and goes away. This time it wasn't blocked, so it stuck around. Robots don't normally do that. They come in with a shopping list and it makes absolutely no difference whether every single request gets a 403. They carry on regardless.
HTTrack
I've complained about them elsewhere, so I won't repeat it here. Ugh, ugh, ugh.
mail.ru
IP: 217.69.133.30
UA: Mail.RU/2.0 (I've also seen .Ru and .ru)
With a UA like that, you'd think they must be up to no good. But really they're harmless. They come by at long intervals-- just once in the course of this month-- ask for robots.txt, one page and all its associated images. The page is one that I made in 1998 and basically haven't touched since. No skin off my nose.
Postscript:
The robots must have been on a post-holiday hiatus. Early this month (February) I got walloped by two of the worst I have ever met.
First came a Bezeq tag team.
IP#1: 62.219.8.228
UA#1: four different quasi-humans at random. I was especially struck by (in full)
Mozilla/6.0 (compatible). News to me.
IP#2: 192.114.71.13
UA#2: blank
The first member of the team raced through and grabbed every single page on the site, including the roboted-out ones, at a rate of about 2 files/second. Well, in fairness, they didn't know the pages were roboted-out, because they didn't stop to read robots.txt. They also picked up all image blowups that have <a href> links (free-standing image, not on a page). I've seen this pattern elsewhere: link rather than filetype.
The second member of the team showed up about 20 minutes after the first one left. They requested, as far as I can tell, all accessible images, at a rate of about 4 files/second. ("Accessible" meaning they're linked from somewhere in <img src> form.)
The good news is that I'd previously met IP#2, so even if they hadn't shown up without a UA, they wouldn't have been allowed in. The bad news is that IP#1 was new to me, so they got everything they asked for. Grr.
A few days later came a robot I don't know anything about; faute de mieux I've got them labeled "Robot from Gal Halevy".
IP: 204.11.219.98
UA: quasi-human
Referer: human-like for all css and images
They made about 1300 requests in 13 minutes. This is not completely outrageous-- so why did my error logs show so many "Client exceeded concurrent connection limit of 30" lockouts? A closer look reveals that it wasn't 13 continuous minutes; it was clumps adding up to less than a minute of actual, onsite time. That's bad. They also had trouble with filenames containing lowline _ which they consistently gave as encoded %5F yielding a handful of 404s. Nyaah, nyaah.
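The difference between the two readings of "1300 requests in 13 minutes" is easy to put into numbers. A back-of-envelope Python sketch; the one-minute figure is taken as an upper bound on the onsite time, so the second rate is a minimum:

```python
def requests_per_second(count: int, seconds: float) -> float:
    """Average request rate over a stretch of time."""
    return count / seconds


# Spread evenly over the whole 13 minutes.
spread_out = requests_per_second(1300, 13 * 60)

# Squeezed into clumps totalling at most one minute.
clumped = requests_per_second(1300, 60)

print(round(spread_out, 2), round(clumped, 2))
```

Under two requests a second is tolerable; more than twenty a second is the kind of burst that runs into a concurrent connection limit.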
And, finally, they made it into piwik.php eighteen separate times (link from piwik.js) though they didn't fool piwik into logging them as human. I have yet to figure out how to block unknown robots from piwik files. Can't do it with THE_REQUEST, because that's how the legitimate ones come in.