At Home with the Robots

         

lucy24

12:04 am on Feb 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My htaccess was getting fatter and the Ignore list was getting longer. Time for some housekeeping. This means: go back to the beginning and look at all my robots as if I'd never met them before. Within reason, ahem.

At the beginning of January I commented-out the whole "Ignore" list and about half the IP blocks in htaccess. (Keeping the ones that are actively malign, or that come from Belarus or something. I did say "within reason".) A few places I goofed and had to restore the block, but most of them stayed open until the end of the month. After the usual log-wrangling and some extra stuff, I ended up with a bunch of robot information that I normally don't even look at.

Disclaimer: My robots.txt doesn't call anyone by name, except the w3c link checker which has to be allowed in everywhere. (I can't ask it to check links and then refuse to let it in!) I figure bad robots will just ignore robots.txt anyway, so they go straight into htaccess. There's also nothing about crawl-delay. Same reasoning. Finally, most excluded directories are images, so robots don't have a lot of opportunity to violate robots.txt. One html file in a blocked directory is only two links away from the front page. That's the one I look for.
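For reference, here is a minimal Python sketch of how a compliant robot would read a robots.txt of this shape. The file contents, directory names and the `W3C-checklink` token are assumptions for illustration (the real file isn't shown here); only the layout, one named exemption plus a blanket rule for everyone else, follows the description above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: directory names are placeholders.
ROBOTS_TXT = """\
User-agent: W3C-checklink
Disallow:

User-agent: *
Disallow: /images/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(user_agent: str, path: str) -> bool:
    """True if a robot honoring robots.txt may request this path."""
    return parser.can_fetch(user_agent, path)
```

A robot that never asks for the file, or ignores it, is unaffected by any of this, which is why the bad ones go into htaccess instead.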

When I say "MSIE Generic" I mean Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1). It's especially popular with Asian robots.


No Skin off my Nose:

There's Googlebot...

Or, I guess, plural googlebots. Along with the plain googlebot I get the imagebot and the mobilebot.

Google Search:
IP: 66.249.63-95
UAs:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot-Image/1.0
Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

The google family is a bit random about robots.txt. Generally they'll pick it up once or twice a day and-- I assume-- spread the information around the rest of the family.

Faviconbot:
IP: 74.125
UA: blank
If I didn't have a <files> exemption for favicon.ico, the faviconbot would never get in at all.
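The logic of that exemption, sketched in Python rather than htaccess. This is only an illustration of the decision order, assuming a rule that refuses blank user agents everywhere except favicon.ico; the actual htaccess rules aren't shown in this post.

```python
def decide(path: str, user_agent: str) -> int:
    """Return the HTTP status a request would get under a blank-UA block."""
    if path == "/favicon.ico":
        return 200  # exempt: served to everyone, blank UA included
    if not user_agent or user_agent == "-":
        return 403  # blank UA (logged as "-") is refused everywhere else
    return 200
```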

Google Preview:
IP: 66.102.0-15; 74.125; 209.85
UA: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/534.51
Google Preview is not a robot. At least not in its own eyes. It goes everywhere, including roboted-out directories; I've had to htaccess it to keep it from grubbing around in piwik.

Google Translate:
UA: human user's UA, plus ,gzip(gfe) (via translate.google.com)
OR
UA: human user's UA only, plus referer translate.googleusercontent.com/

Translate gets two lines of its own in the anti-hotlinking routine, because there are actual humans at the far end of those g### referers.

and bingbot too...

The single most common robot if you count all requests. Leave out robots.txt and it drops to #2, behind-- nope, guess again.

I always assumed everything coming from the assorted bing/MSN ranges was much of a muchness. WRONG. There are three entirely distinct critters.

IP ranges: 65.52, 157.55, 207.46
plus 65.54.247.145 for BingSiteAuth (BWT)

bingbot
UA: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Harmless apart from its unhealthy appetite for robots.txt files. In the course of the month, something like 80% of all robots.txt requests came from the bingbot. Its record was 102 consecutive hits over a period of about 24 hours.
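A figure like that 80% can be pulled out of raw logs with a few lines of Python. This is a rough sketch, assuming the logs are in Apache's "combined" format; the function name is made up for illustration.

```python
import re
from collections import Counter

# Apache "combined" log format: ip, identity, user, [time], "request",
# status, size, "referer", "user agent".
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"$'
)

def robots_txt_share(lines):
    """Return {user_agent: fraction of all robots.txt requests}."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("path") == "/robots.txt":
            counts[m.group("ua")] += 1
    total = sum(counts.values())
    return {ua: n / total for ua, n in counts.items()} if total else {}
```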

MSN media
IP: 207.46
UA: msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
An impeccably behaved robot. Each visit has three stops: robots.txt, an image file, and the html file that the image belongs to. On the very last day of the month it had a kind of seizure and got twelve consecutive robots.txt files, but otherwise utterly predictable.

The plainclothes MSIE-bot
IP: 207.46
UA: begins with Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1 or 5.2 and then varies at random
Bad robot. Apart from always claiming to be MSIE 7, the UA changes every time. robots.txt? Who, me? I'm just a human using MSIE 7. Each visit has two or three pickups: a random html file, and any non-image subsidiary files like css or js. This generally results in a lockout when it tries to get piwik.js. At the end of the month I blocked this non-robot in htaccess.
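Since the rest of the UA varies, the only stable fingerprint is the IP range plus the opening of the UA string. A Python sketch of that test, assuming "207.46" means the whole 207.46.0.0/16 block (the post gives only the first two numbers):

```python
import ipaddress

# Assumption: "207.46" is read as the entire /16.
MSN_RANGE = ipaddress.ip_network("207.46.0.0/16")
UA_PREFIX = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5."

def is_plainclothes_msie_bot(ip: str, user_agent: str) -> bool:
    """True for a request from the MSN range that poses as MSIE 7 on NT 5.x."""
    return (ipaddress.ip_address(ip) in MSN_RANGE
            and user_agent.startswith(UA_PREFIX))
```

The msnbot-media UA doesn't start with that prefix, so the well-behaved robot in the same range is not caught by this check.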

The rising Yandex star...

IP (Russia): 77.88.30.248; 95.108.151.244; 178.154.143.83
IP (US): 199.21.99.80, ..106
UA:
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)

Busy robot, and currently very well-behaved. If you exclude robots.txt, Yandex was my single most common visitor this month. Like the Googlebot, it picks up robots.txt once or twice a day, and does what it says. The imagebot comes by less often, and always starts with robots.txt. The 95.108. range used to be used by YandexImages only, but has now been taken over by the regular YandexBot while Images use 178.154. But almost all Yandex visits this month came from their new-to-me US range. They never used it before this calendar year.

and the rest...

IA archiver
IP: 174.129.237.157
User-Agent: ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)
IP: 207.241.224.41
User-Agent: ia_archiver(OS-Wayback)

Some people hate this robot on principle, some love it. As a robot it behaves perfectly well. Each separate visit begins with robots.txt. Late in the month they tried out a new robot, ia_archiver(OS-Wayback), from a different IP, 207.241.224.41. The range belongs to TIA but I checked with them and they confirmed it's their robot. I believe this is the only time anyone has ever answered an "Is this your robot?" e-mail. Kudos.

exabot
IP: 193.47.80.81
UA: Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
robots.txt, random file; robots.txt, random file. Yawn.

Ezooms
IP: 208.115.111.72, ..113.88
UA: Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)
Same behavior as exabot. Distinguishing feature: UA that includes a gmail address. Normally blocked because they share an IP with the not-so-nice dotbot. But dotbot hasn't been around in a while. They probably have a wider IP range, but during the whole month they only used those two.

findlinks
IP: 77.20-21; 130.82-84; 139.18
UA for 77.20-21: findlinks/2.0.2 (+http://wortschatz.uni-leipzig.de/findlinks/)
UA for 130.82-84: same except 2.0.4
UA for 139.18: same except 2.1.5
I really don't know what they're up to, but you can't get much more respectable than Leipzig University. Various IPs, all belonging to Leipzig U. Info page says alarmingly "Changes in [robots.txt] are noticed at the latest after 30 days." They don't mean "noticed", they mean "take effect"; the German side says "Änderungen ... wirken sich ... aus". I doubt this is true, though, because they pick up robots.txt every single time.

gigabot
IP: 64.22.106.82
UA: Gigabot/3.0 (http://www.gigablast.com/spider.html)
The name is wishful thinking. This has to be the Most Boring Robot Ever. robots.txt, front page, robots.txt, front page, and so on through the month. I am not a front-driven site; the front page never changes.

MJ12bot
User-Agent: Mozilla/5.0 (compatible; MJ12bot/v1.4.1; [majestic12.co.uk...]
Obvious problem: they refuse to have their own IP, so you can't tell if it's the real thing or a spoofer. They also seem to have a lot of trouble getting names right: constant directory-slash redirects alternating with top-level www redirects. One time they asked for the entire contents of the /hovercraft/ directory, but they forgot to include the directory name. Result: 404, 404, 404.

orangeask
IP: 50.23.239.14
UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729) note the space
Distinguishing feature: uses its name as referer. Lives in an unsavory neighborhood, but its own behavior is almost as boring as gigabot. Shrug.

picsearch
IP: 217.212.224.183 for robots.txt, ..181 for files
User-Agent: psbot/0.1 (+http://www.picsearch.com/bot.html)
I don't know why they use this name. They didn't ask for a single picture all month, just a handful of html files. (They did pick up a picture or two in early February.)

seznambot
IP: 77.75.77.11, .17
UA: SeznamBot/3.0 (+http://fulltext.sblog.cz/)
I don't read Czech so I have no idea what this recent arrival is all about. Showed up in mid-October. UA points to a blog-- in Czech-- which lives in a domain that looks like a standard ISP. But they ask for robots.txt and seem to have a literary bent. Most robots avoid the ebooks directory like the plague; seznam goes straight for it.

YahooCacheSystem
IP: 98.139.241.24n
UA: YahooCacheSystem
At some time in the past this got onto my Ignore list. At month's end I blocked them by htaccess, so they are now on the I Don't Like Your Face side. Every single visit is two stops: front page + favicon. robots.txt? What's that?

Yeti and Baidu (Japan)
IP: 61.247.204 (Yeti), 119.63.196 (Baidu)
UA:
Yeti/1.0 (NHN Corp.; [help.naver.com...]
Baiduspider+(+http://www.baidu.com/search/spider.htm)
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
For some reason I never looked up Yeti until this month. They're from Korea. All they ever do is pick up robots.txt followed by the front page. That puts them in "no skin off my nose" territory. I lumped them together with the Japanese Baidu (same UAs as China, different IP) which hardly ever asks for anything but robots.txt.

The rest of the rest:

Finally there are the visitors who must have some pretensions to robot-dom, because they asked for robots.txt. Most of them went on to ask for the front page and nothing more. Honestly, if that's all they're going to ask for, I don't even particularly care if they got robots.txt or not.


I Don't Like Your Face:

Baidu (China) and the gang:
IP:
Baidu: 123.125.71 (123.112-127); 180.76.5-6 (also 180.149.128-149); 220.181.108
Soso: MSIE Generic 124.115.1.7-8, Sosospider 124.115.6.13 for robots.txt only
Sogou: 220.181.125.68 (220.160-191)
UA:
Baiduspider+(+http://www.baidu.com/search/spider.htm)
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) both same as Baidu-Japan
Baiduspider-image+(+http://www.baidu.com/search/spider.htm)
Sosospider+(+http://help.soso.com/webspider.htm)
Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
Baidu-China and Baidu-Japan use the same two UAs, they just operate from different IPs. I also get the sosospider, sogou, jikespider... Aw, heck, just lock out the whole country.

The sosospider seems to have been demoted-- or purged or whatever they do to robots-- in September. Now it is only allowed to identify itself when it asks for robots.txt. The rest of the time it comes in as MSIE Generic. Conversely, Sogou is embarrassed to be seen asking for robots.txt, so it takes its clothes off.

There are a bunch of other Chinese robots but there's a dreary sameness to them. They hardly ever ask for anything but the front page, and they all without exception wear MSIE Generic. And, like MJ12, they seem to have a terrible time remembering that it's with www. to get in. Or to get locked out, as the case may be.

The Ukrainians
IP: 92.249.0-127, 109.120.128-191, 178.136-137, 193.106.136-139, 213.110.128-159
UA: random
I suppose everyone has Ukrainians. Not necessarily from the Ukraine, of course. ("A John Wayne movie doesn't have to have John Wayne in it.") They used to make me absolutely livid. But I've grown fond of them because, well, they don't do anything. They swing by with a made-up UA and obviously bogus referer (this month's favorite: vampira.ru), pick up three copies of a page that's no earthly use without the accompanying images, and leave.

They've got two basic patterns: the three-hit visit and the six-hit visit.
Three-hit: put on a random UA, make up a referer, ask for three successive copies of Know Your Lion, leave.
Six-hit: same, only this time each request is in pairs: Lion again, and front page. Change UAs between each pair, but keep the same referer.

Somewhere along the line they've picked up a new guilty thrill: spoofing the googlebot. This is so exciting that when they make a six-hit visit they don't change UAs but stick with the ersatz googlebot all the way through. They've also got a variant where the Googlebot string is preceded by \xef\xbb\xbf. Those three bytes are the UTF-8 encoding of U+FEFF (&#xFEFF;), the byte order mark, also known as the "zero width no-break space". In other words, it's supposed to be invisible but doesn't quite work that way.
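The three bytes can be checked directly. A quick Python sketch, using the Googlebot UA string quoted earlier in this post: decoded as UTF-8, \xef\xbb\xbf comes out as the single character U+FEFF, and Python's `utf-8-sig` codec strips it. The function name is just for illustration.

```python
GOOGLEBOT_UA = b"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BOM_PREFIXED = b"\xef\xbb\xbf" + GOOGLEBOT_UA

def strip_bom(raw: bytes) -> str:
    """Decode a logged UA as UTF-8, dropping a leading byte order mark if present."""
    return raw.decode("utf-8-sig")

# Plain UTF-8 decoding keeps the mark as one invisible character, U+FEFF.
first_char = BOM_PREFIXED.decode("utf-8")[0]
```

A UA match anchored at the start of the string would miss the prefixed variant unless the mark is stripped first.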

At one point they arrived from a new IP that I hadn't blocked (the 213.110 range). Obviously the same robot. But they didn't do anything different or ask for any additional files.

ahrefsbot
IP: 213.186.127.6
User-Agent: Mozilla/5.0 (compatible; AhrefsBot/2.0; +http://ahrefs.com/robot/)
I have no idea what, if anything, they're about. I just know that they seem to think robots.txt is non-perishable: about once a month they pick up three copies in a batch, and then carry on regardless. Don't know whether they even read it; they don't dig deeply enough for me to be sure.

facebookexternalhotlink
IP: 69.171.224-255; 66.220.144-159
UA: facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php) or 1.1
I may not have their name exactly right, because I can't pretend I've never seen these guys before. They Just Annoy Me. No robots.txt, of course. They're blocked from all images because their only reason for getting the files is to facilitate hotlinks. They do get a few html files. No idea what they do with them; they're invariably logged as 206. (206 is "Partial Content": the request carried a Range header and the server sent back only the part of the file that was asked for. That's not the same as the "if the file has changed, give it to me" conditional request, which gets a 304 when nothing has changed. But logs don't say what form the request took.)

Odd detail I just noticed: The 1.0 UA string is used for one specific image. The 1.1 version is for everything else. It doesn't coincide with anything obvious like the first time they tried for the file.

oBot
IP: 206.253.224
UA#1: Mozilla/5.0 (compatible; oBot/2.3.1;+http://www-935.ibm.com/services/us/index.wss/detail/iss/a1029077?cntxt=a1027244)
UA#2: Mozilla/5.0 (compatible; oBot/2.3.1; +http://filterdb.iss.net/crawler/)
Remember when everyone was paranoid about IBM? This is their robot. They're apparently returning from a previous visit over a year ago, because they came in asking for a batch of image files that used to exist but no longer do. A week later they came back, picked up all the replacement images, plus one wildly random painting-- and then failed to get a few files belonging to "Three Blind Mice" because they got the casing wrong.

Oh, yes, they did start each visit by asking for robots.txt. But since one of the image files they asked for-- twice-- is in a roboted-out directory, they can't have read it very carefully.

Trendmicro
IP: 150.70 and 216.104.15
User-Agent: MSIE Generic
Some folks don't mind them. I say I don't like their face, even when they're tagging along behind a human. This happens sometimes but not consistently.

websense
IP: 208.80.192-199 (this month they only used ..194)
User-Agent: begins Mozilla/4.0 (compatible; MSIE 6.0; or 7.0; the rest varies
One of the first robots I ever blocked. I think "sense" is a euphemism for "censor". Could not reuse a UA to save its life, even if it's just been redirected and never even got to use the first UA. They have never asked for anything but my front page. That shows how much they annoy me. Generally the front page doesn't count, because there's nothing on it.

They have never visited me from their other IP, 208.87.232-239. It's much worse. During this same month they blasted my art studio's site, picking up every single file-- most of them roboted-out-- in one visit. Luckily they don't read javascript, so they didn't get the full-size images. Oh, and they forgot robots.txt over there too.

auto-referers

Is this the latest robotic trend? My site happens to be laid out so no page ever includes a link to itself. (As a user, I get annoyed and confused by sites that do this. "Uhm, wait, wasn't I on this page already?") But robots like doing it. They probably think-- correctly-- that this makes it easier to slip under the site's radar. I'm ### if I can hammer out the right wording for htaccess. Most requests are for e-books-- which are in the public domain anyway-- so it's definitely The Principle of the Thing.
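The condition itself is easy to state outside htaccess. A Python sketch of the test, offered only as a way to pin down the logic: a request counts as an auto-referer when the Referer header names the very page being requested. Function and parameter names are made up for illustration.

```python
from urllib.parse import urlsplit

def is_auto_referer(host: str, path: str, referer: str) -> bool:
    """True when the Referer header points at the page being requested."""
    if not referer or referer == "-":
        return False
    ref = urlsplit(referer)
    same_host = (ref.hostname or "").lower() == host.lower()
    return same_host and (ref.path or "/") == path
```

On a site where no page links to itself, a human browser should never produce a match. Images and stylesheets are safe because their referer is the page that includes them, not themselves.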

snoops

This batch makes me uneasy because I don't know which ones are legitimately collecting information and which ones are up to no good. Their referer typically has a query involving my sitename, along with something vaguely legit-sounding. About half of them are blocked by IP.

"mydomain.com" is me. I've deleted the http://element from the referer, and replaced one word with "zzz" because apparently you're not allowed to say it.

January's batch is typical:

IP: 62.219.132.115 (twice)
Referer 1: www.we-globe.net/Weblab/SiteCommonGraveReport/mydomain.com/
Referer 2: www.we-globe.net/Weblab/SiteCommonGraveReport/www.mydomain.com/
UA: Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8

IP: 64.246.nn.nn (twice)
Referer: whois.zzz/mydomain.com
UA: Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.9.0.13) Gecko/2009073022 Firefox/3.5.2 (.NET CLR 3.5.30729) SurveyBot/2.3 (zzz)

IP: 77.78.109.76
Referer: www.pagesinventory.com/domain/www.mydomain.com.html
UA: Mozilla/5.0 (Windows; U; Windows NT 5.1; cs; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11

IP: 88.214.193.166 (twice)
Referer 1: www.zzz.com/info/mydomain.com
UA 1: Mozilla/5.0 (iPod; U; CPU iPhone OS 4_1 like Mac OS X; sv-se) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7
Referer 2: www.whorush.com/search/?q = mydomain.com
UA 2: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SRS_IT_E879027FB2765B543EA993; SRS_IT_E8790576B376555131AE90)

IP: 176.9.87.106
Referer: whois.zzz/mydomain.com
UA: Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.9.0.13) Gecko/2009073022 Firefox/3.5.2 (.NET CLR 3.5.30729) SurveyBot/2.3 (zzz)


Shoot to Kill

... and then there are the really malign robots. These are the ones that ask for php files-- which I haven't got, luckily. One of them did something so outrageous, the host blocked them at the door. This only happens about twice a year. When the requests involve
/admin/sqlpatch.php/password_forgotten.php
you are probably right to be suspicious.

Round off the month with assorted "wp-login", "phpmyadmin" and the obligatory "muieblackcat". These are all blocked in htaccess just on principle. A 403 is more satisfying than a 404.
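A rough Python equivalent of that kind of block, using only the names mentioned above. It deliberately does not match every ".php" request, since piwik.php is a real file on this site. The pattern is a sketch, not the actual htaccess rule.

```python
import re

# Substrings taken from the probes named above; matching is case-insensitive.
PROBE = re.compile(r"wp-login|phpmyadmin|muieblackcat|sqlpatch\.php", re.IGNORECASE)

def is_probe(path: str) -> bool:
    """True if the requested path looks like one of the known exploit probes."""
    return PROBE.search(path) is not None
```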


One-Offs

If we all listed our one-hit wonders, there would be no end to it. So this is just the highlights.

Stupid Robot (I couldn't think of a better label :))
IP: 159.253.145.175
UA: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0
The month got off with a bang. With 394 hits in 50 seconds I would put it at the top of the ### list ... if it weren't for its mind-boggling, over-the-top, jaw-dropping, have-to-see-it-to-believe-it stupidity. Oh, all right. Its human programmer's stupidity. In particular, it was completely clueless about what comes after <a. So <a class = "aaa", <a name = "aaa", <a href = "#aaa" and <a rel = "aaa" were all read as <a href = "aaa". This naturally led to a lot of 404s-- and also kept them from following no-follow links.
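For contrast, here is roughly what the robot should have done, sketched in Python with the standard library's HTML parser: read only the href attribute of each <a> tag, skip same-page fragments, and leave nofollow links alone. The class and function names are invented for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect crawlable targets from <a> tags, reading only the href attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href or href.startswith("#"):
            return  # no href (e.g. <a name=...>) or a same-page fragment
        if "nofollow" in (attr.get("rel") or "").lower().split():
            return  # link is marked nofollow
        self.links.append(href)

def extract_links(html: str):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```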

Besides ::cough-cough:: it drew my attention to a few overlooked links to /index.html. Also a whole set of relative links, where
../otherdirectory/Filename.html
within /directory/ got interpreted by the robot as
/directory/otherdirectoFilename.html
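The correct resolution of that relative link is a one-liner in Python's standard library. The host name and page name below are placeholders; only the relative link and directory come from the example above.

```python
from urllib.parse import urljoin

def resolve(page_url: str, link: str) -> str:
    """Resolve a link found on page_url to an absolute URL."""
    return urljoin(page_url, link)

# A page inside /directory/ linking one level up and into a sibling directory.
resolved = resolve("http://www.example.com/directory/page.html",
                   "../otherdirectory/Filename.html")
```

Here `resolved` is `http://www.example.com/otherdirectory/Filename.html`: the ".." removes /directory/ and nothing is trimmed from "otherdirectory".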

Yahoo! Slurp
IP: 98.137.72.218
UA: Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
This too was on my Ignore list, but has now gone off to join the YahooCacheSystem in htaccess-land. It only stopped by once in the course of the month-- slurping up a complete page with all subsidiary files except, for some reason, piwik.js. robots.txt? Hahaha.

robot from Lightspeedsystems (for want of a better name)
IP: 69.84.207.147
UA for html: Mozilla/4.0 (compatible; MSIE 7.0;Windows NT 5.1;.NET CLR 1.1.4322;.NET CLR 2.0.50727;.NET CLR 3.0.04506.30) Lightspeedsystems
UA for images: Mozilla/4.0 (compatible; MSIE 7.0;Windows NT 5.1;.NET CLR 1.1.4322;.NET CLR 2.0.50727;.NET CLR 3.0.04506.30) note spacing in both
OK, here I goofed. The range hasn't been around in ages so I unblocked it for the sake of experiment-- and then had to quickly reblock it after the horse was stolen. To really appreciate these guys you'd have to see a screenshot of my processed logs. It's all color-coded; that means stripes, because every single directory had to get redirected from without-slash to with-slash form. (It did not get these links from me.) And it went through its list in straight order of name length, from the three letters of /fun/ to the ten of /hovercraft/. Only the top-level directories-- a total of six times at intervals of a few hours. Each request was logged as 301 followed by 206. At the end of each set it picked up one or two images. This time for variety's sake it asked for the wrong domain name, so it got a www. redirect instead.

This is a scary robot. I blocked it a long time ago. It comes by every now and then, makes a set of three attempts, and goes away. This time it wasn't blocked, so it stuck around. Robots don't normally do that. They come in with a shopping list and it makes absolutely no difference whether every single request gets a 403. They carry on regardless.

HTTrack
I've complained about them elsewhere, so I won't repeat it here. Ugh, ugh, ugh.

mail.ru
IP: 217.69.133.30
UA: Mail.RU/2.0 (I've also seen .Ru and .ru)
With a UA like that, you'd think they must be up to no good. But really they're harmless. They come by at long intervals-- just once in the course of this month-- ask for robots.txt, one page and all its associated images. The page is one that I made in 1998 and basically haven't touched since. No skin off my nose.


Postscript:

The robots must have been on a post-holiday hiatus. Early this month (February) I got walloped by two of the worst I have ever met.

First came a Bezeq tag team.
IP#1: 62.219.8.228
UA#1: four different quasi-humans at random. I was especially struck by (in full) Mozilla/6.0 (compatible). News to me.
IP#2: 192.114.71.13
UA#2: blank

The first member of the team raced through and grabbed every single page on the site, including the roboted-out ones, at a rate of about 2 files/second. Well, in fairness, they didn't know the pages were roboted-out, because they didn't stop to read robots.txt. They also picked up all image blowups that have <a href> links (free-standing image, not on a page). I've seen this pattern elsewhere: link rather than filetype.

The second member of the team showed up about 20 minutes after the first one left. They requested, as far as I can tell, all accessible images, at a rate of about 4 files/second. ("Accessible" meaning they're linked from somewhere in <img src> form.)

The good news is that I'd previously met IP#2, so even if they hadn't shown up without a UA, they wouldn't have been allowed in. The bad news is that IP#1 was new to me, so they got everything they asked for. Grr.


A few days later came a robot I don't know anything about; faute de mieux I've got them labeled "Robot from Gal Halevy".

IP: 204.11.219.98
UA: quasi-human
Referer: human-like for all css and images

They made about 1300 requests in 13 minutes. This is not completely outrageous-- so why did my error logs show so many "Client exceeded concurrent connection limit of 30" lockouts? A closer look reveals that it wasn't 13 continuous minutes; it was clumps adding up to less than a minute of actual, onsite time. That's bad. They also had trouble with filenames containing lowline _ which they consistently gave as encoded %5F, yielding a handful of 404s. Nyaah, nyaah.

And, finally, they made it into piwik.php eighteen separate times (link from piwik.js) though they didn't fool piwik into logging them as human. I have yet to figure out how to block unknown robots from piwik files. Can't do it with THE_REQUEST, because that's how the legitimate ones come in.
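The "clumps" measurement for the 1300-request robot can be reproduced from request timestamps. A Python sketch, assuming timestamps in seconds and treating any pause longer than `gap` as the end of a burst; the 2-second default is an arbitrary choice for illustration, not a figure from the logs.

```python
def active_seconds(timestamps, gap=2.0):
    """Sum the duration of bursts; a new burst starts after `gap` idle seconds."""
    times = sorted(timestamps)
    if not times:
        return 0.0
    total = 0.0
    start = prev = times[0]
    for t in times[1:]:
        if t - prev > gap:
            total += prev - start  # close the burst that just ended
            start = t
        prev = t
    return total + (prev - start)
```

Comparing this total with the span from first to last request shows whether a robot that looks moderate on average is actually hammering the server in short bursts.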

incrediBILL

1:24 am on Feb 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My htaccess was getting fatter and the Ignore list was getting longer.


That's because you don't whitelist.

Blacklisting is bulky, time consuming and inefficient.

Not to mention it puts a bigger load on the server.

wilderness

6:24 am on Feb 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Bill,
With all due respect.

I simply don't understand this animosity towards black-listing. There are literally thousands of old threads here at Webmaster World for reference.

Whilst, for white-listing there is next to nothing. A few basic guidelines for users to reference.

And yet while yourself and others have had whitelisting in place for some while there was even a reply to my inquiry concerning quotes in this thread [webmasterworld.com] which would almost seem 2nd nature for somebody utilizing whitelisting and being aware of various browsers and the solutions to allow same browsers.

Whether white or black, they both remain effective.

keyplyr

9:18 am on Feb 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I black-list & white-list...

I white-list the 4 major SE bots w/ their assorted programs (UA filtered by verified IP range) and block the rest.
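As a rough Python sketch of "UA filtered by verified IP range": a request claiming a search-engine name is allowed only if it also comes from that engine's range. The ranges are the ones listed in the opening post, with two-number prefixes read as /16 blocks, which is an assumption; only two engines are shown, and the names are made up for illustration.

```python
import ipaddress

def _google(ip):
    # "66.249.63-95" read as 66.249.63.0 through 66.249.95.255 (IPv4 only)
    return (ip.version == 4
            and ipaddress.ip_address("66.249.63.0") <= ip
            <= ipaddress.ip_address("66.249.95.255"))

def _bing(ip):
    return any(ip in ipaddress.ip_network(prefix)
               for prefix in ("65.52.0.0/16", "157.55.0.0/16", "207.46.0.0/16"))

# UA token -> function that checks the claimed robot's address.
WHITELIST = {"googlebot": _google, "bingbot": _bing}

def verdict(ip_text: str, user_agent: str) -> str:
    ip = ipaddress.ip_address(ip_text)
    ua = user_agent.lower()
    for token, ip_ok in WHITELIST.items():
        if token in ua:
            # Right name from the wrong address means a spoofer.
            return "allow" if ip_ok(ip) else "block"
    return "unlisted"  # not claiming a whitelisted name; other rules decide
```

With this check, a googlebot UA arriving from one of the Ukrainian ranges gets "block" even though the UA string is letter-perfect.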

I UA black-list all *known* scrapers, downloaders, pdf makers, image grabbers/makers, and other file harvesting tools that stand lone or browser-side. Occasionally a new one comes along. These need to be manicured every few months for dead wood.

Server farms, clouds, hosting services, colos, etc are black-listed by IP range.

I also block every known IP range from China :)

lucy24

10:07 am on Feb 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I also block every known IP range from China

Every time I think I've got them all, another one pops up. Just today I found one that's a ### /10. How the bleepity bleep they managed to lie low this long...

I don't understand whitelisting. I have never understood whitelisting. I probably never will understand whitelisting.

my inquiry concerning quotes in this thread [webmasterworld.com]

Give or take an ell ;)

the 4 major SE bots

Someone out there must not like me. I've only got three SEs that I would count as major.

Frank_Rizzo

11:05 am on Feb 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



+1 for whitelisting.

I set up sebastians-pamphlets smart robots script. Google, Bing, Yahoo are the only 3 that get a valid robots page. Anything else is 403'd off.

This is saving me hours each week in bot monitoring.

lucy24

11:39 am on Feb 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm sorry, I'm missing something fundamental and it's making me grumpy. Or maybe it's because it is 3:30 AM and the cat will not let me move. After reading it backward, forward and sideways, and pausing to ponder the fact that g### returns seven consecutive versions of the identical article under different names...

How do you intercept the robots and get them within grabbing distance of your robots.txt-on-steroids in the first place?

keyplyr

1:56 pm on Feb 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Someone out there must not like me. I've only got three SEs that I would count as major.

Googlebot, Yandex, Slurp & Bingbot

...and I used to allow Baiduspider until I made the decision to block China. However I do allow the Japanese version.

Frank_Rizzo

4:04 pm on Feb 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I did not like whitelisting. I assumed that it was wrong. Why would you want to keep bots out? What happens if there is a new bot and you don't know about it? You may be losing customers?

Then one day you realise that frisking, researching, wondering whether to allow or not allow every new bot that comes along is totally pointless. The vast majority of bots are not bots. They are scrapers, scammers, leechers.

Seriously. Run a smart robots php script in place of your robots.txt. Let the true file be read by your favourite crawlers. Stuff the rest of them.
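The idea, reduced to a few lines of Python. This is not the PHP script mentioned above, just a sketch of the same decision: the request for robots.txt is handed to a script, which returns the real rules to a short list of crawlers and a 403 to everyone else. The UA tokens and file contents are assumptions.

```python
FAVOURITES = ("googlebot", "bingbot", "slurp")  # assumed UA tokens

REAL_RULES = "User-agent: *\nDisallow: /private/\n"  # placeholder contents

def serve_robots(user_agent: str):
    """Return (status, body) for a request for /robots.txt."""
    ua = user_agent.lower()
    if any(token in ua for token in FAVOURITES):
        return 200, REAL_RULES
    return 403, ""
```

A UA string alone is easy to spoof, so a real setup would presumably also check where the request comes from before trusting the name.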

incrediBILL

7:52 pm on Feb 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I simply don't understand this animosity towards black-listing.


Pretty funny from the guy who blocks the globe from a US site, you're the ultimate whitelister if ever there was one!

There is no animosity.

Blacklisting is simply a waste of time and resources, both human and server.

That's like saying I have animosity over vinyl records or CDs now that I can carry thousands of MP3s in my pocket and access them instantly vs. a room full (which I had) where you have to dig around for stuff. Nor do I have animosity over books and magazines, which I also had in mass quantity, now that I have a Nook tablet and can carry what would've been tons worth of books in a single small device.

Same with blacklisting: a mass quantity of meaningless junk clogging the system and wasting time, when in reality it's more the handful of things most allow vs. the high quantity of junk disallowed.

It's just a different paradigm that is clearly more efficient than the old way.

Been there, done that, will never do that again.

However, I do have a beef with newspapers.

incrediBILL

8:00 pm on Feb 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Why would you want to keep bots out? What happens if there is a new bot and you don't know about it? You may be losing customers?


Why keep bots out?

Ever spent a day or week chasing down copyright violators and wasting time and money hunting down site owners and sending DMCA requests?

Boring, costly, annoying, aggravating, expensive...

Depends on the bots and what they do.

I let a bunch of bots in, but if they don't identify themselves and display a legit reason to crawl, they don't get in.

How do you find new bots abusing your site?

You find new bots that could be beneficial the same way.

Personally, I have a script and get notified by my site when something new hits that has never been there before so I can see if it's worth letting it access my site.
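The core of such a notifier is small. A Python sketch, not the script described above: keep a set of user agents already seen and report each new one the first time it turns up. How the set is stored and how the notification is sent are left out; the names are hypothetical.

```python
def new_user_agents(seen: set, log_user_agents):
    """Yield each user agent the first time it appears, updating `seen`."""
    for ua in log_user_agents:
        if ua not in seen:
            seen.add(ua)
            yield ua
```

The same pass surfaces both the new abusers and the new bots that might be worth letting in.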

The problem isn't which method is used, blacklisting or whitelisting, as both methods have issues either way. The problem is a lack of good tools to make either easy for the site owner to manage. That landscape is starting to change with better tools already available; they just aren't free.

dstiles

10:59 pm on Feb 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Lucy - I decided about two years ago to put a list of my "allowed" bots in the Alternative forum hereabouts - still haven't found time. :(

Yours is a good list BUT - you give several IP ranges that I would never have permitted or, once blocked, released. Once a server farm has been discovered it's killed. If it's ever switched to DSL in future it'll have to convince me. If there is a good bot on the range then I whitelist the bot WITH the IPs it uses.

I do not have php on my server so ANYTHING that comes in asking for a php extension gets killed. There have been an awful lot of them recently, all as far as I can tell from botnets.
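The rule itself is simple enough to sketch in a few lines of Python. This is only an illustration of the test, not how my server does it; `is_php_probe` is a made-up name, and it only looks at the extension at the very end of the path.

```python
from urllib.parse import urlsplit

def is_php_probe(request_target):
    """True if the requested path ends in .php (query string ignored)."""
    path = urlsplit(request_target).path.lower()
    return path.endswith(".php")

# On a server with no PHP at all, any hit here can safely be refused.
for target in ("/wp-login.php?action=register", "/index.html", "/ADMIN/setup.PHP"):
    print(target, "->", 403 if is_php_probe(target) else "pass")
```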

I've allowed Seznam for some time with no perceived problems. It's a "local" SE for CZ as far as I recall.

I began by allowing YahooCacheSystem but it's killing itself slowly, especially from the Eastern IPs. With no reason to allow bot traffic, since MS provides their SERPS now, I may also kill slurp.

Like you, I kill baidu China and allow baidu Japan. I'm not at all sure they do not share their scans, though. I only allow the JP one because a client has trade from there.

Bill - I cannot see a way (at least using my system) of white-list only. A LOT of bots come with perfectly legit UAs but they are obviously bots. Many come from server farms which, once found, are easy to block (my IP ranges are in MySQL so "file size" isn't much of a problem). I get a lot of botnet junk from servers and broadband so I can't white-list any dynamics as such, only allow a broadband range until one of its IPs mis-behaves, when it gets a 403 for a while or, if persistent, gets added to the Always Ban list.
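Roughly, the logic is as in the Python sketch below. The lists live in memory here rather than in MySQL, and the ban length, the strike threshold and all the names are assumptions picked for the example. The addresses are documentation-only ranges, not real offenders.

```python
import ipaddress
import time

# Stand-in for the "Always Ban" table (server farms and persistent offenders).
ALWAYS_BAN = [ipaddress.ip_network("198.51.100.0/24")]
TEMP_BAN_SECONDS = 3600      # assumed length of "a while"
STRIKES_TO_ALWAYS_BAN = 3    # assumed meaning of "persistent"

temp_bans = {}  # ip string -> time the temporary 403 expires
strikes = {}    # ip string -> number of recorded offences

def report_misbehaviour(ip, now=None):
    """Record an offence: temporary 403 at first, Always Ban if persistent."""
    now = time.time() if now is None else now
    strikes[ip] = strikes.get(ip, 0) + 1
    if strikes[ip] >= STRIKES_TO_ALWAYS_BAN:
        ALWAYS_BAN.append(ipaddress.ip_network(ip))  # single-address network
    else:
        temp_bans[ip] = now + TEMP_BAN_SECONDS

def status_for(ip, now=None):
    """Return the HTTP status this address should get right now."""
    now = time.time() if now is None else now
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in ALWAYS_BAN):
        return 403
    if temp_bans.get(ip, 0) > now:
        return 403
    return 200
```

A broadband address starts out allowed, gets a timed 403 after its first offences, and only lands on the permanent list once it keeps coming back.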

Frank - if you really mean "robots.txt" then you will miss 90% of evil bot traffic. It has to be htaccess or something similar because bad bots never check in.

lucy24

11:37 pm on Feb 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Googlebot, Yandex, Slurp & Bingbot

I was afraid you'd say that. As noted in the OP, I've ended up locking out both YahooCacheSystem and Yahoo! Slurp because I don't care for their behavior and they don't do me any good that I can see.

What happens if there is a new bot and you don't know about it? You may be losing customers

Whole nother thread there ;) If the big sites routinely lock out all but the major, established search engines, then there's a fabulous market in "The best of the rest" searches just waiting to be tapped.

And the day someone scrapes my content is the day I fall dead of shock, so let's call that a non-issue :) That's deliberately excluding the ebooks, which are public domain (out of copyright in the US) and widely available from other sources.

Still looking for someone who can explain to me in words of two syllables how you distinguish a robot from a human in the first place.

you give several IP ranges that I would never have permitted or, once blocked, released.

If you read carefully you'll note that I deliberately didn't distinguish between blocked and un-blocked. I looked at everyone, including the ones who never got anything but a fistful of 403s. I didn't keep before-and-after copies of the htaccess so I don't know how many blocks were restored promptly at month's end. I know I've still got an awful lot commented-out. Most of those are the visitors who once annoyed me terribly but haven't shown their faces since. On the other side is the "Well, OK, let's not overdo it"-- like the Ukrainians, who simply have to be locked out even though they really don't do anything.

wilderness

12:49 am on Feb 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Pretty funny from the guy who blocks the globe from a US site, you're the ultimate whitelister if ever there was one!


Bill,
Websites for me are a second thought.
They merely offer a medium for widgets.

Unfortunately non-US & Can countries (not just bots, but John Q. Public as well) are the worst offenders for plagiarism, and language barriers lead to misinterpretation of a very simple TOS.
FWIW the worst plagiarizers have been the Oceanic countries.

Why keep bots out?

Ever spent a day or week chasing down copyright violators and wasting time and money hunting down site owners and sending DMCA requests?

Boring, costly, annoying, aggravating, expensive...


I'm not sure when I last explained what I do?
For more than a decade, I've been digitizing previously published widget (these particular widgets are a very, very narrow and dying market share) periodicals (most have gone under long ago).
The hours behind a computer, scanner and the periodicals are in the tens of thousands of hours. (For the soothsayers, I do realize original copyright exists, however copyright on the OCR'd digital content also exists, simply because it didn't exist in the past.)

Websites serve as a venue to present select widget content.
In many instances the realization to the widget people that they have located some widget material that either they've never seen or wasn't previously available is a severe temptation.

Very early on with the website, I'd spend all night putting new pages online, only to retire for the evening and realize (after viewing logs) that 3-4 non-North American bots/people/harvesters had grabbed my entire culmination of work in seconds.

At first communication and compliance was attempted. That had the same effect as urinating in the wind.
Denial was the easiest solution.

Simultaneously, I'm open to immediate review to make exceptions for locales outside the borders of the US & CAN.

About a year ago I was introduced to a widget fan in Hungary and it has been most enjoyable. I have other exceptions in the Scandinavian countries, however the Euro IP providers make such broad changes in how they assign IPs to retail customers that it is simply too overwhelming to keep up with (in three days I've seen three different Class A's from my friend in Hungary).

No cache is certainly a necessity as well.

lucy24

2:13 am on Feb 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



copyright on the OCR'd digital content also exists

Uhm... What country are you in? The US doesn't recognize "sweat of the brow" copyright or UK-style stylistic copyright. I've forgotten the correct term. It's a special category for when you reprint pre-existing material or put it into a new medium. The text itself doesn't regain a lost copyright, but people can't come along and photograph your book.

About a year ago I was introduced to a widget fan in Hungary

Hee. I thought I was kidding in that neighboring thread when I said "all of Europe except Hungary". (Yes, I looked up the IP.) I'll bet most people would have assumed you planned the other way around: allow the rest of the continent but block Hungary :)

wilderness

3:21 am on Feb 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Uhm... What country are you in? The US doesn't recognize "sweat of the brow" copyright or UK-style stylistic copyright.


lucy,
recognition is irrelevant to me after tens of thousands of hours. If people or the courts have an issue with my impositions, then they may freely search the www for an alternative (wrots of ruck with that).

My Hungarian friend bonded our relationship some time ago in the following manner.
I had a 1936 article which details a Polish person who was jailed after the Bolshevik Revolution as a traitor because he was an employee of the Czar. The article mentions him returning to Poland (the place of his birth) after having spent his entire life in Russia and having been released from prison.
I found a few references to his Polish publications on the WWW from the 1920s, however nothing after that.
At the time, it was a good assumption that he perished in one of the concentration camps (he was at "Jagellon University, in Krakow").

I pleaded with my then new Hungarian friend and he found some references for after WW II and even the year he passed.

My Hungarian friend thanks me over and over for materials that I send to him.

It's a special category for when you reprint preexisting material or put it into a new medium.


I'm not reprinting anything. Each individual character is OCR'd and recognized. Many characters require editing afterwards. The storage of the finished text creates copyright.

FWIW, these are not modern-day periodicals; most are 50, 70 or even 100 years old. In most instances it is quite rare for another copy of the periodical to exist.

Besides, it's already been proven in the courts that work creates copyright. Even this submission is copyrighted with the properly saved verification.
Just saving a work(s) on your machine creates copyright.
Alas, the ever merry-go-round of copyright ;)

incrediBILL

6:00 am on Feb 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The US doesn't recognize "sweat of the brow" copyright or UK-style stylistic copyright.


Maybe I'm misunderstanding your comment, but in the US the minute you create it, it's copyrighted. Period.

For instance, the minute you take a picture, it's a copyrighted image; nobody can publish it without permission unless they want to be a copyright infringer.

However, there are no legal penalties such as statutory damages unless you file for a legal copyright, just basic copyright infringement only.

wilderness

6:11 am on Feb 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



FWIW Bill, lucy was referring to the previously copyrighted materials that I OCR from older periodicals.

Many of these periodicals and other published works would have expired after the traditional 75-year limit.

However there was an extension (20 years) named the "Sonny Bono Act".
I got you, babe ;)

incrediBILL

8:54 am on Feb 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ah ha... I was speed reading thru WebmasterWorld, context may have slipped through the cracks.

dstiles

9:51 pm on Feb 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Wilderness:
> non-US & Can countries ... are the worst offenders

Not here in UK. Most of the offenders I see here are USA server farms - great swathes of them. Plus, of course, a fair sprinkling of ex-USSR, CN, TW, TH and IN. And DE farms, of course. And a fair number of UK DSL (my own country) who seem to be trying to be clever rather than malicious.

wilderness

10:14 pm on Feb 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



dstiles,
server farms are abundant globally, although the US certainly has more than its fair share.

The Euro countries (including the UK) present a never-ending change of UAs and IPs that are simply too time-consuming (at least for my widgets and their traffic) to keep abreast of (i.e., the effort is more than the potential benefit from the traffic).

lucy24

10:47 pm on Feb 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And DE farms, of course.

As already noted I'm not very strict about locking people out. But if I look up an IP and find the name Hetzner anywhere in there, I tend to throw up my hands and say What the ###, just ban 'em :) More trouble than it's worth, otherwise.

From the user end I do worry a little about people with cast-iron firewalls, because some experimental pinging tells me that everything from my part of the state has to pass through The Planet's nearest server (I think it's in Santa Rosa) to reach the rest of the world. This is literally true. There's one physical cable.

incrediBILL

2:22 am on Feb 14, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Most of the offenders I see here are USA server farms


I agree.

On many of my sites the #1 offenders are US servers, with CN often coming in at #2.

adrian20

6:16 am on Feb 14, 2012 (gmt 0)

10+ Year Member



lucy24, I want to thank you for this post. This reading is absolutely exquisite. It appropriately shows the muscles that hold the knowledge and experience of those who face this battle hidden within the Internet.

This should be mandatory reading for those who want to know more about Apache logs, web hosting services, servers, UAs, and how the Internet itself struggles daily for its existence.

iamzippy

6:15 pm on Mar 10, 2012 (gmt 0)

10+ Year Member



Excellent post. Just about everything in there chimes with me. Especially Websense.

Websense is a self-appointed arbiter of what's good or bad on the Web. It has a customer base from whom it takes real money in exchange for access to its classified ratings database. Its droids obey a common set of rules:

Never ask for robots.txt
Always ask for the home page, and never send a referrer
Never fetch any assets linked-to from the page
Always specify an MSIE UA, of 7.0 or earlier (right down to 1.x)
Never use the same UA twice in succession
Follow 301 permanent redirects but never remember the new location (and switch UAs on-the-fly)
15% or more of the time, let it be known the client is festooned with adware or malware (ZangoToolbar, FunWebProducts, Hotbar)
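Most of those rules can be checked mechanically against one visitor's requests. The Python sketch below is a rough heuristic only, not a confirmed fingerprint: the function names, the request dicts and the two-request minimum are assumptions, and it leaves out the 301 and adware rules.

```python
import re

MSIE_VERSION = re.compile(r"MSIE (\d+)\.")

def is_old_msie(ua):
    """True if the UA claims MSIE 7.0 or earlier."""
    match = MSIE_VERSION.search(ua)
    return match is not None and int(match.group(1)) <= 7

def looks_like_rating_droid(requests):
    """requests: one visitor's hits in order, each a dict with
    'path', 'ua' and 'referrer' keys (hypothetical structure)."""
    if len(requests) < 2:
        return False  # assumption: one hit is too little to judge
    uas = [r["ua"] for r in requests]
    # Only ever the home page: implies no robots.txt and no linked assets.
    only_home_page = all(r["path"] == "/" for r in requests)
    no_referrer = all(r["referrer"] in ("", "-") for r in requests)
    old_msie = all(is_old_msie(ua) for ua in uas)
    # Never the same UA twice in succession.
    ua_always_changes = all(a != b for a, b in zip(uas, uas[1:]))
    return only_home_page and no_referrer and old_msie and ua_always_changes

droid = [
    {"path": "/", "ua": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", "referrer": "-"},
    {"path": "/", "ua": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; FunWebProducts)", "referrer": "-"},
]
human = [
    {"path": "/", "ua": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)", "referrer": "-"},
    {"path": "/style.css", "ua": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)", "referrer": "http://example.com/"},
]
print(looks_like_rating_droid(droid), looks_like_rating_droid(human))
```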

Websense! Die! Die! Die!

And the auto-referrers? Well they're hoping the 'referring' URL is gonna show up somewhere, maybe in your analytics?

Auto-referrers! Die! Die! Die!

Tip o'hat to you for that one.