Up-to-date bot identification?

Reno

9:31 pm on Jan 30, 2007 (gmt 0)

I use a cgi script to register bots that crawl my sites. The one below is hitting me like crazy -- often many times each day:

1/22/2007|66.189.12.110|Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

I have 2 brief questions...

[1] When I see that a bot is not identified, should I automatically assume it is up to no good?

[2] Is there a website that attempts to keep up with bot identification on a daily basis (or as realtime as is reasonably possible)?

I do have a lot of bot ID sites bookmarked, but none of them seem to have that one above so I'm guessing it is pretty new (?).

Thanks for any feedback...

.............................

wilderness

5:58 am on Feb 1, 2007 (gmt 0)

The IP range is from Georgetown, Mass., and is a subnet of Charter.
It's not really a large subnet range.

In the event that you have undesired crawling, or continued crawling that is not identified by User Agent, you're the only one capable of determining whether that activity is beneficial or detrimental to your website(s).

Personally, I'd deny the range if it was crawling my sites.

Romeo

11:27 am on Feb 1, 2007 (gmt 0)

[1] When I see that a bot is not identified, should I automatically assume it is up to no good?

Yes, definitely. At least, a bot has to identify itself properly. Anything else is rude.

[2] Is there a website that attempts to keep up with bot identification on a daily basis (or as realtime as is reasonably possible)?

Too much work. Generally, there are only 3.5 kinds of bots:

[1] those feeding a public search engine like yahoo, google, msn, etc. They are mostly useful, for yourself and for the public.

[2] special bots like copyright checkers. They may obey robots rules and are then formally legitimate, but they are of use to the bot owners only, feeding their business to earn them money while stealing your bandwidth, and are mostly useless to you.

[3] bad bots like content scrapers, mail address harvesters, whatever. Some identify with a name, some are anonymous and hide behind user UAs. Bad. Rude. Annoying. Useless.

[3b] User browser extensions to scrape entire web sites for 'offline browsing' or private content scraping. Rude and useless for a website owner as well.

To recognize bots of category [1] is easy.
And the others can just be blocked without investing too much research effort (they aren't worth that at all, are they?).

Kind regards,
R.

GaryK

6:52 pm on Feb 1, 2007 (gmt 0)

With all due respect, Romeo, your answer is fine for you. It might not be fine for someone else. As webmasters we cannot tell each other what bots to block. We can only relate the facts as we see them and tell people what we're going to do. From there it's up to each webmaster to decide what to block. For example, most people let Yahoo! crawl their sites. I don't. It would be wrong of me to tell Reno to ban Yahoo! just because I hate them and get no traffic from them to warrant the abuse and insults they're constantly heaping upon my websites and myself. :)

incrediBILL

10:59 pm on Feb 1, 2007 (gmt 0)

Too much work.

That's an opinion.

I track everything you mentioned automatically, therefore the only work I had to do was write the script in the first place and watch it do its thing, with a periodic review of the results to make sure something isn't slipping thru the cracks.

wilderness

12:40 am on Feb 2, 2007 (gmt 0)

This forum has long been a place where folks gather and accomplish similar tasks by different methods, and even by different working methods of Rewrites and htaccess.

For the most part my logs are reviewed manually.
This allows me to give the potential widget visitor that I desire the assumption of an established innocence and indifference to a harvester or crawler.

There was a learning curve initially, and there was a curve to both establish and separate the majority of the undesirables from the desirables.

I don't believe today that I would implement, or even take the time to learn, what today might best be described as "second nature".

Don

incrediBILL

2:19 am on Feb 2, 2007 (gmt 0)

There was a learning curve initially, and there was a curve to both establish and separate the majority of the undesirables from the desirables.

Yes there was, but the undesirables also have been going through a learning curve. The problem I see increasingly, which is hard to see manually reviewing a log file as it's meant to be hidden, is how to stop all the undesirables that pretend to be desirables.

Some of the super stealth operations download images, run javascript, they look just like humans to the naked eye reviewing a log file.

However, in real time, those stealth bots can't respond to a simple captcha like "What's 10+4?" or "What color is the sky?" and they just keep asking for dozens or hundreds of pages.

Soon they'll catch on to my tricks and try harder to hide and bypass those as well because the Cyveillance, Lightspeeds and IRS bots of the world really don't want to be seen.

Back to the OP's original question, other than a handful of maybe 10 bots, most everything else is useless which is why I whitelist those 10 and block everything else by default.

Just keep in mind you're still missing bots that don't want to be found, but don't let it stop you from sleeping at night ;)

[edited by: incrediBILL at 2:24 am (utc) on Feb. 2, 2007]
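
For anyone who wants to try the "whitelist a handful, block the rest by default" idea described above, here is a minimal sketch in Python. It is not the script discussed in this thread. The allowlist tokens, the browser prefixes, and the function name `decide` are all assumptions for illustration, and a user agent string alone can be spoofed, so the "watch" result is meant to hand the visitor over to behavioural checks such as the captcha.

```python
# Illustrative allowlist (an assumption): lowercase substrings of crawler
# user agents that the site owner has decided to admit.
ALLOWED_BOT_TOKENS = ("googlebot", "msnbot", "slurp")

# Prefixes that ordinary browsers send; anything else is treated as a tool.
BROWSER_PREFIXES = ("mozilla/", "opera/")


def decide(user_agent):
    """Return 'allow', 'watch', or 'block' for a raw User-Agent header."""
    ua = user_agent.strip().lower()
    # Allowlisted crawlers are checked first, because many of them also
    # start with "Mozilla/".
    if any(token in ua for token in ALLOWED_BOT_TOKENS):
        return "allow"
    # Browser-looking agents are not trusted, only passed on for
    # behavioural checks (request rate, captcha, and so on).
    if ua.startswith(BROWSER_PREFIXES):
        return "watch"
    # Default deny: unknown or empty agents are blocked.
    return "block"
```

The order of the checks matters: testing the allowlist before the browser prefixes keeps an admitted crawler from being lumped in with ordinary browsers.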

wilderness

2:53 am on Feb 2, 2007 (gmt 0)

is how to stop all the undesirables that pretend to be desirables.

With my method (no scripts) it requires knowing my websites and related pages, subjects and links.

Back to the OP's original question, other than a handful of maybe 10 bots, most everything else is useless which is why I whitelist those 10 and block everything else by default.

Bill,
We're almost in agreement here ;) (is that a miracle?)

The exceptions would be if a particular org ran a bot that was widget related (my widgets, not yours).
The bot hits my site and I understand who they are and realize the crawling is beneficial to my sites.
For some reason or another the bot hits another site that is in no way related to the bot's widgets and begins crawling.

Just keep in mind you're still missing bots that don't want to be found, but don't let it stop you from sleeping at night

Actually, in the beginning (RIPE and APNIC) it never stopped me from sleeping; the trouble came when I arose in the morning and viewed my logs!
From that point on my days were crap!

GaryK

7:45 am on Feb 2, 2007 (gmt 0)

I understand who they are and realize the crawling is beneficial to my sites.

Is this something you do in real-time Don?

incrediBILL

8:12 am on Feb 2, 2007 (gmt 0)

With my method (no scripts) it requires knowing my websites and related pages, subjects and links.

No offense, but I started out bot blocking like everyone else before I figured out blacklisting and post-mortem log analysis didn't work. You simply cannot detect stealth bots post-mortem and that much I was able to prove over 12 months ago.

RIPE and APNIC

Those are child's play, the true stealth bots are US or EU corporations like Cyveillance, LightSpeed, or the bots from Picscout (and then some) which to date, I'm the only one that claims to have any evidence of their crawling. I could be wrong, but Picscout is the sneakiest of the bunch and they left fingerprints if I'm right.

The only way to stop the corp clowns is to block hosting companies and punch holes thru that firewall as needed, and I'll chant that mantra until the day they pull my cold blue hands away from the keyboard.

Botnets are another story...

[edited by: incrediBILL at 8:15 am (utc) on Feb. 2, 2007]

wilderness

2:29 pm on Feb 2, 2007 (gmt 0)

Is this something you do in real-time Don?

Gary,
Are you for real ;) ;)

Some days it's more real than others.
I check my logs throughout the days-nights.

Reno

2:33 pm on Feb 2, 2007 (gmt 0)

Personally, I'd deny the range if it was crawling my sites.

Thanks to everyone for the feedback. Based on the advice from wilderness above I decided to start the process of denying these unidentified spiders, so I posted a thread in the Unix/Linux section of the forum to find the best method for doing that, and quickly learned that unless a person has some real expertise (I don't), it is best to approach the htaccess defense with great caution. But I then found out that the cPanel for the hosting service I use has a feature called "Deny IP Manager" that allows me to simply submit every IP address that I want blocked and they add it to the list for me, so I'm not messing with htaccess directly. I copied the suspicious crawler IPs out of the data that the cgi script captured, and one-by-one added them. Hopefully that will at least stop the ones that I can assume are up to no good, without blocking the few that I do care about (Google, MSN, Yahoo, Ask, Gigablast, etc.).

....................................
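
Before submitting a long list of addresses to a deny manager, it can help to check which visitors a given list would actually block. The following is a rough sketch using Python's standard `ipaddress` module. The helper names `build_deny_list` and `is_denied` are made up for illustration, and the `192.0.2.0/24` entry is a documentation range standing in for a real subnet, since the exact bounds of the Charter range are not given in this thread.

```python
import ipaddress


def build_deny_list(entries):
    """Turn single addresses or CIDR ranges into network objects."""
    # strict=False lets an entry such as "192.0.2.7/24" be read as its
    # containing network; a bare address becomes a one-address network.
    return [ipaddress.ip_network(entry, strict=False) for entry in entries]


def is_denied(ip, deny_list):
    """True if the visitor address falls inside any denied network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in deny_list)


# Example: the single address from the first post plus a placeholder range.
deny_list = build_deny_list(["66.189.12.110", "192.0.2.0/24"])
```

Running the addresses of the crawlers you do care about through `is_denied` is a quick way to confirm that a range entry has not swallowed one of them.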

wilderness

2:41 pm on Feb 2, 2007 (gmt 0)

No offense, but I started out bot blocking like everyone else before I figured out blacklisting and post-mortem log analysis didn't work. You simply cannot detect stealth bots post-mortem and that much I was able to prove over 12 months ago.

Bill,
Just because you were unable to determine a particular bot or group of bots doesn't mean everybody has the same incapability. (no pun intended.)
Most everybody knows that I'm quite over-bearing (to put it politely) in my denies.

Perhaps your site(s) just get more traffic than mine and as a result the bulk presents too much mass for such analysis?

The only way to stop the corp clowns is to block hosting companies

Agreed and with a suspicious pattern through my sites and pages, a hosting company would be denied as fast as a colo.

incrediBILL

8:43 pm on Feb 2, 2007 (gmt 0)

Just because you were unable to determine a particular bot or group of bots doesn't mean everybody has the same incapability.

Really?

OK, here's a couple of live ones today, maybe you can tell me what these are:

62.194.97.* [*.upc-h.chello.nl.] requested 109 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

89.98.135.* [unknown] requested 58 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"

89.32.49.* [unknown] requested 56 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Win32)"

24.168.26.* [cpe-24-168-26-*.si.res.rr.com.] requested 63 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

90.195.17.* [*.bb.sky.com.] requested 149 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

HINT: the number of requested pages is what the IP requested while already in CAPTCHA mode. The CAPTCHA was never answered; as a matter of fact, no CAPTCHA answer was even attempted. No human at the controls here.

My speculation is one or more of the following:
a) scraper
b) offline reader
c) downloader
d) scraper via botnet
e) something stealth like a copyright crawler or IRSBOT that doesn't want to be known

YMMV

Of course my capabilities to determine these bots are limited due to high traffic...
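
The figures above come from counting page requests made while an IP was already in CAPTCHA mode. A minimal sketch of that kind of bookkeeping in Python might look like the following. It is not the actual script: the class name `ChallengeTracker`, the threshold value, and the in-memory storage are all assumptions for illustration, and a real site would need persistence and expiry.

```python
from collections import defaultdict


class ChallengeTracker:
    """Counts page requests made by an IP while a challenge is unanswered."""

    def __init__(self, max_unanswered=20):
        self.max_unanswered = max_unanswered  # hypothetical threshold
        self.challenged = set()               # IPs currently in challenge mode
        self.pending = defaultdict(int)       # IP -> pages requested since challenge

    def start_challenge(self, ip):
        self.challenged.add(ip)
        self.pending[ip] = 0

    def record_answer(self, ip, correct):
        # A correct answer releases the IP; a wrong one changes nothing.
        if correct:
            self.challenged.discard(ip)
            self.pending.pop(ip, None)

    def record_page_request(self, ip):
        """Return 'serve', 'challenge', or 'block' for this request."""
        if ip not in self.challenged:
            return "serve"
        self.pending[ip] += 1
        if self.pending[ip] > self.max_unanswered:
            return "block"
        return "challenge"
```

A visitor that keeps requesting pages without ever calling `record_answer` climbs past the threshold and gets blocked, which is the pattern the request counts above describe.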

wilderness

9:05 pm on Feb 2, 2007 (gmt 0)

OK, here's a couple of live ones today, maybe you can tell me what these are:
62.194.97.* [*.upc-h.chello.nl.] requested 109 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
89.98.135.* [unknown] requested 58 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
89.32.49.* [unknown] requested 56 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Win32)"
24.168.26.* [cpe-24-168-26-*.si.res.rr.com.] requested 63 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
90.195.17.* [*.bb.sky.com.] requested 149 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
HINT: the number of requested pages is what the IP requested while already in CAPTCHA mode. The CAPTCHA was never answered; as a matter of fact, no CAPTCHA answer was even attempted. No human at the controls here.
My speculation is one or more of the following:
a) scraper
b) offline reader
c) downloader
d) scraper via botnet
e) something stealth like a copyright crawler or IRSBOT that doesn't want to be known

Can't help ya with these:
RewriteCond %{REMOTE_ADDR} ^(80|82|84|8[6-9])\. [OR]
RewriteCond %{REMOTE_ADDR} ^(90|91)\.

or this
deny from 62.

That leaves 24.168.26.*
I watch everything (like a hawk) that visits me from this provider.
For a long while, I had every IP range they had denied.
If one of their customers acts up again?
I resume the denial!

Looks like I'm useless as "#*$!x on a board" ;)

BTW, didn't you see the "(no pun intended)"?

jdMorgan

9:06 pm on Feb 2, 2007 (gmt 0)

That .ro UA should be easy to get rid of, since "Win32" isn't valid.

Jim
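
The "Win32 isn't valid" observation can be turned into a simple user agent sanity check. Here is a minimal Python sketch. The list of accepted platform tokens is an assumption based on the MSIE user agents that appear in this thread plus a few common ones of the era, and the function name `msie_platform_is_plausible` is invented for illustration; it would need tuning against real logs.

```python
import re

# Assumption: platform tokens seen in genuine MSIE user agents of this era.
KNOWN_PLATFORM = re.compile(r"^(Windows NT \d+\.\d+|Windows 9[58]|Windows CE)$")


def msie_platform_is_plausible(user_agent):
    """Return False when an MSIE-style UA carries no recognised platform token."""
    # Grab the first parenthesised comment, e.g. "compatible; MSIE 6.0; Win32".
    match = re.search(r"\(([^)]*)\)", user_agent)
    if not match or "MSIE" not in match.group(1):
        return True  # not an MSIE-style UA, so this check has no opinion
    fields = [field.strip() for field in match.group(1).split(";")]
    return any(KNOWN_PLATFORM.match(field) for field in fields)
```

With this check, "Mozilla/4.0 (compatible; MSIE 6.0; Win32)" is flagged while the "Windows NT 5.0" and "Windows 98" agents pass, so it only catches the sloppy fakes, not the careful ones.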

incrediBILL

10:23 pm on Feb 2, 2007 (gmt 0)

That .ro UA should be easy to get rid of, since "Win32" isn't valid.

Good point.

I didn't notice that as I filter blocked things into several heaps such as "BAD AGENT", "DATA CENTER", "BLOCKED BOTS", etc. and everything else is typically a valid UA that exhibits bot-like behavior.

Now I have to go check my filter again.... sigh....

BTW, didn't you see the "(no pun intended)"?

Yes, but I thought it would be amusing to trot out a couple of samples, see if anyone was up for the challenge (get it? challenge? captcha? nevermind...), and sure enough JD spotted something that slipped thru a crack!

[edited by: incrediBILL at 10:27 pm (utc) on Feb. 2, 2007]

jdMorgan

10:33 pm on Feb 2, 2007 (gmt 0)

Hah! I had:
RewriteCond %{REMOTE_ADDR} ^(8[0246-9]|9[01])\. [OR]
...pretty much the same, but shorter.

I couldn't come up with a pun, so cannot join in that fun.

Jim
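
To confirm that the shorter pattern really covers the same first octets as the two-line version, the expressions can be compared in Python. This is only a sketch: it reads the forum's broken-bar characters as regex pipes, uses Python's `re` module rather than Apache's regex engine (the syntax used here is common to both), and the helper names are made up.

```python
import re

# The two-condition form, joined by [OR] in the original rules.
LONG_FORM = [re.compile(r"^(80|82|84|8[6-9])\."), re.compile(r"^(90|91)\.")]
# The single-condition form.
SHORT_FORM = re.compile(r"^(8[0246-9]|9[01])\.")


def matches_long(ip):
    return any(pattern.search(ip) for pattern in LONG_FORM)


def matches_short(ip):
    return SHORT_FORM.search(ip) is not None


# Try every possible first octet and collect any disagreement.
differences = [
    octet for octet in range(256)
    if matches_long(f"{octet}.0.0.1") != matches_short(f"{octet}.0.0.1")
]
matched_octets = [o for o in range(256) if matches_short(f"{o}.0.0.1")]
```

Both forms match first octets 80, 82, 84, 86 through 89, 90 and 91, so `differences` comes out empty.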

wilderness

10:33 pm on Feb 2, 2007 (gmt 0)

Yes, but I thought it would be amusing to trot out a couple of samples, see if anyone was up for the challenge

It was ;)
However I've had the end with denied for more than two years.

Thanks to your "trot" (widget keyword) however I may need to make an adjustment, because I saw some 2003 references in my data that contained rather than ends.

GaryK

8:12 pm on Feb 3, 2007 (gmt 0)

I really should wait until after my first espresso before trying to figure out what Don and Jim are discussing. I get the feeling there's a private joke going on and I'm not in on it. ;)

wilderness

8:20 pm on Feb 3, 2007 (gmt 0)

I really should wait until after my first espresso

Else you could just have it tubed in intravenously the entire time you're sleeping ;)