1/22/2007¦66.189.12.110¦Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
[1] When I see that a bot is not identified, should I automatically assume it is up to no good?
[2] Is there a website that attempts to keep up with bot identification on a daily basis (or as realtime as is reasonably possible)?
I do have a lot of bot ID sites bookmarked, but none of them seem to have that one above so I'm guessing it is pretty new (?).
Thanks for any feedback...
.............................
In the event that you have undesired or continued crawling that is not identified by User Agent, you're the only one capable of determining whether that activity is beneficial or detrimental to your website(s).
Personally, I'd deny the range if it was crawling my sites.
[1] When I see that a bot is not identified, should I automatically assume it is up to no good?
Yes, definitely. At least, a bot has to identify itself properly. Anything else is rude.
[2] Is there a website that attempts to keep up with bot identification on a daily basis (or as realtime as is reasonably possible)?
Too much work. Generally, there are only 3.5 kinds of bots:
[1] those feeding a public search engine like yahoo, google, msn, etc. They are mostly useful, for yourself and for the public.
[2] special bots like copyright checkers. They may obey robot rules and are then formally legitimate, but they serve only the bot owners, feeding their business to earn them money and stealing your bandwidth, and are mostly useless to you.
[3] bad bots like content scrapers, mail address harvesters, whatever. Some identify with a name, some are anonymous and hide behind user UAs. Bad. Rude. Annoying. Useless.
[3b] User browser extensions to scrape entire web sites for 'offline browsing' or private content scraping. Rude and useless for a website owner as well.
To recognize bots of category [1] is easy.
And the others can just be blocked without investing too much research effort (they aren't worth that at all, are they?).
Kind regards,
R.
For the most part my logs are reviewed manually.
This allows me to give the potential widget visitor that I desire the assumption of innocence, and indifference to a harvester or crawler.
There was a learning curve initially, and there was a curve to both establish and separate the majority of the undesirables from the desirables.
I don't believe that today I would implement, or even take the time to learn, what might now best be described as "second nature".
Don
There was a learning curve initially, and there was a curve to both establish and separate the majority of the undesirables from the desirables.
Yes there was, but the undesirables also have been going through a learning curve. The problem I see increasingly, which is hard to see manually reviewing a log file as it's meant to be hidden, is how to stop all the undesirables that pretend to be desirables.
Some of the super stealth operations download images and run JavaScript, so they look just like humans to the naked eye reviewing a log file.
However, in real time, those stealth bots can't respond to a simple captcha like "What's 10+4?" or "What color is the sky?" and continue to keep asking for dozens or hundreds of pages.
Soon they'll catch on to my tricks and try harder to hide and bypass those as well because the Cyveillance, Lightspeeds and IRS bots of the world really don't want to be seen.
Back to the OP's original question, other than a handful of maybe 10 bots, most everything else is useless which is why I whitelist those 10 and block everything else by default.
Just keep in mind you're still missing bots that don't want to be found, but don't let it stop you from sleeping at night ;)
[edited by: incrediBILL at 2:24 am (utc) on Feb. 2, 2007]
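To make the whitelist-first idea above a bit more concrete, here is a minimal Python sketch. It is an illustration only, not the poster's actual setup: the token lists and the `classify` function are hypothetical, and since a user agent string can be spoofed, this would only be a first filter in front of behavior checks.

```python
# Hypothetical allowlist: substrings of crawler user agents this site admits.
ALLOWED_BOT_TOKENS = ("googlebot", "slurp", "msnbot")

# Assumed hints that a user agent is a self-declared bot rather than a browser.
BOT_HINTS = ("bot", "crawler", "spider")


def classify(user_agent: str) -> str:
    """Return 'allow', 'browser', or 'block' for a raw User-Agent string."""
    ua = user_agent.strip().lower()
    if not ua:
        return "block"  # no identification at all
    if any(token in ua for token in ALLOWED_BOT_TOKENS):
        return "allow"  # explicitly whitelisted crawler
    if any(hint in ua for hint in BOT_HINTS):
        return "block"  # self-declared bot that is not on the list
    if ua.startswith(("mozilla/", "opera/")):
        return "browser"  # looks like a browser; behavior checks still apply
    return "block"  # everything else is denied by default


if __name__ == "__main__":
    for ua in ("Googlebot/2.1", "SomeScraperBot/1.0", "libwww-perl/5.805"):
        print(ua, "->", classify(ua))
```

The point of the design is that the default branch is "block": a new, unknown bot never needs to be researched or added to a blacklist, it simply fails to match anything.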
is how to stop all the undesirables that pretend to be desirables.
With my method (no scripts) it requires knowing my websites and their related pages, subjects and links.
Back to the OP's original question, other than a handful of maybe 10 bots, most everything else is useless which is why I whitelist those 10 and block everything else by default.
Bill,
We're almost in agreement here ;) (is that a miracle?)
The exceptions would be if a particular org ran a bot that was widget related (my widgets, not yours).
The bot hits my site and I understand who they are and realize the crawling is beneficial to my sites.
For some reason or another the bot hits another site that is in no way related to the bot's widgets and begins crawling.
Just keep in mind you're still missing bots that don't want to be found, but don't let it stop you from sleeping at night
Actually, in the beginning (RIPE and APNIC) never stopped me from sleeping, rather when I arose in the morning and viewed my logs!
From that point on my days were crap!
With my method (no scripts) it requires knowing my websites and their related pages, subjects and links.
No offense, but I started out bot blocking like everyone else before I figured out blacklisting and post-mortem log analysis didn't work. You simply cannot detect stealth bots post-mortem and that much I was able to prove over 12 months ago.
RIPE and APNIC
Those are child's play. The true stealth bots are US or EU corporations like Cyveillance, LightSpeed, or the bots from Picscout (and then some), and to date I'm the only one that claims to have any evidence of their crawling. I could be wrong, but Picscout is the sneakiest of the bunch and they left fingerprints if I'm right.
The only way to stop the corp clowns is to block hosting companies and punch holes thru that firewall as needed, and I'll chant that mantra until the day they pull my cold blue hands away from the keyboard.
Botnets are another story...
[edited by: incrediBILL at 8:15 am (utc) on Feb. 2, 2007]
Personally, I'd deny the range if it was crawling my sites.
....................................
No offense, but I started out bot blocking like everyone else before I figured out blacklisting and post-mortem log analysis didn't work. You simply cannot detect stealth bots post-mortem and that much I was able to prove over 12 months ago.
Bill,
Just because you were unable to determine a particular bot or group of bots doesn't mean everybody has the same incapability. (no pun intended.)
Most everybody knows that I'm quite over-bearing (to put it politely) in my denies.
Perhaps your site(s) just get more traffic than mine, and as a result the bulk presents too much mass for such analysis?
The only way to stop the corp clowns is to block hosting companies
Agreed and with a suspicious pattern through my sites and pages, a hosting company would be denied as fast as a colo.
Just because you were unable to determine a particular bot or group of bots doesn't mean everybody has the same incapability.
Really?
OK, here's a couple of live ones today, maybe you can tell me what these are:
62.194.97.* [*.upc-h.chello.nl.] requested 109 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
89.98.135.* [unknown] requested 58 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
89.32.49.* [unknown] requested 56 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Win32)"
24.168.26.* [cpe-24-168-26-*.si.res.rr.com.] requested 63 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
90.195.17.* [*.bb.sky.com.] requested 149 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
HINT: the number of requested pages is what the IP requested while already in CAPTCHA mode. The CAPTCHA was never answered; as a matter of fact, no CAPTCHA answer was even attempted. No human at the controls here.
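For anyone wanting to try the same kind of check, the idea in the hint can be sketched in a few lines of Python. This is only a sketch under assumptions: the `ChallengeTracker` class, its method names and the `page_limit` threshold are all hypothetical, and nothing here is taken from the poster's real system.

```python
from dataclasses import dataclass


@dataclass
class ChallengeState:
    pages_requested: int = 0    # pages asked for while the challenge was pending
    answers_attempted: int = 0  # any submitted answer, right or wrong


class ChallengeTracker:
    """Counts what each IP does after it has been put into CAPTCHA mode."""

    def __init__(self, page_limit: int = 50):
        self.page_limit = page_limit  # hypothetical threshold, tune per site
        self.states = {}

    def _state(self, ip: str) -> ChallengeState:
        return self.states.setdefault(ip, ChallengeState())

    def record_page_request(self, ip: str) -> None:
        self._state(ip).pages_requested += 1

    def record_answer_attempt(self, ip: str) -> None:
        self._state(ip).answers_attempted += 1

    def looks_automated(self, ip: str) -> bool:
        """True when an IP keeps requesting pages but never tries to answer."""
        state = self.states.get(ip)
        if state is None:
            return False
        return state.answers_attempted == 0 and state.pages_requested >= self.page_limit


if __name__ == "__main__":
    tracker = ChallengeTracker(page_limit=50)
    for _ in range(109):  # the first sample above requested 109 pages
        tracker.record_page_request("62.194.97.*")
    print(tracker.looks_automated("62.194.97.*"))
```

The signal is the combination of the two counters: a person who hits the challenge either answers it or leaves, while a script just keeps asking for the next page.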
My speculation is one or more of the following:
a) scraper
b) offline reader
c) downloader
d) scraper via botnet
e) something stealth like a copyright crawler or IRSBOT that doesn't want to be known
YMMV
Of course my capabilities to determine these bots are limited due to high traffic...
OK, here's a couple of live ones today, maybe you can tell me what these are:
62.194.97.* [*.upc-h.chello.nl.] requested 109 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
89.98.135.* [unknown] requested 58 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
89.32.49.* [unknown] requested 56 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Win32)"
24.168.26.* [cpe-24-168-26-*.si.res.rr.com.] requested 63 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
90.195.17.* [*.bb.sky.com.] requested 149 pages as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
HINT: the number of requested pages is what the IP requested while already in CAPTCHA mode. The CAPTCHA was never answered; as a matter of fact, no CAPTCHA answer was even attempted. No human at the controls here.
My speculation is one or more of the following:
a) scraper
b) offline reader
c) downloader
d) scraper via botnet
e) something stealth like a copyright crawler or IRSBOT that doesn't want to be known
Can't help ya with these:
RewriteCond %{REMOTE_ADDR} ^(80|82|84|8[6-9])\. [OR]
RewriteCond %{REMOTE_ADDR} ^(90|91)\.
or this
deny from 62.
That leaves 24.168.26.*
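As a quick sanity check of that claim, the rules quoted above can be replayed against the five sample addresses in Python. This is a rough sketch, assuming the broken-bar characters in the posted patterns stand for the regex alternation pipe and that `deny from 62.` behaves as a simple first-octet prefix match. The names `is_denied` and `SAMPLES` are made up for the example.

```python
import re

# Python equivalents of the two RewriteCond patterns quoted above.
FIRST_OCTET_PATTERNS = [
    re.compile(r"^(80|82|84|8[6-9])\."),
    re.compile(r"^(90|91)\."),
]

# Mirrors "deny from 62." as a plain string prefix (assumption).
DENY_PREFIXES = ["62."]

# The masked addresses from the sample list; only the leading octets matter here.
SAMPLES = [
    "62.194.97.*",
    "89.98.135.*",
    "89.32.49.*",
    "24.168.26.*",
    "90.195.17.*",
]


def is_denied(ip: str) -> bool:
    """True if the address is caught by the prefix deny or either pattern."""
    if any(ip.startswith(prefix) for prefix in DENY_PREFIXES):
        return True
    return any(pattern.match(ip) for pattern in FIRST_OCTET_PATTERNS)


not_covered = [ip for ip in SAMPLES if not is_denied(ip)]
print(not_covered)  # ['24.168.26.*']
```

Blocking whole first octets like this is very broad, which fits the "over-bearing" description given earlier in the thread. It trades false positives for not having to chase individual ranges.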
I watch everything (like a hawk) that visits me from this provider.
For a long while, I had every IP range they had denied.
If one of their customers acts up again?
I resume the denial!
Looks like I'm useless as "#*$!x on a board" ;)
BTW, didn't you see the "(no pun intended)"?
That .ro UA should be easy to get rid of, since "Win32" isn't valid.
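Taking the claim about "Win32" at face value, a check along these lines could flag that kind of user agent automatically. This is a minimal Python sketch, not a complete validator: the `KNOWN_PLATFORM` pattern is an assumed, non-exhaustive list of platform tokens, and the function names are hypothetical.

```python
import re

# Assumed, non-exhaustive platform tokens seen in genuine MSIE user agents.
KNOWN_PLATFORM = re.compile(r"Windows NT \d+\.\d+|Windows 9[58]|Windows CE")


def comment_tokens(user_agent: str) -> list:
    """Split the first parenthesized section of a UA into trimmed tokens."""
    match = re.search(r"\(([^)]*)\)", user_agent)
    if not match:
        return []
    return [tok.strip() for tok in match.group(1).split(";") if tok.strip()]


def has_plausible_msie_platform(user_agent: str) -> bool:
    """False when a UA claims MSIE but carries no recognized platform token."""
    tokens = comment_tokens(user_agent)
    if not any(tok.startswith("MSIE ") for tok in tokens):
        return True  # not claiming to be MSIE, so this check does not apply
    return any(KNOWN_PLATFORM.fullmatch(tok) for tok in tokens)


if __name__ == "__main__":
    for ua in (
        "Mozilla/4.0 (compatible; MSIE 6.0; Win32)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)",
    ):
        print(has_plausible_msie_platform(ua), ua)
```

A check like this only catches sloppy fakes. The other four samples in the list carry perfectly ordinary platform tokens and would pass it.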
Good point.
I didn't notice that as I filter blocked things into several heaps such as "BAD AGENT", "DATA CENTER", "BLOCKED BOTS", etc. and everything else is typically a valid UA that exhibits bot-like behavior.
Now I have to go check my filter again.... sigh....
BTW, didn't you see the "(no pun intended)"?
Yes, but I thought it would be amusing to trot out a couple of samples, see if anyone was up for the challenge (get it? challenge? captcha? nevermind...), and sure enough JD spotted something that slipped thru a crack!
[edited by: incrediBILL at 10:27 pm (utc) on Feb. 2, 2007]
Yes, but I thought it would be amusing to trot out a couple of samples, see if anyone was up for the challenge
It was ;)
However I've had the end with denied for more than two years.
Thanks to your "trot" (widget keyword) however I may need to make an adjustment, because I saw some 2003 references in my data that contained rather than ends.