However, I see that most of my "bad bots" seem to be just curious users (humans clicking). Though I have a manual removal option that lets humans "unban" themselves, I never see any unbanned IPs, so I guess the humans who clicked my link to the bot trap are not unbanning themselves.
But the question is: How do I detect which ones are really bad bots, based on the reports I get? (to avoid banning everybody)
I get some 40 reports monthly...
Example (I put some X in the IP to avoid upsetting anybody):
A bad robot hit /bot-trap/index.php 2007-03-11 (Sun) 15:34:42
address is X.14.192.9, hostname is X.14.192.9, agent is Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1)
If I perform a reverse DNS lookup I get
"unable to resolve IP" in some cases
Other times I get some results like this:
X.48.109.139 resolves to
"X-48-109-139.speedy.com.ar"
Top Level Domain: "com.ar"
What is
A bad robot hit /bot-trap/index.php 2007-03-11 (Sun) 17:48:38
address is X.48.109.139, hostname is X-48-109-139.speedy.com.ar, agent is Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)
This one seems to me like a common user.
But what things should I consider?
I have a bot trap (robots.txt banning a dir) and a visible link (just a dot) pointing to the trap. So I get a report of every visit to the forbidden directory, which should be those bots violating robots.txt, thus "bad bots".
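To make the trap logic concrete, here is a minimal sketch of the check a well-behaved crawler is expected to perform before fetching a URL. The robots.txt contents and the `is_allowed` helper are assumptions for illustration; the thread only says that a directory is disallowed and that the trap lives at /bot-trap/index.php.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents: the thread only says the trap directory is banned.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bot-trap/
"""


def is_allowed(path: str, user_agent: str = "*") -> bool:
    """Return True if robots.txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # A compliant robot would skip the first path; a hit on it means the
    # visitor either ignored robots.txt or never read it (e.g. a human).
    for path in ("/bot-trap/index.php", "/index.html"):
        print(path, "allowed" if is_allowed(path) else "disallowed")
```

Note that a human clicking the visible link never consults robots.txt at all, which is why the trap alone cannot tell the two apart.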
Perhaps you may expand on what you're asking here?
However, I see that most of my "bad bots" seem to be just curious users (humans clicking). Though I have a manual removal option that lets humans "unban" themselves, I never see any unbanned IPs, so I guess the humans who clicked my link to the bot trap are not unbanning themselves.
Hogwash!
From your perception, you have interested visitors going through your website's pages looking for obscure periods/dots (as opposed to blank GIFs) to survey the links, rather than automated bots stumbling across a link that normal and interested visitors do not see as content?
But the question is: How do I detect which ones are really bad bots, based on the reports I get? (to avoid banning everybody)
I get some 40 reports monthly...
You compile records and/or compare these violations with previously accumulated lists of UAs, as well as surveying the IP ranges (and their providers) from which these violations enter your site(s).
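As a rough illustration of compiling records, the sketch below parses report lines in the format shown in this thread and tallies hits per user-agent and per IP range (first three octets). The function names and the choice of a three-octet prefix are assumptions, not something the poster prescribes.

```python
import re
from collections import Counter

# Matches the "address is ..., hostname is ..., agent is ..." report line.
REPORT_RE = re.compile(
    r"address is (?P<ip>\S+), hostname is (?P<host>\S+), agent is ?(?P<ua>.*)"
)


def parse_report(line):
    """Return a dict with ip, host and ua, or None if the line is not a report."""
    match = REPORT_RE.search(line)
    if match is None:
        return None
    return {"ip": match["ip"], "host": match["host"], "ua": match["ua"].strip()}


def tally(lines):
    """Count trap hits per user-agent string and per three-octet IP prefix."""
    by_ua = Counter()
    by_range = Counter()
    for line in lines:
        record = parse_report(line)
        if record is None:
            continue
        by_ua[record["ua"]] += 1
        by_range[".".join(record["ip"].split(".")[:3])] += 1
    return by_ua, by_range
```

Repeat offenders then show up as high counts in either table, which can be compared against whatever UA and IP-range lists you have accumulated.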
Example (I put some X in the IP to avoid upsetting anybody):
A bad robot hit /bot-trap/index.php 2007-03-11 (Sun) 15:34:42
address is X.14.192.9, hostname is X.14.192.9, agent is Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1)
If I perform a reverse DNS lookup I get
"unable to resolve IP" in some cases
Other times I get some results like this:
X.48.109.139 resolves to
"X-48-109-139.speedy.com.ar"
Top Level Domain: "com.ar"
What is
A bad robot hit /bot-trap/index.php 2007-03-11 (Sun) 17:48:38
address is X.48.109.139, hostname is X-48-109-139.speedy.com.ar, agent is Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)
This one seems to me like a common user.
These are useless examples because you have obfuscated the Class A range. The accepted practice in this forum is to obfuscate the Class D range.
See Brett's example in this thread:
IP posting guidelines as described by Brett (Msg #:3048609)
[webmasterworld.com...]
But what things should I consider?
Here's some basic reading. You need to take some time absorbing it and going through the extensive materials provided in the sub-links of each thread's responses:
Jim offers some extensive insights
[webmasterworld.com...]
Basic same inquiry as yours
[webmasterworld.com...]
[edited by: encyclo at 7:30 pm (utc) on April 18, 2007]
[edit reason] fixed broken link [/edit]
One of the links above is obsolete and doesn't work. :(
Jim
There are still plenty of UAs where the Platform is XP but there is no token for SP2: SV1, and yet the browsing pattern clearly indicates it's a human (or a very intelligent piece of software).
To me the absolute dead giveaway is the space before and after the first semi-colon.
Only the space before the semicolon is invalid. A semicolon should always be followed by a space, if the UA is conforming to Netscape's original UA-format specification.
The only high-profile UA that is non-conformant is MSN. Due to a programming error, when MSN 9.0 is upgraded to MSN 9.1, part of the user-agent string is improperly modified to "MSN 9.0;MSN 9.1;"
Jim
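The spacing rule described above (no space before a semicolon, a space after it) can be sketched as a small check. The function name is hypothetical, and treating a semicolon directly before a closing parenthesis or at the end of the string as acceptable is my own assumption.

```python
import re


def semicolon_spacing_issues(user_agent):
    """List the ways a UA string breaks the semicolon spacing convention."""
    issues = []
    # A space (or any whitespace) directly before a semicolon is invalid.
    if re.search(r"\s;", user_agent):
        issues.append("space before semicolon")
    # A semicolon should be followed by a space; a semicolon before ")" or
    # at the very end of the string is tolerated here (an assumption).
    if re.search(r";(?=[^\s)])", user_agent):
        issues.append("no space after semicolon")
    return issues
```

A non-empty result is only a hint that something other than a stock browser built the string; as the MSN case shows, it is not proof of a bot.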
The basic problem is that you've left the bot-trap link visible to humans. Instead of a link on a dot, try linking a space instead, and put the link at the beginning or end of another block-level tag, such as <p>.
Alternatively, you can use a transparent 1x1 .gif image, or cloak the link using server-side code.
I used a clickable element just to avoid being penalized by any search engine, especially Google. Styling it to be invisible seems risky to me, as much as using transparent 1x1 GIFs, don't you think?
What's beneficial and what is not? Anything violating robots.txt should be immediately banned. Unfortunately, humans clicking the trap don't unban themselves; just 3% do.
Example (I put some X in the IP to avoid upsetting anybody):
A bad robot hit /bot-trap/index.php 2007-03-30 (Fri) 03:51:12
address is 71.88.103.X, hostname is 71-88-103-X.dhcp.oxfr.ma.charter.com, agent is (ends here)
First case dns lookup result:
71.88.103.232 resolves to
"71-88-103-232.dhcp.oxfr.ma.charter.com"
Top Level Domain: "charter.com"
Another:
A bad robot hit /bot-trap/index.php 2007-03-30 (Fri) 17:37:48
address is 168.243.196.X, hostname is ip-cust-sv05074.telefonica-ca.net, agent is Mozilla/4.0 (compatible; MSIE 5.0; Windows NT)
Dns result:
168.243.196.74 resolves to
"ip-cust-sv05074.telefonica-ca.net"
Top Level Domain: "telefonica-ca.net"
Example (I put some X in the IP to avoid upsetting anybody):
A bad robot hit /bot-trap/index.php 2007-03-30 (Fri) 03:51:12
address is 71.88.103.X, hostname is 71-88-103-X.dhcp.oxfr.ma.charter.com, agent is (ends here)
First case dns lookup result:
71.88.103.232 resolves to
"71-88-103-232.dhcp.oxfr.ma.charter.com"
Top Level Domain: "charter.com"
Golly Gee!
Many thanks for sharing the insight, however!
You still have it wrong!
In the example you provided, and to assist others, the correct format would use three X's for the Class D (#*$!), because the example is a three-digit number, thus:
71-88-103-YYY would be appropriate.
Since we're not allowed to list IP ranges in their complete format, some uniformity in obscuring these numbers would prove beneficial to newcomers and to less experienced decipherers of IP ranges.
Many thanks.
Don
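The masking convention described above (one placeholder character per digit of the last octet) can be sketched as follows. The helper name and the default mask character are assumptions for illustration.

```python
def mask_last_octet(ip, mask_char="X"):
    """Replace each digit of the last octet of a dotted IPv4 address."""
    parts = ip.split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError(f"not a dotted IPv4 address: {ip!r}")
    # One mask character per digit, so the octet's length is preserved.
    parts[3] = mask_char * len(parts[3])
    return ".".join(parts)


if __name__ == "__main__":
    print(mask_last_octet("71.88.103.232"))  # three digits, three X's
```

Remember to apply the same masking to any hostname that embeds the address, otherwise the reverse DNS name gives the full IP away.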
Even a link on a dot is too big and causes accidents if you can tab to that link.
I slap this bit of code at the top of the page and it can't be tabbed to or clicked on because its size of 0 makes it non-existent in the browser, so no 'curious' clicks.
<iframe style="margin: 0px; padding: 0px; border-width: 0px" width="0" height="0" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"><a href="you_stupid_spider.html"></a></iframe>
The only thing you have to check is whether the you_stupid_spider.html page is being accessed by Firefox or Google's Web Accelerator pre-fetch; let pre-fetch go untrapped, and everything else gets snared.
To eliminate the pre-fetch problem entirely, just block pre-fetch in the .htaccess file and it's a non-issue. If some browser does pre-fetch and doesn't follow the current pre-fetch standards to let the server know it's pre-fetching and snares a real user, too bad.
Then you're pretty sure anything that hits you_stupid_spider.html should be blocked and the few people that go there after snooping in your HTML source get what they deserve as well for being so nosy.
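For the "let pre-fetch go untrapped" variant, here is a minimal server-side sketch in Python (not the .htaccess approach mentioned above). It assumes pre-fetch requests announce themselves with an `X-moz: prefetch` request header, which is how Firefox link prefetching identified itself as far as I know; the function names and the headers-as-dict shape are hypothetical.

```python
TRAP_PATH = "/you_stupid_spider.html"


def is_prefetch(headers):
    """Return True if the request headers announce a pre-fetch.

    Assumes the "X-moz: prefetch" convention; header names are compared
    case-insensitively because HTTP header names are not case-sensitive.
    """
    for name, value in headers.items():
        if name.lower() == "x-moz" and value.strip().lower() == "prefetch":
            return True
    return False


def should_ban(path, headers):
    """Ban any hit on the trap page unless it declares itself a pre-fetch."""
    return path == TRAP_PATH and not is_prefetch(headers)
```

A pre-fetcher that does not send the header is indistinguishable from a bot here, which matches the "too bad" stance in the post.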
[edited by: incrediBILL at 5:43 pm (utc) on April 8, 2007]
Does IncrediBILL's technique violate Google's hidden link prohibition?
Why would it?
I specify in robots.txt explicitly:
Disallow: /you_stupid_spider.html
Besides, note the absence of anchor text?
Therefore, it has zero value to the search engine.
[edited by: incrediBILL at 5:01 pm (utc) on April 10, 2007]
If the user does not load that image and does not support sessions (i.e. 3 sessions from the same IP with no image), show them an authentication page (it could be a promotional sort of "save on widgets if you click here")...
then it's a bot (or a screen reader).
Then there's just prefetch and AOL left to be dealt with.
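One possible reading of that heuristic, as a sketch: count consecutive sessions per IP in which the image was never requested, and ask for the authentication page once the count reaches three. The class name, the reset-on-image-load behaviour, and the in-memory storage are all assumptions; the post gives only the threshold.

```python
from collections import defaultdict


class ImageLoadTracker:
    """Track sessions per IP that never requested the image."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self._no_image_sessions = defaultdict(int)

    def record_session(self, ip, loaded_image):
        """Record one finished session for ip."""
        if loaded_image:
            # Assumption: a single image load clears the suspicion.
            self._no_image_sessions[ip] = 0
        else:
            self._no_image_sessions[ip] += 1

    def needs_challenge(self, ip):
        """Return True once ip has reached the no-image session threshold."""
        return self._no_image_sessions.get(ip, 0) >= self.threshold
```

As the post notes, screen readers and text-only browsers also skip images, so a challenge page is a safer response than an outright ban.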