Tips to detect which hits on your bot trap are bad bots

And which are humans clicking the bot trap

         

silverbytes

1:55 pm on Mar 29, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have a bot trap (a directory disallowed in robots.txt) and a visible link (just a dot) pointing to the trap. So I get a report of every visit to the forbidden directory, which should come only from bots violating robots.txt, thus "bad bots".

However, I see that most of my "bad bots" seem to be just curious users (humans clicking). Though I have a manual removal page to "unban" humans, I never got any unbanned IPs, so I guess the humans who clicked my link to the bot trap are not unbanning themselves.

But the question is: how do I tell which ones are really bad bots, based on the reports I get (to avoid banning everybody)?

I get some 40 reports monthly...

Example (I put an X in each IP to avoid upsetting anybody):

A bad robot hit /bot-trap/index.php 2007-03-11 (Sun) 15:34:42
address is X.14.192.9, hostname is X.14.192.9, agent is Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1)

If I perform a reverse DNS lookup, I get
"unable to resolve IP" in some cases.

Other times I get some results like this:

X.48.109.139 resolves to
"X-48-109-139.speedy.com.ar"
Top Level Domain: "com.ar"

That lookup is for this report:
A bad robot hit /bot-trap/index.php 2007-03-11 (Sun) 17:48:38
address is X.48.109.139, hostname is X-48-109-139.speedy.com.ar, agent is Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)

This one seems to me like a common user.

But what things should I consider?

wilderness

5:50 pm on Mar 31, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> I have a bot trap (robots.txt banning a dir) and a visible link (just a dot) pointing to the trap. So I get a report of every visit to the forbidden directory, which should be those bots violating robots.txt, thus "bad bots".

Perhaps you could expand on what you're asking here?

> However I see that most of my "bad bots" seem to be just curious users (humans clicking). Though I have a manual removal to "unban" humans, I never got any unbanned IPs, so I guess humans that clicked my link to the bot trap are not unbanning themselves.

Hogwash!
From your perception, you have interested visitors going through the pages of your website(s) looking for obscure periods/dots (as opposed to blank GIFs) to survey the links, rather than automated bots stumbling across a link that normal, interested visitors do not see as content?

> But the question is: How do I detect which are really bad bots based on the reports I get? (to avoid banning everybody)

> I get some 40 reports monthly...

You compile records and compare these violations with previously accumulated lists of UAs (user agents).
You also survey the IP ranges, and their providers, from which these violations enter your site(s).
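
A minimal sketch of that record-keeping in Python, assuming the reports keep the two-line format quoted in this thread. The function name summarize and the choice to group by the first three octets are illustrative assumptions, not something any poster here described:

```python
import re
from collections import Counter

# Matches the second line of each trap report, e.g.
# "address is 1.2.3.4, hostname is host.example, agent is Mozilla/4.0 (...)"
ENTRY = re.compile(
    r"address is (?P<ip>\S+), hostname is (?P<host>\S+), agent is ?(?P<agent>.*)"
)

def summarize(report_text):
    """Count trap hits per user agent and per range (first three octets)."""
    agents, ranges = Counter(), Counter()
    for match in ENTRY.finditer(report_text):
        # An empty agent string is kept as "" so blank UAs are counted too.
        agents[match["agent"].strip()] += 1
        ranges[".".join(match["ip"].split(".")[:3])] += 1
    return agents, ranges
```

Calling agents.most_common() and ranges.most_common() on the two counters shows which user agents and which ranges keep coming back, which is the list you would then compare against known bad UAs and providers.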

> Example (put some X to ip to avoid upsetting somebody):

> A bad robot hit /bot-trap/index.php 2007-03-11 (Sun) 15:34:42
> address is X.14.192.9, hostname is X.14.192.9, agent is Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; InfoPath.1)

> If I perform a reverse DNS lookup, I get
> "unable to resolve IP" in some cases.

> Other times I get some results like this:

> X.48.109.139 resolves to
> "X-48-109-139.speedy.com.ar"
> Top Level Domain: "com.ar"

> That lookup is for this report:
> A bad robot hit /bot-trap/index.php 2007-03-11 (Sun) 17:48:38
> address is X.48.109.139, hostname is X-48-109-139.speedy.com.ar, agent is Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)

> This one seems to me like a common user.

These are useless examples because you have obfuscated the Class A range (the first octet). The accepted practice in this forum is to obfuscate the Class D range (the last octet).
See Brett's example in this thread:

IP posting guidelines as described by Brett (Msg #:3048609)
[webmasterworld.com...]

But what things should I consider?

You need to determine which visitors are beneficial or detrimental to your own website(s).
In one of the examples you provided, the visitor was from an Argentina-based IP range.
Do you have a market share in Argentina?

Here's some basic reading. Take some time to absorb it and to go through the extensive materials provided in the sub-links of each thread's responses:

Jim offers some extensive insights
[webmasterworld.com...]

Basic same inquiry as yours
[webmasterworld.com...]

[edited by: encyclo at 7:30 pm (utc) on April 18, 2007]
[edit reason] fixed broken link [/edit]

jdMorgan

6:14 pm on Mar 31, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The basic problem is that you've left the bot-trap link visible to humans. Instead of putting the link on a dot, try linking a space, and put the link at the beginning or end of another block-level element, such as <p>.
Alternatively, you can use a transparent 1x1 .gif image, or cloak the link using server-side code.

One of the links above is obsolete and doesn't work. :(

Jim

thetrasher

6:55 pm on Mar 31, 2007 (gmt 0)

10+ Year Member



> Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1)

> This one seems to me like a common user.

That's obviously a bot:
1.) Too many spaces in user agent!
2.) XP without service pack?

GaryK

8:42 pm on Mar 31, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I see approximately 500 new user agents every week. Lots of them are for IE.

There are still plenty of UAs where the platform is XP but there is no token for SP2 (SV1), and yet the browsing pattern clearly indicates it's a human (or a very intelligent piece of software).

To me the absolute dead giveaway is the space before and after the first semicolon.

jdMorgan

9:15 pm on Mar 31, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That's a 'bot, but not a good one. Or it might be a content filter -- No way to tell.

Only the space before the semicolon is invalid. A semicolon should always be followed by a space, if the UA is conforming to Netscape's original UA-format specification.

The only high-profile UA that is non-conformant is MSN. Due to a programming error, when MSN 9.0 is upgraded to MSN 9.1, part of the user-agent string is improperly modified to "MSN 9.0;MSN 9.1;".

Jim
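
The spacing rule Jim describes (no space before a semicolon, one space after it) can be sketched as a quick check in Python. The function name ua_spacing_problems is hypothetical, and a flagged string is only a hint worth a closer look, not proof of a bot:

```python
import re

def ua_spacing_problems(user_agent):
    """Return a list of separator-spacing problems found in a user-agent string."""
    problems = []
    # A space directly before a semicolon is the invalid part.
    if re.search(r"\s;", user_agent):
        problems.append("space before semicolon")
    # A semicolon should be followed by a space, unless it ends the
    # comment (closing parenthesis) or the string itself.
    if re.search(r";(?=[^\s)])", user_agent):
        problems.append("no space after semicolon")
    return problems
```

Run against the agent quoted above, Mozilla/4.0 (compatible ; MSIE 6.0; Windows NT 5.1), it reports the space before the first semicolon; the MSN string "MSN 9.0;MSN 9.1;" trips the second check instead, which is why a known-legitimate exception list is still needed.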

Achernar

12:15 am on Apr 1, 2007 (gmt 0)

10+ Year Member Top Contributors Of The Month



Why don't you style your bot trap links to hide them from humans?
Mine (also a dot :) ) is styled like this: style="display:none;"

wilderness

1:15 am on Apr 1, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Y'all been on an ocean cruise for two days, or have you just returned from being stranded in the Sahara? ;)

A third possibility is that all your internet providers were down simultaneously ;)

Don

youfoundjake

1:49 am on Apr 1, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think that if anything (bot or person) gets caught in my bot trap, they should remain banned. My trap is a clear gif, pointing to a directory blocked by robots.txt. If they trip the trap they get banned, that simple. If someone is looking at my source code and they follow the link, that's on them that they get banned. I'm not going to disable right click or have javascript remove the tool bars, because I'm not that anal about my code. If they are looking at source code information, the presumption can be made that they understand what they are looking at, which means they know that a 1px gif or "hidden link" is used for a bot trap, for tracking, or for whatever other reasons. So if they trigger it by clicking on it, the intent can't be honorable.
Just my two cents. I posted what I did for my trap here in the forums; it bans by IP address, and not by user agent.

silverbytes

3:11 pm on Apr 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thank you all. First, yes, I'm trying to ban bad bots only and not curious people.

> The basic problem is that you've left the bot-trap link visible to humans. Instead of putting the link on a dot, try linking a space, and put the link at the beginning or end of another block-level element, such as <p>.
> Alternatively, you can use a transparent 1x1 .gif image, or cloak the link using server-side code.

I used a clickable element just to avoid being penalized by any search engine, especially Google. Styling the link to be invisible seems risky to me, as does using a transparent 1x1 GIF, don't you think?

What's beneficial and what is not? Anything violating robots.txt should be banned immediately. Unfortunately, humans who click the trap don't unban themselves; only about 3% do.

Achernar

4:09 pm on Apr 3, 2007 (gmt 0)

10+ Year Member Top Contributors Of The Month



If you use an external style sheet, create a rule to hide the bot trap links.

silverbytes

2:55 pm on Apr 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's a new example following netiquette:

Example (I put an X in each IP to avoid upsetting anybody):

A bad robot hit /bot-trap/index.php 2007-03-30 (Fri) 03:51:12
address is 71.88.103.X, hostname is 71-88-103-X.dhcp.oxfr.ma.charter.com, agent is (ends here)

DNS lookup result for the first case:
71.88.103.232 resolves to
"71-88-103-232.dhcp.oxfr.ma.charter.com"
Top Level Domain: "charter.com"

Another:

A bad robot hit /bot-trap/index.php 2007-03-30 (Fri) 17:37:48
address is 168.243.196.X, hostname is ip-cust-sv05074.telefonica-ca.net, agent is Mozilla/4.0 (compatible; MSIE 5.0; Windows NT)

DNS result:
168.243.196.74 resolves to
"ip-cust-sv05074.telefonica-ca.net"
Top Level Domain: "telefonica-ca.net"

wilderness

4:49 pm on Apr 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Example (put some X to ip to avoid upsetting somebody):

> A bad robot hit /bot-trap/index.php 2007-03-30 (Fri) 03:51:12
> address is 71.88.103.X, hostname is 71-88-103-X.dhcp.oxfr.ma.charter.com, agent is (ends here)

> First case dns lookup result:
> 71.88.103.232 resolves to
> "71-88-103-232.dhcp.oxfr.ma.charter.com"
> Top Level Domain: "charter.com"

Golly Gee!
Many thanks for sharing the insight, however!

You still have it wrong!

In the example you provided, and to assist others, the correct format would use three X's for the Class D (#*$!), because that part of the example is a three-digit number, thus:

71-88-103-YYY would be appropriate.

silverbytes

6:55 pm on Apr 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, I'll try to improve that. But what difference does it really make in this case?

wilderness

7:19 pm on Apr 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In this instance there is no difference.
The provider's Class C range, 96-111, is all OXF-MA.

Since we're not allowed to list IP ranges in their complete format, some uniformity in obscuring these numbers would be beneficial to newcomers and to those less experienced at deciphering IP ranges.

Many thanks.

Don

GaryK

6:01 am on Apr 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Since three X's will invoke WebmasterWorld's dirty-words filter (#*$!), I prefer to use N's to indicate the number of digits I've replaced in order to obscure the IP address. :)
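
For anyone who wants to apply that convention automatically before posting, here is a small Python sketch. The function name mask_last_octet is made up for illustration; it simply swaps each digit of the final octet for an N so the digit count stays visible:

```python
def mask_last_octet(ip, filler="N"):
    """Replace each digit of the final octet with `filler`, keeping the digit count."""
    head, _, last = ip.rpartition(".")
    return f"{head}.{filler * len(last)}"
```

With the addresses from this thread, 71.88.103.232 becomes 71.88.103.NNN and 168.243.196.74 becomes 168.243.196.NN.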

incrediBILL

5:39 pm on Apr 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



OK, why are you putting anything on a page someone could click on?

Even a link on a dot is too big and causes accidents if you can tab to that link.

I slap this bit of code at the top of the page. It can't be tabbed to or clicked on because its size of 0 makes it non-existent in the browser, so no 'curious' clicks.

<iframe style="margin: 0px; padding: 0px; border-width: 0px" width="0" height="0" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"><a href="you_stupid_spider.html"></a></iframe>

The only thing you have to check is whether you_stupid_spider.html is being accessed by a Firefox or Google Web Accelerator pre-fetch. Let pre-fetch go untrapped, and everything else gets snared.

To eliminate the pre-fetch problem entirely, just block pre-fetch in the .htaccess file and it's a non-issue. If some browser does pre-fetch and doesn't follow the current pre-fetch standards to let the server know it's pre-fetching and snares a real user, too bad.

Then you're pretty sure anything that hits you_stupid_spider.html should be blocked and the few people that go there after snooping in your HTML source get what they deserve as well for being so nosy.
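
The "let pre-fetch go untrapped" check could look roughly like this in Python. It assumes the pre-fetching client announces itself with an X-moz: prefetch request header, which is how Firefox link prefetching identified itself at the time; whether a given accelerator or browser sends it should be verified against your own logs. The function names and the headers-as-dict shape are illustrative:

```python
def is_prefetch(headers):
    """True if the request announces itself as a prefetch via the X-moz header."""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "x-moz" and value.strip().lower() == "prefetch":
            return True
    return False

def should_snare(path, headers, trap_path="/you_stupid_spider.html"):
    """Snare any hit on the trap page unless it is a declared prefetch."""
    return path == trap_path and not is_prefetch(headers)
```

A client that pre-fetches without sending the header is indistinguishable from a bot here, which matches the "too bad" stance above.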

[edited by: incrediBILL at 5:43 pm (utc) on April 8, 2007]

yodokame

3:07 pm on Apr 10, 2007 (gmt 0)

10+ Year Member



Does IncrediBILL's technique violate Google's hidden link prohibition? We get virtually all our traffic from Google, so even a small misunderstanding with them that gets quickly rectified could cost us.

volatilegx

4:51 pm on Apr 10, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Does IncrediBILL's technique violate Google's hidden link prohibition?

I don't think it does. This type of linking is SEO-neutral. You are simply linking to a file on the same domain.

incrediBILL

5:00 pm on Apr 10, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Does IncrediBILL's technique violate Google's hidden link prohibition?

Why would it?

I specify in robots.txt explicitly:

Disallow: /you_stupid_spider.html

Besides, note the absence of anchor text?

Therefore, it has zero value to the search engine.
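
To see how a well-behaved crawler reads that rule, here is a short Python sketch using the standard library's urllib.robotparser. The User-agent: * line is an assumption (only the Disallow line is quoted above), and the agent name ExampleBot is made up:

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: only the Disallow line comes from the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /you_stupid_spider.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(path, agent="ExampleBot"):
    """Would a compliant crawler with this agent name fetch the path?"""
    return parser.can_fetch(agent, path)
```

A compliant crawler gets False for the trap page and True for everything else, so anything that still requests the trap has ignored robots.txt.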

[edited by: incrediBILL at 5:01 pm (utc) on April 10, 2007]

blend27

2:02 pm on Apr 18, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Why not load a 1x1 image (call it spacerNNN.gif) that is generated on the server every time the page is accessed, and loaded first using a div and CSS? NNN is a random number, so the image is not cached by the browser.

If the user does not load that image and does not support sessions (say, 3 sessions from the same IP with no image), show them an authentication page (it could be something promotional, like "save on widgets if you click here")...

Then it's a bot (or a screen reader).

Then there's just prefetch and AOL left to be dealt with.
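
One way to read that idea, sketched in Python under stated assumptions: the server counts pages served to an IP since it last fetched the spacer image, and challenges the IP once the count reaches 3. The class and method names are hypothetical, and keying on IP alone (rather than sessions) is a simplification:

```python
import secrets
from collections import defaultdict

def spacer_name():
    """Random spacerNNN.gif-style name so the browser cannot reuse a cached copy."""
    return f"spacer{secrets.randbelow(10**6):06d}.gif"

class BeaconTracker:
    """Tracks pages served per IP since that IP last fetched the spacer image."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.pages_without_image = defaultdict(int)

    def page_served(self, ip):
        self.pages_without_image[ip] += 1

    def image_fetched(self, ip):
        # A real browser loads the image, which resets the count.
        self.pages_without_image[ip] = 0

    def needs_challenge(self, ip):
        # Check this at the start of a new request, before serving the page.
        return self.pages_without_image[ip] >= self.threshold
```

As the post notes, screen readers and pre-fetchers will also skip the image, so the challenge page should be something a human can pass rather than an outright ban.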