* Spambots steal email addresses from our pages
* Bots eat up bandwidth and CPU by downloading the whole damn site at once
* Scraper bots steal our content to put on other websites
The problem is how to identify them. Most bad bots don't announce themselves by saying, "Hi, I'm a bot!" And you can't just whitelist bots you want, like Googlebot, because some bad bots simply impersonate Googlebot.
You could look through your logfile or your server stats to find IPs which have requested lots of pages, but by then the damage has already been done. The bot has already scoured your site.
So my idea is to write a script which will:
* look at the requests for pages as they come in
* see which IPs have been requesting lots of pages within certain timeframes
* check to see if those IPs belong to legitimate bots
* if not, send the visitor to a Turing test page
    * If they pass, set an Access Granted cookie to whitelist them.
    * If not, ban them via .htaccess, sending visitors to a page explaining the ban and how to lift it.
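For the "check to see if those IPs belong to legitimate bots" step, one common approach is a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, check that the hostname ends in a domain you trust, then resolve that hostname back and confirm it returns the same IP. Here's a rough Python sketch of the idea, not my actual Perl code. The function name, the suffix list, and the injectable lookup parameters are all assumptions made for illustration.

```python
import socket

# Assumption: hostname suffixes that a trusted crawler's reverse DNS should end with.
# Extend this tuple for whichever crawlers you want to let through.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_bot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip passes a forward-confirmed reverse DNS check.

    reverse_lookup(ip) -> hostname and forward_lookup(hostname) -> list of IPs
    are hypothetical hooks so the check can be exercised without real DNS.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
        # A spoofed user-agent string can't change what the IP resolves to.
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        # The hostname must resolve back to the same IP, otherwise
        # anyone controlling their own reverse DNS could fake the name.
        return ip in forward_lookup(host)
    except OSError:
        # Lookup failures (no PTR record, timeout) count as "not verified".
        return False
```

Only IPs that fail this check would go on to the Turing test page. DNS lookups are slow, so in practice the results would probably need caching.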
I've already written the first part, which logs page requests, checks for excessive requests, and logs any excessive requests it finds. I'll write the next part soon. But before I do, I thought I'd check in with the WW community to verify that I'm going in the right direction, and to take advice & suggestions on how to make this project better.
I do intend to make my code public when I'm done. I realize that that shows my cards directly to the enemy, but I think it will benefit us more than them.
Here's the logic of my script so far:
The Perl script is called via SSI on any request for any page. (There's a page request about every 4 seconds.)
* Open my custom BotBuster.log
* Put the requests within the last 2 seconds, 1 minute, 1 hour, and 1 day into arrays
* Gather a list of unique IP addresses from the logfile
* Write the BotBuster.log back to the disk, appending the current page request, and skipping all entries more than 1 day old
* Check for excess requests within the last 2 seconds, 1 minute, 1 hour, and 1 day (thresholds are 3, 10, 30, and 50 respectively)
* Add IPs with excessive requests to a BadBots array
* Append those IPs to my BadBots.log, if they're not already in there
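To make the logic above concrete, here's a minimal sketch of the pruning and threshold check in Python (the real script is Perl, so this is an illustration only). The function names and the in-memory list of (timestamp, ip) pairs are assumptions, and I'm assuming "excessive" means strictly more than the threshold.

```python
from collections import defaultdict

# (window in seconds, max requests allowed), using the thresholds from the post:
# 3 per 2 seconds, 10 per minute, 30 per hour, 50 per day.
LIMITS = [(2, 3), (60, 10), (3600, 30), (86400, 50)]


def prune(entries, now, max_age=86400):
    """Drop entries older than max_age seconds. entries is a list of (timestamp, ip)."""
    return [(ts, ip) for ts, ip in entries if now - ts <= max_age]


def find_excessive(entries, now, limits=LIMITS):
    """Return the set of IPs that exceed any (window, threshold) limit."""
    by_ip = defaultdict(list)
    for ts, ip in entries:
        by_ip[ip].append(ts)

    bad = set()
    for ip, stamps in by_ip.items():
        for window, threshold in limits:
            # Count this IP's requests that fall inside the window.
            count = sum(1 for ts in stamps if now - ts <= window)
            if count > threshold:  # assumption: strictly more than the threshold
                bad.add(ip)
                break
    return bad
```

A call would look like `find_excessive(prune(entries, time.time()), time.time())`, with the result appended to BadBots.log for any IP not already listed.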
Right now my script is called via SSI, so it loads only when a *page* is requested, not any other file, and it logs only requests for pages, not other files. I'm conflicted over whether to continue this method, or to have it run periodically as a cronjob on the actual server log. There are pros & cons.
Advantages of running as an SSI, vs. as a cronjob on the server log:
* Logfile is much smaller because only page requests are logged (not graphics, .css, .js, favicon, etc.), so it can be processed much faster and with much less memory. (A "real" logfile is 33 MB; my custom logfile would be only around 4 MB.)
Downsides of running as an SSI, vs. as a cronjob on the server log:
* Can't control when the program runs. It'll run more frequently than necessary when there's heavy traffic.
* Misses any bots that go after images only.
Okay, discuss. :)
After sleeping on it I thought of some improvements.
I can do *both* SSI calls to a script *and* use a cronjob to get the best of both worlds.
1. Whenever a page is requested, an SSI logs the page request, but doesn't look for bots.
2. I have a separate cronjob that analyzes the logfile to check for bots.
This separation means that I'm not bogging down the server by running a bot check on every single page request. I can set the bot check to run every minute, or every 5 minutes, or whatever.
But putting the botcheck call into crontab means that the most frequently I can run it is once a minute. Crontab doesn't seem to support running a program more frequently than that. That's slower than I'd originally planned, but probably good enough.
Another improvement is to keep TWO logfiles: one for requests in the last hour, and another for requests in the last 24 hours. The latter file will be bigger and take more time and memory to process, so it's better if we run the bot check on *that* file infrequently. I'll analyze requests made in the last hour every minute, but requests made in the last 24 hours need only get analyzed every hour or so.
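Here's a rough Python sketch of the two-logfile idea, just to show the shape of it. The names `split_logs` and `windows_due`, and the idea of passing in a counter of cron runs, are assumptions for illustration. The cronjob itself would still fire once a minute.

```python
HOUR = 3600
DAY = 86400


def split_logs(entries, now):
    """Return (hour_log, day_log) from a list of (timestamp, ip) entries.

    Anything older than a day is dropped from both.
    """
    day_log = [(ts, ip) for ts, ip in entries if now - ts <= DAY]
    hour_log = [(ts, ip) for ts, ip in day_log if now - ts <= HOUR]
    return hour_log, day_log


def windows_due(run_number, day_every=60):
    """Given a count of once-a-minute cron runs, say which logs to analyze.

    The small hour log is checked on every run; the big day log only
    on every day_every-th run (60 means roughly once an hour).
    """
    due = ["hour"]
    if run_number % day_every == 0:
        due.append("day")
    return due
```

The point of the split is that the expensive 24-hour pass happens on only one run in sixty, while the cheap one-hour pass still runs every minute.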
I use a honey-pot that immediately blocks the bad bot.
A link to this HP is placed on every page (dynamic PHP pages) and is made unique by using a random number as a parameter: ?id=<?=rand(1,12000)?>
In a browser these links are hidden (display:none;), so a normal user can't click on them.
Visits to a site are either human or bot. A "good" bot is supposed to read robots.txt. HP urls are blocked in robots.txt.
Usually bad bots don't care about robots.txt. It doesn't take more than 10 requests to trigger the trap.
The honey-pot blocks the originating IP at the firewall level, so I don't have to bother to write a script to check against a list.
[webmasterworld.com...]
[webmasterworld.com...]
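The honey-pot described above could be sketched like this in Python (the original uses PHP, so this is only an illustration). The trap path `/trap/`, the function names, and the `block_ip` callback standing in for the firewall rule are all hypothetical.

```python
import random

# Hypothetical honey-pot URL; it must also be disallowed in robots.txt
# so that well-behaved bots never request it.
TRAP_PATH = "/trap/"


def honeypot_link(rng=random):
    """Build a hidden link with a random id (1..12000) so each trap URL is unique."""
    trap_id = rng.randint(1, 12000)
    return '<a href="%s?id=%d" style="display:none">&nbsp;</a>' % (TRAP_PATH, trap_id)


def handle_request(path, ip, blocked, block_ip=None):
    """Block ip if it requested the trap. Return True when ip is blocked.

    blocked is a set of already-blocked IPs; block_ip is an optional
    callback where a real setup would add the firewall rule.
    """
    if path.split("?", 1)[0] == TRAP_PATH and ip not in blocked:
        blocked.add(ip)
        if block_ip is not None:
            block_ip(ip)
    return ip in blocked
```

Every page would include the output of `honeypot_link()`, and the handler for the trap URL would call `handle_request` with the visitor's IP.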
But a problem with this method is that bots get a green light as long as they obey robots.txt. I'm sure there are lots of bots I'd prefer to not have on my site, even if they don't go to disallowed directories.
The next step is then to allow only good bots in your robots.txt.
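As a sketch of what "allow only good bots" might look like, here's a whitelist-style robots.txt checked with Python's standard `urllib.robotparser`. The `/trap/` path and the choice of Googlebot as the only allowed bot are assumptions for illustration, as is the helper name `allowed`.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: Googlebot may crawl everything except the
# (hypothetical) honey-pot directory; every other bot is told to stay out.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /trap/

User-agent: *
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this file, `allowed("Googlebot", "/index.html")` is permitted while any other user-agent is disallowed everywhere. Of course this only tells polite bots what to do; it doesn't enforce anything.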
That still catches only bots that respect robots.txt AND announce themselves correctly. It doesn't catch robots that respect robots.txt and impersonate other bots. Granted, a bot that respects robots.txt probably isn't the same kind of bot that spoofs another bot's name, but it might be worth looking for what I'm actually looking for: lots of requests by bots I don't like.
Thanks for the link to the bot-tracking script. I do already have mine working, though, and I prefer mine because I'm familiar with it and can modify it over time as necessary.