Homegrown script to keep bots out of site

We all have our reasons for wanting to keep rogue bots out of our sites. Take your pick:

* Spambots steal email addresses from our pages
* Bots eat up bandwidth and CPU by downloading the whole damn site at once
* Scraper bots steal our content to put on other websites

The problem is how to identify them. Most bad bots don't announce themselves by saying, "Hi, I'm a bot!" And you can't just whitelist bots you want, like Googlebot, because some bad bots simply impersonate Googlebot.

You could look through your logfile or your server stats to find IPs that have requested lots of pages, but by then the damage has already been done. The bot has already scoured your site.

So my idea is to write a script which will:

* look at the requests for pages as they come in
* see which IPs have been requesting lots of pages within certain timeframes
* check to see if those IPs belong to legitimate bots
* if not, send the visitor to a Turing test page
- - - If they pass, set an Access Granted cookie to whitelist them.
- - - If not, ban them via .htaccess, sending visitors to a page explaining the ban and how to lift it.
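
For the "do those IPs belong to legitimate bots" step, one common approach is a reverse DNS lookup followed by a forward lookup to confirm it, which is what Google recommends for verifying Googlebot. Here is a rough sketch in Python (not the Perl of my actual script). The function name, the injectable lookup parameters, and the suffix list are all made up for illustration, and the list only covers Google's crawler domains.

```python
import socket

# Assumption: only Google's documented crawler domains are listed here.
# Add other crawlers' published domains as needed.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, reverse_lookup=socket.gethostbyaddr,
                        forward_lookup=socket.gethostbyname_ex):
    """Return True only if ip reverse-resolves to a trusted hostname
    AND that hostname resolves forward to the same ip."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no PTR record, or the lookup failed
    if not hostname.lower().rstrip(".").endswith(TRUSTED_SUFFIXES):
        return False
    try:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses)
        forward_ips = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in forward_ips
```

The forward lookup matters because anyone can set the reverse DNS for their own IP to claim a googlebot.com hostname, but they can't make Google's DNS point that hostname back at them. Note that `socket.gethostbyname_ex` handles IPv4 only, and that DNS lookups are slow, so this probably belongs only on IPs that have already tripped a threshold.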

I've already written the first part, which logs page requests, checks for excessive requests, and logs any excessive requests it finds. I'll write the next part soon. But before I do, I thought I'd check in with the WW community to verify that I'm going in the right direction, and to take advice & suggestions on how to make this project better.

I do intend to make my code public when I'm done. I realize that that shows my cards directly to the enemy, but I think it will benefit us more than them.

Here's the logic of my script so far:

The Perl script is called via SSI on any request for any page. (There's a page request about every 4 seconds.)

* Open my custom BotBuster.log
* Put the requests within the last 2 seconds, 1 minute, 1 hour, and 1 day into arrays
* Gather a list of unique IP addresses from the logfile
* Write the BotBuster.log back to the disk, appending the current page request, and skipping all entries more than 1 day old
* Check for excess requests within the last 2 seconds, 1 minute, 1 hour, and 1 day (thresholds are 3, 10, 30, and 50 respectively)
* Add IPs with excessive requests to a BadBots array
* Append those IPs to my BadBots.log, if they're not already in there
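
To make the logic above concrete, here is a minimal sketch of the windowed counting in Python (my real script is Perl and works on BotBuster.log on disk, while this keeps everything in memory). The function names are invented for the example, and it assumes "excessive" means strictly more than the threshold.

```python
import time

# (window length in seconds, max requests allowed), from the list above:
# 2 seconds, 1 minute, 1 hour, 1 day -> 3, 10, 30, 50
WINDOWS = [(2, 3), (60, 10), (3600, 30), (86400, 50)]
MAX_AGE = 86400  # entries more than one day old are skipped


def prune(entries, now):
    """Keep only (timestamp, ip) entries from the last day."""
    return [(ts, ip) for ts, ip in entries if now - ts <= MAX_AGE]


def find_excessive(entries, now):
    """Return the set of IPs that exceed the threshold in any window."""
    bad = set()
    for window, limit in WINDOWS:
        counts = {}
        for ts, ip in entries:
            if now - ts <= window:
                counts[ip] = counts.get(ip, 0) + 1
        # Assumption: "excessive" means strictly more than the limit.
        bad.update(ip for ip, n in counts.items() if n > limit)
    return bad


def handle_request(entries, ip, now=None):
    """Prune old entries, append the current request, flag offenders."""
    now = time.time() if now is None else now
    entries = prune(entries, now) + [(now, ip)]
    return entries, find_excessive(entries, now)
```

Each page request would call `handle_request` with the entries loaded from the log and then write the returned list back. If two requests arrive at nearly the same moment, a read-then-rewrite of the same file can lose an entry, so some kind of file locking is probably worth adding.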

Right now my script is called via SSI, so it loads only when a *page* is requested, not any other file, and it logs only requests for pages, not other files. I'm conflicted over whether to continue this method, or to have it run periodically as a cronjob on the actual server log. There are pros & cons.

Advantages of running as an SSI, vs. as a cronjob on the server log:

* The logfile is much smaller because only page requests are logged (not graphics, .css, .js, favicon, etc.), so it can be processed much faster and with much less memory. (A "real" logfile is 33 MB; my custom logfile would be only around 4 MB.)

Downsides of running as an SSI, vs. as a cronjob on the server log:

* Can't control when the program runs. It'll run more frequently than necessary when there's heavy traffic.
* Misses any bots that go after images only.
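
For comparison, the cronjob route would mean parsing the server log and throwing away the non-page requests itself. A rough Python sketch of that follows, assuming the server writes Apache Common or Combined Log Format lines. The extension list and function names are illustrative guesses, not something from my setup.

```python
import re
from datetime import datetime

# Assumption: Apache Common/Combined Log Format, e.g.
# 1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

# Assumption: which extensions count as "not a page".
ASSET_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".ico", ".css", ".js")


def parse_line(line):
    """Return (unix_timestamp, ip, path) or None if the line doesn't match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    path = m.group("path").split("?", 1)[0]  # drop any query string
    return ts.timestamp(), m.group("ip"), path


def is_page(path):
    """True unless the path ends in a known asset extension."""
    return not path.lower().endswith(ASSET_EXTENSIONS)


def page_requests(lines):
    """Yield (timestamp, ip) for page requests only."""
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and is_page(parsed[2]):
            yield parsed[0], parsed[1]
```

Dropping the `is_page` filter would catch the image-only bots, at the cost of processing the full-size log again.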

Okay, discuss. :)

MichaelBluejay