Forum Moderators: phranque

Block "bot*" bot with .htaccess

         

SunnyCodes

8:10 am on Jun 22, 2015 (gmt 0)

10+ Year Member



Hi,

I have been trying to block one bot from my new site for more than a week with no success. I see the following in my Awstats file:

Unknown robot (identified by 'bot*')


I use the following .htaccess code:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^bot [NC]
RewriteRule .? - [F,L]


For the second line, I tried each of the following lines separately but the bot still kept coming.

RewriteCond %{HTTP_USER_AGENT} ^bot

RewriteCond %{HTTP_USER_AGENT} ^bot*

RewriteCond %{HTTP_USER_AGENT} bot\*

RewriteCond %{HTTP_USER_AGENT} bot[*]


Any ideas how to block that bot?

Thanks.

P.S. I have successfully blocked other bots, only this one cannot be blocked with the code I have tried so far.

whitespace

9:08 am on Jun 22, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Unknown robot (identified by 'bot*')


I would take that to mean "bot" appears anywhere in the user-agent string, followed by some other character. This is not a regex; the * is just a generic wildcard, and I'm pretty sure they are not implying a literal asterisk either. Note that this probably covers several different bots (plural) grouped under one "catch-all", so you might end up blocking more than you bargained for by blocking this generic pattern. Later versions of Awstats seem to be more specific in the description (assuming this refers to the same "bot(s)"):

Unknown robot (identified by 'bot' followed by a space or one of the following characters _+:,.;/\-)


The problem is therefore with your regex.

^bot matches "bot" only at the start of the user-agent; if it occurs elsewhere in the user-agent, it will not match. (The ^ is the start-of-string anchor.)

Likewise ^bot* matches "bot", "bott", "bottt", "botttt", etc. at the start of the user-agent. The asterisk is a special character in a regex: it matches zero or more occurrences of the preceding character.

bot\* and bot[*] both match the literal string "bot*" anywhere in the user-agent, which is along the right lines, but (as mentioned above) I doubt that it is a literal asterisk in the user-agent string that you are trying to match.

So, I'm wondering why you didn't simply try:


RewriteCond %{HTTP_USER_AGENT} bot


That would match "bot" anywhere in the user-agent string. However, this might be a bit too general, so you could change this to match the updated description:


RewriteCond %{HTTP_USER_AGENT} bot[\ _+:,.;/\\-] [NC]


I've thrown in the NC flag in case there might be a "Bot", "bOT" or "BOT", etc. You could replace the escaped space with the \s shorthand (which matches any whitespace character) if it's easier to read.

Or, (maybe too general?) match any non-word character, or the underscore:


RewriteCond %{HTTP_USER_AGENT} bot[\W_] [NC]
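
For reference, here is a minimal sketch of how the character-class condition slots into a complete rule set, using \s in place of the escaped space. It assumes mod_rewrite is enabled and that the rules live in the site's root .htaccess. Bear in mind it still matches any user-agent containing "bot/", "bot-", "bot " and so on, so treat it as a starting point rather than a finished rule:

```apache
RewriteEngine On
# "bot" followed by whitespace or one of _ + : , . ; / \ -
# (\\ is a literal backslash; the trailing - is a literal hyphen)
RewriteCond %{HTTP_USER_AGENT} bot[\s_+:,.;/\\-] [NC]
# Match every request and answer 403 Forbidden (F implies L)
RewriteRule ^ - [F]
```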

[edited by: whitespace at 9:20 am (utc) on Jun 22, 2015]

wilderness

9:15 am on Jun 22, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The problem here is not with the application of regex, but rather with identifying the correct 'bot' from a valid source.

Awstats should NOT be used for this; use the raw access logs, which show the complete and correct UA.

SunnyCodes

11:24 am on Jun 22, 2015 (gmt 0)

10+ Year Member



@whitespace, your explanation of how the regex samples I used work helped me understand them much better, thank you.

I didn't use

RewriteCond %{HTTP_USER_AGENT} bot


because it will also block all good bots (Googlebot, bingbot, YandexBot, etc.).

RewriteCond %{HTTP_USER_AGENT} bot[\ _+:,.;/\\-]


The above will also block those good bots.

@wilderness, thanks for pointing that out. I noticed that my hosting account has an older version of Awstats. When I checked the raw access log, it looked like bot* may be YandexBot or bingbot: they are the only entries that have "bot" in their user agent and are not listed in Awstats by name.

I have removed my bot-blocking code for now and will monitor the raw access log for a couple of days. It seems this was an unnecessary struggle that cost me a lot of time. I guess I will move my site to a more up to date hosting environment.

wilderness

2:55 pm on Jun 22, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I guess I will move my site to a more up to date hosting environment.


If you're referring to another shared host:
The stats software is configured the same way for the entire server, generically, to apply to all customers. There are no custom configurations.

Raw access logs are most reliable.

lucy24

3:47 pm on Jun 22, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



^bot* matches "bot", "bott", "bottt", "botttt", etc. at the start of the user-agent.

Even worse, since it also matches "bo" ;)

bot_ followed by lowline? really? I wouldn't have thought of that, and would just have expressed the rule as "bot\b" except of course you can't, because what about legitimate robots? You have to put in exclusions.
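
A rough sketch of what that could look like with exclusions added. The exclusion list is only an illustrative assumption (the bots named earlier in this thread), not a vetted whitelist:

```apache
RewriteEngine On
# "bot" at a word boundary: matches "bot/", "bot " or "bot" at the very end,
# but not "bot_" or "bots", because _ and letters count as word characters
RewriteCond %{HTTP_USER_AGENT} bot\b [NC]
# Exclusions: let the named crawlers through
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|YandexBot) [NC]
RewriteRule ^ - [F]
```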

whitespace

11:42 pm on Jun 22, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Even worse, since it also matches "bo" ;)


Ha, yes, of course!

because it will also block all good bots (Googlebot, bingbot, YandexBot, etc.).


I think my comment, "you might end up blocking more than you bargained for", was a bit of an understatement!

keyplyr

9:52 am on Jul 6, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



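# Block any UA containing "bot" ...
RewriteCond %{HTTP_USER_AGENT} bot
# ... unless it also contains one of these names (no space after the "!")
RewriteCond %{HTTP_USER_AGENT} !(bing|Google|msn|MSR|Twitter|Yandex)
# Return 403 for every request except robots.txt
RewriteRule !^robots\.txt$ - [F]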

That first line can of course be broadened to cover more threats:
RewriteCond %{HTTP_USER_AGENT} (bot|crawler|scraper|sodomizer|spider)

YMMV :)