.htaccess block for MJ12bot

Forum Moderators: phranque

classifieds

10:16 am on Mar 18, 2008 (gmt 0)

I keep getting hit by multiple scrapers using "MJ12bot." I've added the UA string to my .htaccess filters but for some reason it's not matching.

For example, the httpd log file contains

220.246.54.157 - - [18/Mar/2008:05:11:32 -0400] "GET /chantilly_va-rs4359/ HTTP/1.1" 200 17577 "http://www.domain.com/chantilly_va-rs4359" "Mozilla/5.0 (compatible; MJ12bot/v1.2.1; hxxp://www.majestic12.co.uk/bot.php?+)"

and my .htaccess contains:

RewriteCond %{HTTP_USER_AGENT} ^MJ12bo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [NC,OR]
RewriteRule .* - [F]

But it's not matching and I'm forced to use a "deny from" on the IP address range.

Any suggestions or recommendations would be greatly appreciated. These things keep overloading my server and it's taking lots of time away from other projects.

Receptional Andy

10:27 am on Mar 18, 2008 (gmt 0)

I think the problem is that you've start-anchored the pattern, so you would need the UA to start with MJ12bot.

I'm no expert, but try

RewriteCond %{HTTP_USER_AGENT} ^.*MJ12bot

Incidentally, I'm not sure if the bot you're seeing is actually from MJ12 (they've posted about fake versions [majestic12.co.uk]). Still, that bot has been around for years without seeming to offer anything useful to site owners.

[edited by: Receptional_Andy at 10:31 am (utc) on Mar. 18, 2008]

classifieds

10:58 am on Mar 18, 2008 (gmt 0)

Thanks for the suggestion.

I added it so I should know in the next few hours if it works or not.

btw, IMHO the only good bot is a dead bot (minus msn, googlebot and slurp of course).

Receptional Andy

11:19 am on Mar 18, 2008 (gmt 0)

Just change the user agent yourself to test. If you're a Firefox user, try the "User Agent Switcher" add-on. Otherwise, there are online tools that do the same thing.
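If neither a browser add-on nor an online tool is handy, the same test can be scripted. This is only a minimal sketch in Python: the helper names build_probe and status_for are made up for illustration, and www.example.com is a placeholder for a page on your own site. It sends the UA string from the log entry in the first post and reports the status code; a 403 would suggest the block matched.

```python
import urllib.error
import urllib.request

# UA string copied from the log entry in the first post.
SPOOFED_UA = ("Mozilla/5.0 (compatible; MJ12bot/v1.2.1; "
              "hxxp://www.majestic12.co.uk/bot.php?+)")


def build_probe(url, user_agent=SPOOFED_UA):
    """Return a Request that presents the given User-Agent (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for(url, user_agent=SPOOFED_UA):
    """Fetch url and return the HTTP status code (hypothetical helper)."""
    try:
        with urllib.request.urlopen(build_probe(url, user_agent), timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the code is what we want to see here.
        return err.code


# Example (placeholder URL, replace with a page on your own site):
# print(status_for("http://www.example.com/"))
```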

jdMorgan

4:52 pm on Mar 18, 2008 (gmt 0)

Rather than using the "^.*" subpattern, you can just remove the start-anchor:


RewriteCond %{HTTP_USER_AGENT} MJ12bot

This is also true for end-anchors: Instead of matching "something.*$" just use "something" as the pattern.
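The effect of the anchors can be checked outside Apache. The sketch below uses Python's re module as a stand-in for the regex matching done by RewriteCond (an assumption for illustration; Apache uses its own PCRE engine, but anchors behave the same way for these patterns). The helper name matches is made up, and the UA string is the one from the log entry in the first post.

```python
import re

# UA string from the log entry in the first post.
UA = ("Mozilla/5.0 (compatible; MJ12bot/v1.2.1; "
      "hxxp://www.majestic12.co.uk/bot.php?+)")


def matches(pattern, ua=UA):
    """Unanchored, case-insensitive search, similar to RewriteCond with [NC]."""
    return re.search(pattern, ua, re.IGNORECASE) is not None


print(matches(r"^MJ12bot"))    # False: the UA starts with "Mozilla", not "MJ12bot"
print(matches(r"^.*MJ12bot"))  # True: works, but "^.*" adds nothing
print(matches(r"MJ12bot"))     # True: the plain substring pattern is enough
```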

Note that MJ12bot is a legitimate robot which reads and obeys robots.txt. However, it is currently being spoofed by others. There's a post by the owner in our robots.txt forum describing how to distinguish the legitimate MJ12bot from the spoofs.

Jim

Receptional Andy

7:42 pm on Mar 18, 2008 (gmt 0)

Rather than using the "^.*" subpattern, you can just remove the start-anchor

Thanks for the correction, Jim. When I posted mine I had a nagging thought telling me there was a much better way, but I couldn't figure it out at the time ;)

Lord Majestic

8:02 pm on Mar 18, 2008 (gmt 0)

Hi,

That looks like one of ours, though I can't say for sure because the IP does not seem to match exactly. I will need to see the bot's request for robots.txt to be sure.

If you want to stop it crawling your site, all you need to do is add it to robots.txt:

User-agent: MJ12bot
Disallow: /
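To see how a compliant crawler would read those two lines, here is a rough sketch using Python's standard urllib.robotparser. It only models a well-behaved robot (a spoofed bot can simply ignore robots.txt), and the path is the one from the log entry in the first post.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: MJ12bot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# MJ12bot is shut out of everything; agents without a matching record are not.
print(parser.can_fetch("MJ12bot", "/chantilly_va-rs4359/"))    # False
print(parser.can_fetch("Googlebot", "/chantilly_va-rs4359/"))  # True
```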

If you want extra protection, I can add your site to a special no-crawl list; we can discuss it via sticky if you want.

Blocking by IP is not a smart move, and using .htaccess to block bots that support robots.txt is generally not very smart either. At the very least, allow robots.txt to be fetched so that the block can be implemented efficiently.
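The "let robots.txt through" idea amounts to one extra check before the block. The sketch below models that decision in Python rather than in Apache syntax; should_forbid is a made-up helper, not anything Apache provides. In .htaccess the equivalent would presumably be an additional condition along the lines of RewriteCond %{REQUEST_URI} !^/robots\.txt$ ahead of the rule, but treat that as an untested assumption.

```python
import re

# Same unanchored, case-insensitive pattern suggested earlier in the thread.
BLOCKED_UA = re.compile(r"MJ12bot", re.IGNORECASE)


def should_forbid(user_agent, path):
    """Return True when the request should get a 403 (hypothetical helper)."""
    if path == "/robots.txt":
        # Always let robots.txt through so the bot can read the Disallow rule.
        return False
    return BLOCKED_UA.search(user_agent) is not None


ua = "Mozilla/5.0 (compatible; MJ12bot/v1.2.1; hxxp://www.majestic12.co.uk/bot.php?+)"
print(should_forbid(ua, "/robots.txt"))             # False
print(should_forbid(ua, "/chantilly_va-rs4359/"))   # True
print(should_forbid("Mozilla/5.0 (Windows)", "/"))  # False
```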

We did indeed have a problem with fake bots pretending to be us. However, this seems to have stopped in late January 2008, and I have not had any fake bot reports since then.

So, to reiterate: we obey robots.txt, including the Crawl-Delay parameter, which can help slow down crawling. The new bot that is being beta tested now also supports GZIP to reduce bandwidth usage on sites. If you want to block us, please please please use robots.txt - this will be best for you and us.
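For anyone who would rather slow the bot down than block it, a Crawl-Delay record can be sanity-checked the same way as a Disallow. This is a sketch only: the value 10 is an arbitrary example (usually read as seconds between requests), and Python's urllib.robotparser stands in for how a compliant crawler might parse the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: MJ12bot
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The delay applies to MJ12bot only; crawling itself is still allowed.
print(parser.crawl_delay("MJ12bot"))     # 10
print(parser.can_fetch("MJ12bot", "/"))  # True
```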

Edit - Receptional Andy: we released a big index last month that I think is beneficial to site owners. I can't post much about it here, but those who seek shall find.

[edited by: Lord_Majestic at 8:04 pm (utc) on Mar. 18, 2008]