For example, the httpd log file contains
220.246.54.157 - - [18/Mar/2008:05:11:32 -0400] "GET /chantilly_va-rs4359/ HTTP/1.1" 200 17577 "http://www.domain.com/chantilly_va-rs4359" "Mozilla/5.0 (compatible; MJ12bot/v1.2.1; hxxp://www.majestic12.co.uk/bot.php?+)"
and my .htaccess contains:
RewriteCond %{HTTP_USER_AGENT} ^MJ12bo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MJ12bot [NC,OR]
RewriteRule .* - [F]
But it's not matching and I'm forced to use a "deny from" on the IP address range.
Any suggestions or recommendations would be greatly appreciated. These things keep overloading my server and it's taking lots of time away from other projects.
I'm no expert, but try
RewriteCond %{HTTP_USER_AGENT} ^.*MJ12bot
Incidentally, I'm not sure if the bot you're seeing is actually from MJ12 (they've posted about fake versions [majestic12.co.uk]). Still, that bot has been around for years without any seemingly useful element for site owners.
[edited by: Receptional_Andy at 10:31 am (utc) on Mar. 18, 2008]
Rather than using the "^.*" subpattern, you can just remove the start-anchor:
RewriteCond %{HTTP_USER_AGENT} MJ12bot
Note that MJ12bot is a legitimate robot which reads and obeys robots.txt. However, it is currently being spoofed by others. There's a post by the owner in our robots.txt forum describing how to distinguish the legitimate MJ12bot from the spoofs.
Jim
Thanks for the correction, Jim. When I posted mine I had a nagging thought telling me there was a much better way, but I couldn't figure it out at the time ;)
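For reference, a minimal sketch of the corrected rule, assuming mod_rewrite is enabled and the file is a per-directory .htaccess. The User-Agent in the log line starts with "Mozilla/5.0", so a pattern anchored with "^" can never reach "MJ12bot"; the unanchored pattern matches it anywhere in the string. The trailing [OR] on the last RewriteCond is also dropped here, since there is no following condition for it to combine with.

```apache
# Sketch only: assumes mod_rewrite is available in this .htaccess context.
RewriteEngine On

# No "^" anchor: match "MJ12bot" anywhere in the User-Agent, case-insensitively.
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC]

# Answer every matching request with 403 Forbidden.
RewriteRule .* - [F]
```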
That looks like one of ours, though I can't say for sure as the IP does not seem to match exactly. I would need to see the bot's request for robots.txt to be sure.
If you want to stop it crawling your site, all you need to do is add it to robots.txt:
User-agent: MJ12bot
Disallow: /
If you want extra protection, I can add it to a special no-crawl list; we can discuss it via sticky if you want.
Blocking by IP is not a smart move, and using .htaccess to block bots that support robots.txt is generally not very smart either. At the very least, allow robots.txt to be fetched so the block can be implemented efficiently.
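One way to keep an .htaccess block while still letting the bot read robots.txt is to exempt that one URL from the rule. This is only a sketch, assuming robots.txt sits at the site root and mod_rewrite is enabled; adjust the path if your setup differs.

```apache
# Sketch only: assumes robots.txt is served from the site root.
RewriteEngine On

# Skip the block for robots.txt so the bot can still read its Disallow rules.
RewriteCond %{REQUEST_URI} !^/robots\.txt$

# Match "MJ12bot" anywhere in the User-Agent, case-insensitively.
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC]

# Everything else requested by that User-Agent gets 403 Forbidden.
RewriteRule .* - [F]
```

With this in place, a well-behaved crawler can fetch robots.txt, see the Disallow, and stop requesting pages, while spoofed bots that ignore robots.txt still receive the 403.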
We indeed had a problem with fake bots pretending to be us - however this situation seems to have stopped in late January 2008 and I have not had any fake bot reports since then.
So, to reiterate: we obey robots.txt, including the Crawl-Delay parameter, which can help slow down crawling. The new bot now in beta testing also supports GZIP to reduce bandwidth usage on sites. If you want to block us, please please please use robots.txt - this will be best for you and us.
Edit - Receptional Andy: we released a big index last month that I think is beneficial to site owners. I can't post much about it here, but those who seek shall find.
[edited by: Lord_Majestic at 8:04 pm (utc) on Mar. 18, 2008]