User-agent: Slurp
User-agent: Googlebot
User-agent: msnbot
User-agent: Mediapartners-Google
User-agent: Adsbot-Google
User-agent: ia_archiver-web.archive.org
Disallow: /forum/

User-agent: *
Disallow: /

User-agent: *
Disallow: /herring/
This has not worked: Google and Yahoo keep poking their heads in (I get an email notification whenever someone visits).
So I'm thinking of switching it to this:
User-agent: Slurp
User-agent: Googlebot
User-agent: msnbot
User-agent: Mediapartners-Google
User-agent: Adsbot-Google
User-agent: ia_archiver-web.archive.org
Disallow: /forum/
Disallow: /herring/

User-agent: *
Disallow: /
Disallow: /herring/
Disallow: /forum/
Does the second entry look correct?
The second code example is better, but it will block all other robots from all files, by virtue of the "User-agent: * -- Disallow: /" record. The Disallows that follow that line are therefore redundant.
Many if not most robots don't support the multiple-user-agents-per-record format. I suggest using that format only for the "big three" or "big four" robots, and using explicit records or the wild-card record for all minor robots.
The robots.txt [webmasterworld.com] forum may be more appropriate for getting good answers to further questions.
Jim
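To make the point about record selection concrete, here is a minimal PHP sketch of how a well-behaved robot might pick its record from the second example. It is not taken from any real crawler: parse_robots(), rules_for() and is_blocked() are hypothetical names, the matching is a simplified substring and prefix match, and the first "User-agent: *" record is assumed to win if there are several.

```php
<?php
// Hypothetical helper: split robots.txt text into records, each holding
// its user-agent tokens and its Disallow path prefixes.
function parse_robots(string $text): array {
    $records = [];
    $current = null;
    $lastWasAgent = false;
    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        if (strtolower($field) === 'user-agent') {
            // Consecutive User-agent lines share one record; a User-agent
            // line that follows a rule line starts a new record.
            if (!$lastWasAgent) {
                if ($current !== null) { $records[] = $current; }
                $current = ['agents' => [], 'disallow' => []];
            }
            $current['agents'][] = strtolower($value);
            $lastWasAgent = true;
        } else {
            $lastWasAgent = false;
            if (strtolower($field) === 'disallow' && $current !== null && $value !== '') {
                $current['disallow'][] = $value;
            }
        }
    }
    if ($current !== null) { $records[] = $current; }
    return $records;
}

// A named record wins over the wild-card record (assumed behaviour).
function rules_for(array $records, string $robotName): array {
    $robotName = strtolower($robotName);
    $fallback = null;
    foreach ($records as $record) {
        foreach ($record['agents'] as $agent) {
            if ($agent === '*') {
                if ($fallback === null) { $fallback = $record['disallow']; }
            } elseif ($agent !== '' && strpos($robotName, $agent) !== false) {
                return $record['disallow'];
            }
        }
    }
    return $fallback ?? [];
}

// Simple prefix match of a URL path against the Disallow list.
function is_blocked(array $disallow, string $path): bool {
    foreach ($disallow as $prefix) {
        if (strpos($path, $prefix) === 0) { return true; }
    }
    return false;
}

$robots = "User-agent: Slurp\nUser-agent: Googlebot\nDisallow: /forum/\nDisallow: /herring/\n\n"
        . "User-agent: *\nDisallow: /\nDisallow: /herring/\nDisallow: /forum/\n";
$records = parse_robots($robots);
$google = rules_for($records, 'Googlebot');
$other  = rules_for($records, 'SomeOtherBot');

var_dump(is_blocked($google, '/index.html'));   // bool(false)
var_dump(is_blocked($google, '/forum/topic1')); // bool(true)
var_dump(is_blocked($other, '/index.html'));    // bool(true)
```

Because "Disallow: /" is a prefix of every path, the two Disallow lines after it in the wild-card record never change the outcome, which is why they are redundant.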
I want to block all bots except for these:
User-agent: Slurp
User-agent: Googlebot
User-agent: msnbot
User-agent: Mediapartners-Google
User-agent: Adsbot-Google
User-agent: ia_archiver-web.archive.org
And I want to block those bots from forum and herring, since herring will be the bot trap and forum indexing is a nightmare that leads to supplemental results.
I'm ultimately going to try to implement the solution described at kloth.net/internet/bottrap.php,
so I'm just taking it one step at a time.
[edited by: jdMorgan at 8:25 pm (utc) on Dec. 29, 2006]
[edit reason] De-linked [/edit]
I have questions about paths. After implementing it, the index page shows the following below the navigation text I added (for any user who somehow lands on the badrobot page and needs a way back to the main site):
Warning: fopen() [function.fopen]: open_basedir restriction in effect. File(/blacklist.dat) is not within the allowed path(s): ('.:/proc/uptime:/tmp:/home:/usr/local/lib/php:/nfs/home:/usr/home:/usr/local/bin/') in /home/domain/public_html/herring/index.php on line 16
Warning: fopen(/blacklist.dat) [function.fopen]: failed to open stream: Operation not permitted in /home/domain/public_html/herring/index.php on line 16
Error opening file ...
My site has a basic layout: www.example.com, with a hidden link to www.example.com/herring/index.php, which holds the bot trap.
I was also referred to this page [webmasterworld.com...]
Is the error message because I don't have sufficient privileges for that path?
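The warning itself names the path PHP tried to open: /blacklist.dat, at the filesystem root, which is outside every directory in the open_basedir list. That reads more like a path problem than a privilege problem. One way it can happen (an assumption, since line 16 of the script isn't shown) is a filename written with a leading slash or built from an empty variable. A minimal sketch of building the path from the script's own directory instead, using a hypothetical helper named data_file_path() and the __DIR__ constant (PHP 5.3 or later):

```php
<?php
// Hypothetical helper: join a directory and a file name into one absolute
// path, tolerating stray slashes on either part.
function data_file_path(string $dir, string $name): string {
    return rtrim($dir, '/') . '/' . ltrim($name, '/');
}

// __DIR__ is the directory of the current script, e.g. .../herring
$filename = data_file_path(__DIR__, 'blacklist.dat');

// Check before opening so the reason for a failure is visible.
if (!file_exists($filename)) {
    echo "Missing: $filename\n";
} elseif (!is_writable($filename)) {
    echo "Not writable by PHP: $filename\n";
}
```

If the file turns out to be missing or not writable, creating an empty blacklist.dat in the herring directory and giving the web server write permission on it would be the next thing to try.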
67.121.255.255 - - [2006-12-29 (Fri) 14:10:50] "GET /herring/index.php HTTP/1.1" Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)
Is there a way to also get this pushed into the .htaccess file, so that it prevents them from accessing the site again?
It involves $fp = fopen($filename,'a+');
fwrite($fp,"$REMOTE_ADDR - - [$datum] \"$REQUEST_METHOD $REQUEST_URI $SERVER_PROTOCOL\" $HTTP_REFERER $HTTP_USER_AGENT\n");
fclose($fp);
And is there any harm in being able to pull up the .dat in a browser window?
I have everything in place, starting with a hidden link on my home page:
<a href="/herring/index.php"><img src="images/pixel.gif" border="0" alt=" " width="1" height="1"></a>
And for whatever it's worth, here is the current .htaccess file, which is chmod 644:
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>
Options All Indexes
IndexOptions FancyIndexing
Options +FollowSymlinks
RewriteEngine on
rewritecond %{http_host} ^domain.com [nc]
rewriterule ^(.*)$ [domain.com...] [r=301,nc]
RewriteCond %{THE_REQUEST} ^.*\/index\.htm?
RewriteRule ^(.*)index\.html?$ [domain.com...] [R=301,L]
<Files 403.shtml>
order allow,deny
allow from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
.........
RewriteCond %{HTTP_USER_AGENT} ^Xaldon
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule !^http://[^/.]\.domain.com.* - [F]
This controls access to the site when a bad bot hits index.php.
Index.php is set up to pass the variable to this portion:
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>
Here is the text of index.php, which does two things: it adds the IP address to .htaccess, and it sends me an email notification and adds the user agent to a blacklist. It also checks the blacklist later, which may be a little redundant, since there is no real need to check the blacklist once the IP address has been denied.
(Forgive the sloppy paste-together job. Any optimization would be greatly appreciated; I don't know all the syntax, I just found that it works for me, so I'm not going to break it.)
This is chmod 755:
<?php
$lock_dir = $_SERVER["DOCUMENT_ROOT"] . "/lock";
$filename = $_SERVER["DOCUMENT_ROOT"] . "/.htaccess";
$bad_bot_ip = str_replace(".", "\.", $_SERVER["REMOTE_ADDR"]);
$content = "SetEnvIf Remote_Addr ^" . $bad_bot_ip . "$ getout\r\n";
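function make_lock_dir(){
global $lock_dir;
$key = @mkdir($lock_dir, 0777);
$i = 0;
/* retry up to 20 times while another request holds the lock */
while ($key === FALSE && $i++ < 20) {
clearstatcache();
usleep(rand(5,85));
$key = @mkdir($lock_dir, 0777);
}
/* return only once the lock is acquired or the retries are used up */
return $key;
}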
function write_ban(){
global $filename, $bad_bot_ip, $content, $lock_dir;
$handle = fopen($filename, 'r');
$content .= fread($handle,filesize($filename));
fclose($handle);
$handle = fopen($filename, 'w+');
fwrite($handle, $content,strlen($content));
fclose($handle);
rmdir($lock_dir);
print "Goodbye!";
}
function stale_check(){
global $lock_dir;
if (fileatime($lock_dir) < time()-120){
rmdir($lock_dir);
if (make_lock_dir()!== False) write_ban();
} else {
exit;
}
}
if (make_lock_dir()!== False) {
write_ban();
} else {
stale_check();
}
?>
<?php
if(version_compare(phpversion(), "4.2.0", ">=")) { /* compare as versions, not as strings */
extract($_SERVER);
}
?>
<html>
<head><title>Bad Robots </title></head>
<body>
<p>There is nothing here to see. So what are you doing here?</p>
<p><a href="http://www.domain.com">Go home.</a></p>
<?php
$badbot = 0;
/* scan the blacklist.dat file for addresses of SPAM robots
to prevent filling it up with duplicates */
$filename = "blacklist.dat";
$fp = fopen($filename, "r") or die ("Error opening file ... <br>\n");
while ($line = fgets($fp,255)) {
$u = explode(" ",$line);
/* exact comparison: a regex match would treat the dots as wildcards */
if ($u[0] === $REMOTE_ADDR) {$badbot++;}
}
fclose($fp);
if ($badbot == 0) { /* we just see a new bad bot not yet listed! */
/* send a mail to hostmaster */
$tmestamp = time();
$datum = date("Y-m-d (D) H:i:s",$tmestamp);
$from = "badbot-watch@domain.com";
$to = "webmaster@domain.com";
$subject = "Domainname: bad robot";
$msg = "A bad robot hit $REQUEST_URI $datum \n";
$msg .= "address is $REMOTE_ADDR, agent is $HTTP_USER_AGENT\n";
mail($to, $subject, $msg, "From: $from");
/* append bad bot address data to blacklist log file: */
$fp = fopen($filename,'a+');
fwrite($fp,"$REMOTE_ADDR - - [$datum] \"$REQUEST_METHOD $REQUEST_URI $SERVER_PROTOCOL\" $HTTP_REFERER $HTTP_USER_AGENT\n");
fclose($fp);
}
?>
</body>
</html>
<?php include($_SERVER['DOCUMENT_ROOT'] . "/herring/blacklist.php");?>
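Since optimization was asked for: one possible simplification of the lock-directory approach is PHP's built-in flock(). This is a swapped-in technique, not what the script above does, and prepend_line_locked() and ban_line() are hypothetical names. Note that flock() is advisory, so it only coordinates PHP requests with each other; Apache does not take the lock when it reads .htaccess.

```php
<?php
// Hypothetical helper: prepend one line to a file while holding an
// exclusive advisory lock, so concurrent requests do not interleave writes.
function prepend_line_locked(string $file, string $line): bool {
    // 'c+' opens for read/write, creates the file if needed, never truncates.
    $handle = fopen($file, 'c+');
    if ($handle === false) {
        return false;
    }
    if (!flock($handle, LOCK_EX)) {
        fclose($handle);
        return false;
    }
    $existing = stream_get_contents($handle); // read the current contents
    rewind($handle);
    ftruncate($handle, 0);                    // empty the file, then rewrite it
    fwrite($handle, $line . $existing);
    fflush($handle);
    flock($handle, LOCK_UN);
    fclose($handle);
    return true;
}

// Hypothetical helper: build the SetEnvIf line for one address, escaping
// the dots so they match literally.
function ban_line(string $ip): string {
    return 'SetEnvIf Remote_Addr ^' . preg_quote($ip) . '$ getout' . "\n";
}

// Possible use inside the trap page (left commented out here):
// prepend_line_locked($_SERVER['DOCUMENT_ROOT'] . '/.htaccess',
//                     ban_line($_SERVER['REMOTE_ADDR']));
```

The new line goes at the top of the file, as in the original script, so the SetEnvIf lines stay ahead of the <Files *> block that reads the getout variable.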
And then of course there is blacklist.php
<?php
if(version_compare(phpversion(), "4.2.0", ">=")) { /* compare as versions, not as strings */
extract($_SERVER);
}
$badbot = 0;
/* look for the IP address in the blacklist file */
$filename = "blacklist.dat";
$fp = fopen($filename, "r") or die ("Error opening file ... <br>\n");
while ($line = fgets($fp,255)) {
$u = explode(" ",$line);
/* exact comparison: a regex match would treat the dots as wildcards */
if ($u[0] === $REMOTE_ADDR) {$badbot++;}
}
fclose($fp);
if ($badbot > 0) { /* this is a bad bot, reject it */
sleep(12);
print ("<html><head>\n");
print ("<title>Site unavailable, sorry</title>\n");
print ("</head><body>\n");
print ("<center><h1>Welcome ...</h1></center>\n");
print ("<p><center>Unfortunately, due to abuse, this site is temporarily not available ...</center></p>\n");
print ("<p><center>If you feel this in error, send a mail to the hostmaster at this site,<br>
if you are an anti-social ill-behaving SPAM-bot, then just go away.</center></p>\n");
print ("</body></html>\n");
exit;
}
?>
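The same blacklist scan appears in both index.php and blacklist.php, so it could live in one shared function. A minimal sketch, assuming the log format shown earlier (IP address as the first space-separated field); ip_is_listed() is a hypothetical name. It compares the address exactly, because a pattern match would let 1.2.3.4 also match 1.2.3.45.

```php
<?php
// Hypothetical helper: true if $ip equals the first space-separated field
// of any line in the blacklist file; false if the file cannot be opened.
function ip_is_listed(string $file, string $ip): bool {
    $fp = @fopen($file, 'r');
    if ($fp === false) {
        return false;
    }
    $found = false;
    while (($line = fgets($fp)) !== false) {
        $fields = explode(' ', trim($line));
        if ($fields[0] === $ip) {   // exact match on the address field
            $found = true;
            break;
        }
    }
    fclose($fp);
    return $found;
}

// Possible use (left commented out here):
// if (ip_is_listed(__DIR__ . '/blacklist.dat', $_SERVER['REMOTE_ADDR'])) { exit; }
```

Treating an unreadable file as "not listed" is a design choice for the sketch; the scripts above stop with an error message instead.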
Finally, I have my robots.txt set up like this:
User-agent: Slurp
User-agent: Googlebot
User-agent: msnbot
User-agent: Mediapartners-Google
User-agent: Adsbot-Google
User-agent: ia_archiver-web.archive.org
Disallow: /forum/
Disallow: /herring/

User-agent: *
Disallow: /
Hopefully this helps someone out there.