I am developing a website: first locally, then on a test server so my customer can look at it. Ultimately I will move everything from the test server to the customer's website.
What I would like to accomplish is that the work on my test server does not appear in any search engine's listings. My question is simple: how do I block all search engines from finding the content of my test site and subsequently showing it in their search results?
I already have the following robots.txt:
# No robots should visit this site
User-agent: *
Disallow:
But the contents of robots.txt are not always honored by robots. Access to the website can also be blocked with .htaccess, but in that case you need to exclude robots by name. Is there a catch-all that prohibits ALL robots from accessing my test pages?
Macamba
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^A* [OR]
RewriteCond %{HTTP_USER_AGENT} ^a* [OR]
RewriteCond %{HTTP_USER_AGENT} ^B* [OR]
RewriteCond %{HTTP_USER_AGENT} ^b* [OR]
RewriteCond %{HTTP_USER_AGENT} ^C* [OR]
...
RewriteCond %{HTTP_USER_AGENT} ^Z* [OR]
RewriteCond %{HTTP_USER_AGENT} ^z*
RewriteRule ^(.*)$ http://www.robotstxt.org/
I have now excluded all user agents starting with any letter of the alphabet. But am I now excluding too much, such as my own web browser?
Macamba
User-agent: *
Disallow:
This robots.txt explicitly allows all bots to crawl your site. If you want to disallow them all then it should look like this:
User-agent: *
Disallow: /
If you use .htaccess, consider that you will also deny access to robots.txt itself. Many bots will then assume it does not exist and conclude that your site is okay to crawl. They will keep getting access denied, and you will see lots of those entries in your log analysis, but don't complain to the bot masters: you have not provided a publicly available robots.txt telling bots to go away.
Always allow robots.txt to be read by any bot!
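As a rough sketch of that advice, assuming Apache 2.4 with an .htaccess file in the document root and AllowOverride AuthConfig enabled, something like the following denies everyone except one trusted address while still letting any bot read robots.txt. The IP address is only a placeholder, not something from this thread:

```apache
# Deny every visitor except one trusted address.
# 203.0.113.10 is a documentation placeholder; substitute your own IP.
Require ip 203.0.113.10

# Exempt robots.txt so every bot can still fetch it
# and see the "Disallow: /" rule.
<Files "robots.txt">
    Require all granted
</Files>
```

The Files section is merged after the directory-level rule, so its Require line should take precedence for that one file.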
That isn't what you want to do.
You just made your whole website have infinite duplicate content, and stopped everyone from seeing any of the real pages on the site.
This is not a good method to block access.
Try using .htpasswd to keep everyone except authorised people out.
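A minimal sketch of what that could look like, assuming Apache with mod_auth_basic and an .htaccess file in the test site's document root. The realm name, file path, and user name are made up for illustration:

```apache
# Create the password file once from a shell (outside the web root), e.g.:
#   htpasswd -c /home/example/.htpasswd customer
# (path and user name are hypothetical)

AuthType Basic
AuthName "Test site"
AuthUserFile /home/example/.htpasswd

# Only users listed in the password file get in;
# browsers prompt for a login, bots receive 401 Unauthorized.
Require valid-user
```

Since this checks credentials rather than user agents, it keeps out robots that ignore robots.txt without locking out your own browser.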