Which is loaded first, robots.txt or htaccess?

Hiding a test site from bots, can both be deployed?

dennisjensen

10:04 am on Nov 19, 2018 (gmt 0)

Hi,

I'm new here, so excuse me if I chose the wrong forum to post in.

We're opening a test site, and we don't want bots around just yet. We're deploying an X-Robots-Tag header to stay out of the SERPs. But if we also don't want bots on the site at all, could we disallow them via robots.txt? Or would htaccess and robots.txt exclude each other?

I can't wrap my head around that.

Kind regards
Dennis

justpassing

10:18 am on Nov 19, 2018 (gmt 0)

It depends on what exactly you want to do. But htaccess rules are applied before anything else, since they determine what the web server is going to do with the request.

The advantage of htaccess is that you can really block robots, whereas a robot reading robots.txt may not obey it :)

Now it all depends on the kind of rules you are putting in your htaccess.

dennisjensen

10:28 am on Nov 19, 2018 (gmt 0)

Thanks for passing by :)
Quick update: for my part, I asked for an X-Robots-Tag stating noindex,nofollow for the test site. Whether other rules will be applied, I don't think so, but I can't rule it out. But are you saying that X-Robots-Tag would effectively keep the test site out of the SERPs and block bots from entering?

phranque

10:58 am on Nov 19, 2018 (gmt 0)

The ideal solution is to use HTTP Basic Authentication, which returns a 401 status until the visitor/browser supplies credentials.
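To illustrate the 401 behaviour described above, here is a minimal Python sketch of the decision a Basic-Auth-protected server makes on each request. The function name, the realm text and the credentials are invented for illustration; on a real site the web server's own authentication module would do this work.

```python
import base64
import hmac


def basic_auth_response(authorization, user, password, realm="Test site"):
    """Return (status, headers) as a Basic-Auth-protected server would.

    Hypothetical helper: `authorization` is the raw Authorization
    request header value, or None if the client sent none.
    """
    challenge = {"WWW-Authenticate": f'Basic realm="{realm}"'}
    if not authorization or not authorization.startswith("Basic "):
        # No credentials yet: answer 401 and ask for them.
        return 401, challenge
    try:
        decoded = base64.b64decode(authorization[6:], validate=True).decode("utf-8")
    except ValueError:
        # Malformed base64 or non-UTF-8 bytes: treat as missing credentials.
        return 401, challenge
    expected = f"{user}:{password}"
    # Constant-time comparison of "user:password".
    if hmac.compare_digest(decoded.encode("utf-8"), expected.encode("utf-8")):
        return 200, {}
    return 401, challenge


# A bot that sends no credentials only ever sees the 401 challenge.
print(basic_auth_response(None, "tester", "s3cret"))
```

Since a crawler never supplies credentials, it never receives any page content, which is why this approach keeps a test site both uncrawled and unindexed.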

dennisjensen

11:02 am on Nov 19, 2018 (gmt 0)

Phranque, thx. Unfortunately, we can't make use of that solution.

justpassing

11:38 am on Nov 19, 2018 (gmt 0)

But are you saying that X-Robots-Tag would effectively keep the test site out of the SERPs and block bots from entering?

Not exactly. I meant that if your htaccess has rules that block access based on User-Agent and/or IP, it will prevent bots from accessing the pages.

If you set a server header such as X-Robots-Tag, it works the same as the equivalent HTML meta tag: robots will crawl the page and, IF they obey, will not index it. (robots.txt is different: it asks robots not to crawl in the first place.) With Googlebot / Bingbot, this "should" work. But if other robots don't obey or don't even support such a directive, this will not prevent them from indexing the pages.

Also, just because Google is not indexing a page doesn't mean that it won't use the content...
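The difference between a crawl rule (robots.txt) and an index rule (X-Robots-Tag) can be sketched in Python. This is a simplified model of a rule-abiding crawler, not how any particular search engine is implemented; `polite_bot_outcome` and `ExampleBot` are invented names, and only the standard library's `urllib.robotparser` is used.

```python
from urllib.robotparser import RobotFileParser


def polite_bot_outcome(path, robots_txt, response_headers, agent="ExampleBot"):
    """Model a rule-abiding crawler handling one URL.

    Returns (crawled, may_be_listed). Hypothetical helper, not a real crawler.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(agent, path):
        # Crawl rule: the page is never requested, so a noindex header on it
        # is never seen. The bare URL can still be listed if linked elsewhere.
        return False, True

    # Index rule: the page is fetched, then the header is honoured.
    headers = {name.lower(): value for name, value in response_headers.items()}
    directives = {d.strip().lower() for d in headers.get("x-robots-tag", "").split(",")}
    return True, not ({"noindex", "none"} & directives)


page_headers = {"X-Robots-Tag": "noindex, nofollow"}
# Disallowed in robots.txt: not crawled, header never seen.
print(polite_bot_outcome("/page", "User-agent: *\nDisallow: /", page_headers))  # (False, True)
# No robots.txt rules: crawled, and the noindex header is honoured.
print(polite_bot_outcome("/page", "", page_headers))  # (True, False)
```

A robot that ignores both mechanisms is outside this model entirely; only a server-side block or authentication stops it.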

dennisjensen

11:54 am on Nov 19, 2018 (gmt 0)

So, using htaccess to block bots is better than the X-Robots-Tag? Isn't blocking via htaccess a bit like disallowing via robots.txt: bots won't crawl, but the pages might still be listed in the SERPs? Is it possible to block via htaccess and deploy X-Robots-Tag simultaneously?

NickMNS

1:35 pm on Nov 19, 2018 (gmt 0)

Robots.txt is like putting a sign on your door: "Do Not Enter". Most will obey but many won't. Whereas htaccess is like installing a deadbolt: most, if not all, will be unable to get in unless they have a key.

not2easy

2:49 pm on Nov 19, 2018 (gmt 0)

In my experience few will obey (like maybe a dozen) and most will ignore (like hundreds) what you request via robots.txt. It does not prevent crawling and copying whatever is available. It is not intended for that.

If the site must be available to the public you can use tools that only apply to nice robots. Is there a reason the development needs to be done online on a live domain? It is pretty common to develop locally and then deploy when it's all ready for indexing.

dennisjensen

3:28 pm on Nov 19, 2018 (gmt 0)

@NickMNS - The deadbolt is a very good description. But I also don't want SEs telling the rest of the world where to go see a deadbolt. So I want to stay out of the SERPs (X-Robots-Tag), and I'd prefer not to have SEs snooping around (htaccess block). Can those two elements coexist without causing trouble?

@not2easy We're out of our local sandbox, testing with a limited number of clients.

lucy24

4:03 pm on Nov 19, 2018 (gmt 0)

Thread title:

Which is loaded first, robots.txt or htaccess?

htaccess is not "loaded". It is a file read by the server before processing each and every request, including requests for robots.txt. Unless you've got the world's worst server setup, visitors--whether human or robot--cannot read the htaccess file.

If a search engine has never, ever crawled the site, there's really no need to worry about it showing up in the index, because why would it? Search engines have got millions of targets that they have seen, and that have links pointing to them.

The only exception is if your site name is some actual word or phrase; search engines that like exact matches might offer up the site just because the name precisely matches a search query. But for most names this is not really a concern. (I cite my test site, which I conventionally refer to as "Google Plays Silly Buggers"--not its actual name, but it works as an analogy. On rare occasions, people really will come in from other search engines. And then they're disappointed to find it is just a test site.)

It's a good idea to let everyone see robots.txt--and to Disallow universally--because the only thing better (for the server) than a blocked request is a request that was never made in the first place.

In a way, keeping access open can be useful, as it tells you which robots crawl even though they've been told not to. So you get a head start on compiling your block list.
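As a rough sketch of that idea, the Python below scans access-log lines for user agents that fetched robots.txt and then requested other URLs anyway. It assumes Apache's "combined" log format and a site-wide Disallow; the function name and the sample lines are invented for illustration, and user-agent strings can be faked, so treat the result as a starting point rather than a verdict.

```python
import re
from collections import Counter

# Assumes Apache "combined" log format; adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def disobedient_agents(log_lines):
    """Count non-robots.txt requests per agent that also fetched robots.txt."""
    saw_rules = set()
    other_hits = Counter()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        if m["path"] == "/robots.txt":
            saw_rules.add(m["agent"])
        else:
            other_hits[m["agent"]] += 1
    # Agents that read the rules and requested pages regardless.
    return {agent: n for agent, n in other_hits.items() if agent in saw_rules}


# Invented sample lines for illustration only.
sample = [
    '203.0.113.5 - - [19/Nov/2018:10:04:00 +0000] "GET /robots.txt HTTP/1.1" 200 26 "-" "BadBot/1.0"',
    '203.0.113.5 - - [19/Nov/2018:10:04:02 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "BadBot/1.0"',
    '198.51.100.7 - - [19/Nov/2018:10:05:00 +0000] "GET /robots.txt HTTP/1.1" 200 26 "-" "NiceBot/2.0"',
]
print(disobedient_agents(sample))  # {'BadBot/1.0': 1}
```

Agents that never request robots.txt at all are not caught by this check and would need a separate pass over the log.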

not2easy

6:07 pm on Nov 19, 2018 (gmt 0)

Don't use .htaccess to block reputable robots such as Googlebot and Bingbot. They will comply with your X-Robots-Tag headers, but they won't see those headers if they are blocked in htaccess. As lucy24 says, they'll never see your robots.txt file either if they are blocked in htaccess.
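A toy Python model of that ordering may help: access rules run first, so a blocked bot gets a refusal for every URL and never reaches robots.txt or the X-Robots-Tag header. The block list, the `serve` function and the response bodies are all invented for illustration, and the model assumes the header is only attached to successful responses.

```python
BLOCKED_AGENT_SUBSTRINGS = ("googlebot", "bingbot")  # hypothetical block list


def serve(path, user_agent):
    """Toy model of a server that applies access rules before anything else.

    Returns (status, headers, body).
    """
    agent = user_agent.lower()
    if any(name in agent for name in BLOCKED_AGENT_SUBSTRINGS):
        # The block covers every URL, robots.txt included, and the
        # refusal carries no X-Robots-Tag header in this model.
        return 403, {}, ""
    if path == "/robots.txt":
        return 200, {"Content-Type": "text/plain"}, "User-agent: *\nDisallow: /\n"
    return 200, {"X-Robots-Tag": "noindex, nofollow"}, "<html>test page</html>"


# A blocked search engine bot sees neither the rules nor the header.
print(serve("/robots.txt", "Mozilla/5.0 (compatible; Googlebot/2.1)"))
# An unblocked client receives the page together with the noindex header.
print(serve("/", "Mozilla/5.0 (X11; Linux x86_64)"))
```

So the two measures can coexist on one site, but for any given bot only one of them is ever in effect: either it is blocked, or it is let in and shown the header.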

dennisjensen

9:18 am on Nov 20, 2018 (gmt 0)

Thanks to all of you, I'm very grateful for your feedback!
We'll get to work.