X-Robots Noindex or 403 Forbidden?

Forum Moderators: Robert Charlton & goodroi


Rysk100

2:07 pm on Jul 11, 2018 (gmt 0)

Further to my post on [webmasterworld.com...]

I added the x-robots tag no index directive to the http header response for the /app directory which Google was indexing and which I didn't want indexing.

Our Dev Ops guy though added a 403 response to this folder (which should have been there originally) and now from what I understand Google can't act upon the no index directive because of the 403

What should I do:

1. Keep the 403 response code - Google will eventually remove the URLs because of this
2. Remove the 403 response code so Google can see the x-robots no index directive
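For reference, the header setup described in this post can be sketched as follows. This is only an illustration, assuming a Python WSGI stack (the thread never says which server or framework the site runs on); `add_noindex_header`, `demo_app` and `response_headers` are hypothetical names, and `/app/` is the prefix taken from the post.

```python
def add_noindex_header(app, prefix="/app/"):
    """Wrap a WSGI app so responses under `prefix` carry X-Robots-Tag: noindex."""
    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            if environ.get("PATH_INFO", "").startswith(prefix):
                # Drop any existing X-Robots-Tag so only one value is sent.
                headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
                headers.append(("X-Robots-Tag", "noindex"))
            return start_response(status, headers, exc_info)
        return app(environ, patched_start_response)
    return wrapped


def demo_app(environ, start_response):
    # Hypothetical stand-in for the real site: always answers 200.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


def response_headers(app, path):
    """Call a WSGI app directly (no network) and return (status, headers dict)."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {"PATH_INFO": path, "REQUEST_METHOD": "GET"}
    b"".join(app(environ, start_response))  # consume the body iterable
    return captured["status"], captured["headers"]


site = add_noindex_header(demo_app)
print(response_headers(site, "/app/plugins/example.js"))
print(response_headers(site, "/about/"))
```

The header only helps if the response it rides on is one a crawler will act on, which is why the status code matters for the question above.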

Robert Charlton

7:40 pm on Jul 11, 2018 (gmt 0)

Mod's note: I've changed the spelling in thread title from incorrect "No Index" to corrected "Noindex". Have left the spelling in the post unchanged.

Leosghost

8:05 pm on Jul 11, 2018 (gmt 0)

Our Dev Ops guy

..<= you have a "dev ops guy"* you have my sincere commiserations..

*one should never dignify such a spurious title / job description with even Camel Case..

If the folder cannot be accessed due to the 403..the directive cannot be read, by G, nor anyone / thing else, other than those with password access to your site..and any eventual hackers that might be intrigued enough to take a look..

<snort>dev ops guy</snort>

lucy24

8:45 pm on Jul 11, 2018 (gmt 0)

If you don't want Google crawling the /app directory, why can't you simply disallow it in robots.txt? Sure, Google always kicks up a fuss when there's something it is not allowed to crawl, but the appropriate response is to ignore them. Is the directory filled with pages that are constantly getting linked from other people's sites, so there is a real danger of its content showing up in SERPs?

I may not want to know why two different people--in this case you and Dev Ops Guy--have the independent power to modify responses or govern access on the same site.

with even Camel Case

It's only Camel Case if you write it "DevOps". Otherwise it's Title Case.

Leosghost

8:53 pm on Jul 11, 2018 (gmt 0)

True..But..can / should "dev ops guy" be a "title" (use of "Title Case" would imply such)..

I would consider it to be more of an expletive..

Rysk100

6:27 am on Jul 12, 2018 (gmt 0)

lucy24 - the /app folder contains the site's CSS / JS etc. Google needs to be able to crawl these URLs, see: [webmasterworld.com...]

Is anyone able to directly answer my question as to how best to remove the 100+ /app and /wp/wp-includes URLs that are now in Google's primary index

1. Keep the 403 response code - Google will eventually re-crawl and remove the URLs because of this response code
2. Remove the 403 response code so Google can see the x-robots no index directive

Leosghost

8:42 am on Jul 12, 2018 (gmt 0)

Already did answer that above..remove the 403..

Rysk100

8:56 am on Jul 12, 2018 (gmt 0)

Leosghost - thanks for the reply

Right, but Google also says that to remove a URL permanently you should 404/410 or block by requiring a password, which in a roundabout way is what a 403 does
(see 'Make Removal Permanent' in [support.google.com...])

Google isn't explicitly saying 403, but it's the same intention for a hard removal?

I say this as my R+D doesn't want these CMS directories /app, /wp/wp-includes open to users

keyplyr

8:57 am on Jul 12, 2018 (gmt 0)

What should I do

Your questions were answered in the last discussion you started.

If you block the folder with a 403, robots can't crawl to verify the noindex that would remove the files from being indexed.

But you went ahead and 403'd the folder anyway. And now you started another thread about the same issue.

Rysk100

9:10 am on Jul 12, 2018 (gmt 0)

I didn't want to block via robots.txt as the /app folder contains the actual site's CSS / JS, which I understand it is now best practice to allow crawling of.

I also didn't like the solution offered of disallowing crawling of the folder but with some exceptions for the CSS and JS files, as I read elsewhere that this isn't foolproof and Google will often take the first disallow as the stronger directive

So, I set up an x-robots tag - then our dev ops person (not me) did a 403 on the /app folder creating a new issue for me and thus a new thread

If you follow the chain of events 1) my questions in the 1st thread weren't really answered 2) this is now a separate issue because of the 403

Thanks anyway

keyplyr

9:22 am on Jul 12, 2018 (gmt 0)

Google will often take the first disallow as the stronger directive

You are misinformed. That is not accurate.

You can test your robots.txt in GSC.

not2easy

12:33 pm on Jul 12, 2018 (gmt 0)

I read elsewhere that this isn't foolproof and Google will often take the first disallow as the stronger directive

The instructions to be sure that the "Allow" follows the "Disallow" are from Google. I use them myself and I know that it works as expected. As mentioned, you can test and verify in your GSC account. The Header set X-Robots-Tag "noindex" does not prevent crawling of anything, but Disallow does. If you disallow folders, or respond with a 403 error, Google can't see the noindex header. Until they know that files in that folder should not be indexed, they will remain indexed.

A 403 response will eventually remove the files from the index, but noindex headers tell Google not to index the files. If you have files that you do not want to have indexed, the best way to manage those files is to move them to a folder and password protect and Disallow the folder. Then you can use Google's tools to remove URLs from the index as the indexed URLs will return a 404 error and requests for the URLs from Google will read the X-Robots headers that tell them to remove the URLs from the index. If you absolutely need to keep those files where they are then the choice is up to you.

In your case, the 403 response might prevent more damage because not all robots care about X-Robots headers and Disallow directives. If Google found and indexed them, they are likely in many other places.
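To illustrate the Allow/Disallow point, here is a rough sketch of how rule precedence works as I understand Google's published robots.txt rules: the matching rule with the longest path wins, Allow wins a tie, and the order of the lines in the file does not matter. The rule set and test paths below are hypothetical, and this is not Google's actual parser, so the robots.txt tester in GSC remains the place to verify a real file.

```python
import re


def _rule_to_regex(pattern):
    # '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = ".*".join(re.escape(part) for part in core.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(rules, url_path):
    """rules: list of (directive, pattern) pairs for one user-agent group."""
    best = None  # (pattern length, is_allow); tuple order makes Allow win ties
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if _rule_to_regex(pattern).match(url_path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule means the URL may be crawled.
    return True if best is None else best[1]


# Hypothetical rule set: block /app/ but let CSS and JS through.
rules = [
    ("Disallow", "/app/"),
    ("Allow", "/app/*.css"),
    ("Allow", "/app/*.js"),
]

print(is_allowed(rules, "/app/themes/site/style.css"))
print(is_allowed(rules, "/app/mu-plugins/readme.txt"))
print(is_allowed(list(reversed(rules)), "/app/themes/site/style.css"))
```

Reversing the list gives the same answer for the stylesheet, which is the point: under this model the longer Allow pattern beats the shorter Disallow regardless of which comes first.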

lucy24

5:10 pm on Jul 12, 2018 (gmt 0)

Google also says

Tough on them. You don't have to do everything google says.

Qaeron

9:28 pm on Jul 12, 2018 (gmt 0)

Google aren't stupid, that's for sure; there are millions of websites on wordpress that don't have this problem. If Google are indexing it there is something very very wrong. Maybe it's time to ask your "DevOps" to explain exactly WTF you are doing in the includes and why google have chosen to index it. Solve the problem, don't hide it!

wp-includes contains the core functionality of wordpress and basically shouldn't be played about with, so if you've coded there it's time to sack the "DevOps". Sometimes a badly coded functions.php file in a theme can allow malicious code to be uploaded to the includes folder, and maybe it's a hack. Google have given you an "Easter Egg" and you're trying to hide it.

keyplyr

9:51 pm on Jul 12, 2018 (gmt 0)

not all robots care about X-Robots headers

Bing has said they do not support the X-Robots header

Rysk100

11:32 am on Jul 15, 2018 (gmt 0)

For those of you who actually read and responded to my question - thank you

phranque

10:00 pm on Jul 15, 2018 (gmt 0)

I say this as my R+D doesn't want these CMS directories /app, /wp/wp-includes open to users

what is the purpose of these directories?
blocking all requests for a set of urls is a different problem than removing those urls from the google index.

Qaeron

10:16 pm on Jul 15, 2018 (gmt 0)

I say this as my R+D doesn't want these CMS directories /app, /wp/wp-includes open to users

Good luck updating wordpress if you're coding in wp-includes. You do know that there is a plugin folder for coding and hooking into includes, right?

Honestly I think your R+D are full of #*$!

As for the app/ folder sounds like Magento install trying to combine with wordpress probably through fishpig. You do know woocommerce is better right?

My advice is be honest and comprehensive in your questions and not start multiple threads on the same topic.

I'm done answering you now tbh. 2 threads and you haven't listened to a damn thing anyone said.

Rysk100

6:14 am on Jul 16, 2018 (gmt 0)

I'm sorry that these 2 posts have caused you such emotional pain. No one demanded that you reply to me
Good luck

tangor

6:31 am on Jul 16, 2018 (gmt 0)

There's no magic answer, other than the ones expressed. 403 as a control is not the usual answer to anything. Real question is ... what is there to hide?

phranque

7:24 am on Jul 16, 2018 (gmt 0)

No one demanded that you reply to me

same here but if you wanted my opinion on your OP you would reply to my reply:

what is the purpose of these directories?

Rysk100

8:06 am on Jul 16, 2018 (gmt 0)

/wp/wp-includes - this is the core WP directory. I have since blocked crawling of this folder via /robots.txt and Google has since de-indexed 90% of URLs from this folder

/app - this contains the site's css, images, plug-ins etc e.g /app/mu-plugins/amazon-web-services/vendor/aws/Monolog/
/app/plugins/anspress-question-answer/templates/js-template/
Google has indexed hundreds of these URLs

phranque

10:33 am on Jul 16, 2018 (gmt 0)

which of these directories contains resources necessary for properly rendering your content?
you probably want googlebot crawling such resources but not indexing them.

addressing the discussed solutions:

- the X-Robots-Tag is the best solution for googlebot (and any other search engines that support it) as it allows crawling but not indexing of resources necessary for rendering content.

- the 403 response code removes the url from the index eventually, but might make properly rendering content impossible.
what happens when non-googlebot user agents request these resources?
if you are showing googlebot a different response than a non-googlebot request, it might be seen as a form of cloaking.
btw the X-Robots-Tag is irrelevant with a 403 status code.

- disallowing the googlebot crawl with robots.txt will eventually remove these urls from the index most likely because googlebot discovered these within your documents as embedded resources (eg images) and external resources (eg css/js)
however disallowing the googlebot crawl with robots.txt might also make properly rendering content impossible.
typically if you disallow crawling for a path discovered in an anchor element, the url will remain indexed with the typical "A description for this result is not available because of this site's robots.txt" snippet.

basically the answer to your OP is your solution is likely correct and your DOg is wrong.
(assuming my assumptions about your applications are correct)
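the comparison above can be put into a small sketch. it only models the reasoning in this post (a 4xx response means the header is never acted on, a 200 with noindex is crawlable but not indexable); it is not a description of googlebot internals. `crawler_view` and its labels are hypothetical, and user-agent-prefixed values such as "googlebot: noindex" are not handled.

```python
def crawler_view(status_code, headers):
    """Classify a response the way the three options in this post are compared."""
    value = ""
    for name, header_value in headers.items():
        if name.lower() == "x-robots-tag":  # header names are case-insensitive
            value = header_value
    directives = {d.strip().lower() for d in value.split(",") if d.strip()}

    if 400 <= status_code < 500:
        # Blocked: any X-Robots-Tag is irrelevant, removal depends on the status.
        return "status-only"
    if status_code == 200 and directives & {"noindex", "none"}:
        # Resource can be fetched for rendering but should not be indexed.
        return "noindex"
    if status_code == 200:
        return "indexable"
    return "other"


print(crawler_view(403, {"X-Robots-Tag": "noindex"}))
print(crawler_view(200, {"X-Robots-Tag": "noindex"}))
print(crawler_view(200, {"Content-Type": "text/css"}))
```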

Rysk100

11:12 am on Jul 16, 2018 (gmt 0)

@phranque - that is great thank you.
Seems like the overall consensus is to serve Google the x-robots tag noindex directive removing the 403 server response on these /app folders (files that Google may need to render the site correctly). I will ask dev ops if there's another solution that can allow both the serving of the x-robots directive whilst stopping visitors from accessing files/folders under the /app directory.