410 + noindex, nofollow, noarchive - but page still indexed! - Google Search and SEO forum at WebmasterWorld

Forum Moderators: Robert Charlton & goodroi

410 + noindex, nofollow, noarchive - but page still indexed!

partyark

12:50 pm on Nov 28, 2013 (gmt 0)

Really the title says it all.

We have a domain where all pages are served with the status 410. Its content is just a simple nofollow link to our core site. It's a typo domain essentially: it's a ".co" instead of a ".co.uk".

To be belt and braces the html has a noindex,nofollow,noarchive meta directive. But there it is in the SERPs.

Now there used to be a 'proper' page at that particular url. Its indexed content has now been replaced by the content of the 410 page. And the noarchive has been honoured. But not the noindex.

The take-away from this is:
1. You have an existing indexed page in the SERPs
2. For whatever reason you want to noindex it. Perhaps it has sensitive content.
3. You add the tag and change the content
4. ... don't be surprised if your new content appears anyway.

Oh and 5: 410 (and probably 404) may not help.

I suspect eventually the page will disappear. But worth mentioning nevertheless.

phranque

6:59 pm on Nov 28, 2013 (gmt 0)

what does "fetch as googlebot" in GWT say?

are you redirecting to a 410 page?
are you showing the 410 custom error page but serving a 200 status code?
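
A rough way to answer those two questions yourself is to request the URL once without following redirects and look at the raw status. This is only a sketch: `probe` and `diagnose` are made-up helper names, not part of any tool mentioned in this thread.

```python
import http.client
from urllib.parse import urlsplit


def probe(url):
    """Request url once, without following redirects.

    Returns (status, location). http.client never follows redirects itself,
    so a 301/302 pointing at an error page shows up as a 3xx here.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("GET", target)
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def diagnose(status, location):
    """Map a raw status onto the cases asked about above (hypothetical labels)."""
    if 300 <= status < 400:
        return f"redirect to {location}"
    if status in (404, 410):
        return "error status served directly"
    if status == 200:
        return "200 OK: an error page served this way is a soft 404"
    return f"other status {status}"
```

Something like `diagnose(*probe("http://www.example.com/old-page"))` (example.com standing in for the real host) would then report which of the cases applies.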

lucy24

10:02 pm on Nov 28, 2013 (gmt 0)

:: head spinning ::

If the request receives a 410 (or 404), how can it even have a <meta robots> header? Any passing robot would have to assume that the page it receives-- assuming it even looks at it-- is the 410 page, not the page originally requested.

rainborick

12:48 am on Nov 29, 2013 (gmt 0)

If a URL returns a 404 or 410 HTTP Status code, I'm not surprised that any search engine would ignore the robots <meta> tag included in the HTML. (1) The server says the URL doesn't exist, so any noindex/nofollow/noodp instructions are meaningless. (2) The HTML could easily be a custom 404/410 that has a 'noindex/nofollow' instruction just in case a bot somehow tries to access the 404 page's HTML file directly.

It's not unusual for URLs to remain in the index for quite some time after they start to return a 404 or 410. One likely reason for this is to allow for inadvertent errors by webmasters, and if every URL in a domain suddenly started to return 410's I don't think it's unreasonable for Google to react very slowly and incrementally.

If you want this domain removed from the index, I think a better solution would be to either allow it to return a 200 Status code so that Googlebot will see the "noindex/nofollow" instruction(s), or use the URL Removal Tool.

JD_Toims

1:13 am on Nov 29, 2013 (gmt 0)

(1) The server says the URL doesn't exist...

No, it doesn't -- it says the "content" associated with the URI is either unable to be found or has been removed.

The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed.

[w3.org...]
In addition to saying the "content" was purposely removed, a 410 says "please remove the link to this page".

As far as not seeing the noindex, do you really think Google, Bing, etc. just discard the html of an error page without even a look?

Google can go back and tell someone that the version of a page posted N weeks ago contained a noindex tag, and that this is why it is not currently indexed, even when the current version has no noindex and the owner, assuming a penalty or something, had no clue the tag was ever there.

They save and use everything they have at their disposal.

They get all the content from any error page, because they use GET requests, not HEAD. A 404, 410 or other "hey, no info for you" status (401, 402, 403, 405, 406) does not stop them from getting any HTML on the error page itself -- the server sends it along with the response headers.
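
To illustrate the GET vs HEAD point, here is a small self-contained sketch using only Python's standard library. It starts a throwaway local server that answers everything with 410 plus an HTML body, then fetches the same URL both ways. The handler, the `/old-page` path and the body are invented for the demo, and it says nothing about what any search engine does with the body once it has it.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Invented error-page body for the demo.
ERROR_HTML = b'<!doctype html><meta name="robots" content="noindex"><title>Gone</title>'


class GoneHandler(BaseHTTPRequestHandler):
    """Answer every request with 410 Gone; only GET carries the HTML body."""

    def _send_headers(self):
        self.send_response(410)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(ERROR_HTML)))
        self.end_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(ERROR_HTML)

    def do_HEAD(self):
        self._send_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(url, method):
    """Return (status, body) for one request, treating 4xx as a normal result."""
    req = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the error object still holds the body.
        return err.code, err.read()


def demo():
    server = HTTPServer(("127.0.0.1", 0), GoneHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/old-page"
    try:
        return fetch(url, "GET"), fetch(url, "HEAD")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns one `(status, body)` pair per method: both carry status 410, but only the GET result includes the HTML.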

If they're looking for "quality websites" to send people to, why would they not check and see if the site had useful, visitor friendly error pages or not while they have the full contents of the page right there in the system?

Added: If I had 10 links on the custom 404/410 page of a 100 page website, and a page you requested could not be found which would you "the visitor" think were the most important [top of the structural hierarchy] pages: the 90 I left unlinked or the 10 I decided to include?

Why would Google or Bing or anyone else throw that info away, since being search engines they have to try to understand hierarchy, structure and relationships within documents constantly?

lucy24

3:49 am on Nov 29, 2013 (gmt 0)

do you really think Google, Bing, etc. just discard the html of an error page without even a look

On my site, all error pages reference a separately named stylesheet. I work on the assumption that if a search engine looks at the error document, it will eventually request this stylesheet. Not concurrently, but definitely within a few days, and then periodically afterward. No request = it ignored the error document.

For comparison purposes, google reads most stylesheets on my site every day or two. Other search engines aren't as interested. As it happens, all stylesheets come with a noindex header. But search engines don't know this beforehand. There's also the possibility that a stylesheet will, in turn, reference an image file.

The first time I ever saw a known search engine requesting errorstyles.css was only a few weeks ago.* I remember mentioning it in a post. (It is hard to look this up retroactively, because the plainclothes bingbot also requests everything a human would get in response to a 403. So does yahoo slurp.)

* Detour to raw logs reveals it was 23 September, with no later pickups from the googlebot or anything like it. Search engines are exposed to my 403 page on a regular basis because I don't let them get midi files. They're much smaller than the 403 document; it's the principle of the thing. (Why are you crawling something you can't index? I don't let them have .sit or .zip files either.) And, thanks to a robots.txt booboo, the most active search engines also got 403 on some page requests earlier this month.

JD_Toims

3:59 am on Nov 29, 2013 (gmt 0)

Out of sheer curiosity, I might switch the header on the site in question to a 403 Forbidden rather than a 410 Gone with the same error page and see if it made any difference in Google's handling.

Any passing robot would have to assume that the page it receives-- assuming it even looks at it-- is the 410 page, not the page originally requested.

I can see where they could assume it's not the original contents of the page [better to check, compare, know imo, because there's no telling what goofiness a webmaster's going to dream up lol].

But they really can't assume it's the same content they would get by requesting the URL of the 404/410 page directly, because there could very easily be a dynamic error page with contents that change based on the requested URI.

An example of what I think would be a great implementation of a custom error page would be on a product/directory site where the number of items listed changes causing different pagination so at times example.com/category/product-list/page-4 would be 404ed, but at others would be available.

In this type of situation, especially if a site had a number of categories, it would be very visitor-friendly to have a custom 404 page serve "the most popular product(s) listings" on the error page itself and also include links to the start of the "list" of products, the category page of the product-list, the home page of the site, the sitemap for the specific section of the site, and probably a couple of other "request uri specific" things I'm forgetting, as well as links to "overall key pages".

The first time I ever saw a known search engine requesting errorstyles.css was only a few weeks ago.*

In threads like these (e.g. this one, titles on html elements, etc.) I keep forgetting to add: And if not today, what about tomorrow? Why not assume *everything* counts or will and get it right now rather than thinking "oh, it doesn't matter" and then somewhere down the road it does, so not only is the site "behind", the only option is to go back and redo things that could have [usually IME] been taken care of much more easily initially.

The bot request of the style-sheet does seem it could indicate they may now be interested where pre-Sept. they weren't -- Of course, it could also be site raters are/were asked if they request a 404ed URL if the page is helpful or not and it's counted for much longer *or* they've been more interested in the contents of the page than how it looks previously -- In any case, done "right from the start" means there's no need to check, watch, worry or change anything.

partyark

8:50 am on Nov 29, 2013 (gmt 0)

Thanks for the replies.

"how can [the 410/404] even have a <meta robots> header?" - because it's got a page body.

What I thought was interesting was _not_ that the googlebot read the contents of the 410 page (no reason not to), it was that it replaced its old indexed content with the 410 content, despite the 'noindex' directive.

aakk9999

10:37 am on Nov 29, 2013 (gmt 0)

Interestingly, there was a recent thread reporting Google indexing a page that is 301 redirected:

Google returns both correct and incorrect versions of 301ed url
http://www.webmasterworld.com/google/4623582.htm

I am wondering whether there is a bug somewhere on Google's side.

partyark

2:05 pm on Nov 29, 2013 (gmt 0)

Speculating, I would think that the process that handles de-indexing is quite separate to the one that handles content. So the content's being updated, but the no-index aspect hasn't kicked in yet.

On reflection, I'm quite pleased Google's done this. After all, the old content is gone, overwritten by the contents of the 'not here' page.

What it indicates is that if you simply change the http status of a page (301/302/404/410) _without changing the content too_ you're probably not achieving what you want. This is especially pertinent to duplicate pages.

phranque

4:43 pm on Nov 29, 2013 (gmt 0)

have you checked your server access logs to see the response received by googlebot?

what does "fetch as googlebot" in GWT say?
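
For the access-log check, something along these lines would tally what was actually sent to anything calling itself Googlebot. It is a sketch that assumes Apache-style "combined" log lines (IIS logs use a different layout and would need a different pattern), and the two sample lines are invented for illustration, not taken from anyone's real logs. The user-agent string can be spoofed, so treat the result as a first look only.

```python
import re
from collections import Counter

# Assumed layout: Apache "combined" log format.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_statuses(lines):
    """Count (path, status) pairs for requests whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            counts[(match.group("path"), int(match.group("status")))] += 1
    return counts


# Invented sample lines, for illustration only.
SAMPLE = [
    '66.249.66.1 - - [29/Nov/2013:17:19:28 +0000] "GET /123432 HTTP/1.1" 410 255 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [29/Nov/2013:17:20:02 +0000] "GET /123432 HTTP/1.1" 410 255 '
    '"-" "Mozilla/5.0 (Windows NT 6.1)"',
]
```

Running `googlebot_statuses(SAMPLE)` counts only the first line; a 200 or a 3xx in that tally would point at one of the problems asked about earlier.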

partyark

5:14 pm on Nov 29, 2013 (gmt 0)

Hi phranque - the fetch as google works as you'd expect - it reports 'Not found' and if you look at the html received it's correct too.

What's odd is that it says "The page seems to redirect to itself" :

The page seems to redirect to itself. This may result in an infinite redirect loop. Please check the Help Center article about redirects.

HTTP/1.1 410 Gone
Content-Type: text/html
Location: http://www.widget.com/123432
Date: Fri, 29 Nov 2013 17:19:28 GMT
Connection: close
Content-Length: 255

<!doctype html>
<html>
 <head>
 <meta charset=utf-8>
 <meta name="robots" content="noindex,nofollow,noarchive">
 <title></title>
 </head>
 <body>Go to <a href="http://www.widget.co.uk/" rel="nofollow">Widgetworld</body>
</html>

That seems like a bug, since clearly it's not redirecting to itself in any sense....

aakk9999

7:53 pm on Nov 29, 2013 (gmt 0)

The page seems to redirect to itself. This may result in an infinite redirect loop. Please check the Help Center article about redirects.

Maybe this is the reason why Googlebot is confused:

HTTP/1.1 410 Gone
Content-Type: text/html
Location: http://www.widget.com/123432
Date: Fri, 29 Nov 2013 17:19:28 GMT
Connection: close
Content-Length: 255

A "Location" header is not normally sent with a 410 response. So seeing the "Location", Googlebot *may* try to be too clever in thinking that you wanted a redirect instead of serving 410 Gone. So if the URL you tried to fetch is the same as the "Location:" returned in the headers, this could explain the "redirect loop" message you got when you tried WMT Fetch as Googlebot.

Why not remove the "Location" header you are sending with the 410 and see what happens then?
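
A quick check for this kind of mismatch could look like the sketch below. `stray_location` is a hypothetical helper, and the header values are copied from the response quoted above. It treats Location as expected only on 3xx and 201 Created, which is where the HTTP spec gives the header a defined meaning.

```python
def stray_location(status, headers):
    """Return the Location value if it accompanies a status where it is not expected.

    Location is expected on 3xx redirects and on 201 Created; on anything
    else (such as 410) a crawler has to guess what was meant.
    """
    # Header names are case-insensitive, so compare in lower case.
    location = next(
        (value for name, value in headers.items() if name.lower() == "location"),
        None,
    )
    if location is None:
        return None
    if status == 201 or 300 <= status < 400:
        return None
    return location


# Headers as quoted in the Fetch as Googlebot output above.
RESPONSE_HEADERS = {
    "Content-Type": "text/html",
    "Location": "http://www.widget.com/123432",
    "Connection": "close",
    "Content-Length": "255",
}
```

With the quoted headers, `stray_location(410, RESPONSE_HEADERS)` returns the Location URL, flagging the mismatch; with the header removed it returns `None`.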

partyark

10:53 pm on Nov 29, 2013 (gmt 0)

Ok, well that fixed the 'redirect loop' error. Thanks.

You know that's what's great about this place. Good spot aakk9999. One more thing to add to the best-practice list.

I think what's happening here is that Google's trying to second guess the intention of the server. It should really just ignore the Location, but I guess it sees it and thinks 'this server is probably trying to redirect' even though it's not a 300 series status.

For what it's worth, you get the Location if you 410 on IIS in the httpErrors section of web.config. The Location header can be removed using a rewrite outbound rule.

Sgt_Kickaxe

11:08 pm on Nov 29, 2013 (gmt 0)

You can't redirect to a 410 (or 404), or else Google records the jump page as a url. It then cannot see your noindex tag on that url, since the url redirects. Round and round we go.

aakk9999

12:00 am on Nov 30, 2013 (gmt 0)

I think what's happening here is that Google's trying to second guess the intention of the server. It should really just ignore the Location, but I guess it sees it and thinks 'this server is probably trying to redirect' even though it's not a 300 series status.

Agree - although I would say "trying to guess the intention of the site developer", and yes, because it finds conflicting information between the response code received and the other HTTP headers received. Mind you, in your case it was the IIS server that added Location automatically - so perhaps 'intention of the server' in this context is right!

I think the same problem is in the second thread I linked to above [webmasterworld.com...] - over there the 301 response is arriving with cookies being set and the full page HTML even though the HTTP response code is 301 - which, we suspect over there in that thread, causes Google to ignore the 301 and treat it as a 200.

It would be nice if you could report back in a few weeks to let us know whether the URLs you are responding to with 410 Gone have started to drop off the SERPs after removing the 'Location' headers.

partyark

10:36 am on Nov 30, 2013 (gmt 0)

I don't need to report back in a few weeks, they've already gone from the SERPs! I'd caution against reading too much into this. I think what probably happened is just that the 'update content' process isn't the same as the 'noindex' process, and it took the latter a little longer to catch up.

But there is an interesting takeaway - that for all non-200 statuses be very careful about what and how you're doing it, don't try to be clever and certainly don't do anything non-standard, because you can't be sure how your actions might be interpreted.

JD_Toims

11:04 am on Nov 30, 2013 (gmt 0)

But there is an interesting takeaway - that for all non-200 statuses be very careful about what and how you're doing it, don't try to be clever and certainly don't do anything non-standard, because you can't be sure how your actions might be interpreted.

I definitely agree -- I've noted over the years Google *usually* doesn't "break standard" and their reps don't flat-out lie afaik -- The reps may talk in "vague Google speak", or something internally may have changed they're unaware of, but reps most often use, "Precise wording people have to interpret *exactly* the way it's said," to "get what they're saying" and have it actually be useful.

Again though, usually [meaning except for a few cases or specific situations, like the non-standard handling of 302's to 'fix the 302 bug' from years ago -- they had to 'break the HTTP/W3 standards' to fix the 302 issue, which is, honestly, the first time I can remember them just plain throwing the standard out -- and some robots.txt issues when they would crawl disallowed URLs for some reason] they do things *exactly* according to the standards, so I definitely agree with, "Being sure to not 'throw a curve-ball'." as a great take-away from this thread.