Some of the products on my site have an additional page with some characteristics.
Those pages have a "noindex,noarchive,nofollow" header and are not listed in Google, but they are reported in WMT as Soft 404s.
Same problem with search results from the internal search engine, for queries without results (the script is also noindex,nofollow,noarchive).
Should I worry?
I'm getting paranoid, since it seems I cannot get rid of Panda :(
netmeg
4:21 pm on Jan 7, 2015 (gmt 0)
Ok you do know the robots directives are different from the browser response codes, right? And that just because you put a NOINDEX on a page doesn't mean Google won't look at it (it does - it has to, to know not to index it) and that a NOINDEX does not automatically create a 404?
not2easy
4:38 pm on Jan 7, 2015 (gmt 0)
A soft 404 means that the page does not exist, but it does not return a 404. If these pages are being dynamically generated, you need to make sure your script handles non-existent pages with only a 404 and not alternative results. Google does not like Soft 404s, though 404s are seen as a normal part of the internet.
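To make that concrete, here is a minimal sketch of a script that answers with a real 404 when the page doesn't exist. It is written as a bare WSGI app in Python purely for illustration; the /extra/<id> URL layout, the EXTRA_PAGES lookup and the page text are all made up, so adapt it to however your own script finds its data.

```python
# Hypothetical data: product id -> text of the extra "characteristics" page.
EXTRA_PAGES = {"101": "Weight: 2 kg. Colour: black."}


def app(environ, start_response):
    """Minimal WSGI app: /extra/<id> returns 200 only if the page exists."""
    path = environ.get("PATH_INFO", "")
    product_id = path.rsplit("/", 1)[-1]
    text = EXTRA_PAGES.get(product_id)

    if text is None:
        # Send a real 404 status, not a 200 page that merely says "not found".
        start_response("404 Not Found",
                       [("Content-Type", "text/plain; charset=utf-8")])
        return [b"Page not found"]

    # The page exists: serve it with 200 and keep it out of the index.
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("X-Robots-Tag", "noindex, nofollow, noarchive"),
    ])
    return [text.encode("utf-8")]
```

The point is only that the status line and the body agree: a missing page gets a 404 status as well as a "not found" message.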
ergophobe
5:34 pm on Jan 7, 2015 (gmt 0)
I'm not sure if I'm reading you right, but as I read it you mean to say that these pages are not actually soft 404s, but GWT is seeing them that way?
Normally when Google flags a Soft 404 this means that Google is finding some text on the page that indicates you are trying to serve a 404 (so "404" or "Page not found"), the text is practically identical across a large number of pages and the HTTP Response code is not a 4xx but a 200 (commonly a 302 followed by a 200). Google then algorithmically decides that this must be a Soft 404.
So the first thing is, if they are valid pages that you actually want available for users, but not to be indexed, then you probably want to block the crawl. You might also get rid of any text that says 404 or Page Not Found or similar.
You say those pages are not listed in Google, which is presumably your goal. But then you say "same problem with search results from the internal search engine." I don't follow that. What exactly is the problem?
Or put another way, what is the current result and what is the desired result? Are you talking about how search results pages from the internal search engine appear in Google and you don't want that, or about how your internal search engine does not show the special pages?
Assuming the former, which seems most likely in the context, you probably want to block all search results pages in robots.txt. If you noindex them, Google has to crawl them to see the NOINDEX as netmeg said, which is wasting a ton of crawl.
If you've already gotten those other pages out of the index that are showing as Soft 404s, then you would probably want to block them so you don't waste what crawl time Google is going to allocate to your site.
If, on the other hand, they actually ARE soft 404s, then you need to fix that and make them into real 404s or preferably 410s. NOINDEX, robots disallow or anything like that is not a solution to actual Soft 404s. The solution to a true Soft 404 is to make it into a real 4xx.
Mentat
8:17 pm on Jan 7, 2015 (gmt 0)
There are 2 situations:
A. Some pages have very little content, depending on the product.
They return a 200 code, and all have the HTTP header "noindex, nofollow, noarchive".
Because the content is very thin, WMT reports them as "Soft 404".
B. The other set of Soft 404s are the no-result pages, like when you search for "wfcwecvwevcercvewcew" => Sorry, no result found
Also, those pages are not indexed.
The POSSIBLE problem is that Google reports them as "Soft 404".
lucy24
8:42 pm on Jan 7, 2015 (gmt 0)
This is not the first time that GWT has revealed information that it was supposed to pretend it didn't have. Similarly, when you look up "sites that link to you" and then investigate the site, some of those links will actually be <nofollow>.
Check your logs. Google periodically asks for a garbage name such as "nhtuhtjkyt.html" to make sure the request gets a 404. They want to be sure the site is capable of returning a 404 response when the URL unambiguously doesn't exist. This seems to be 100% automated behavior, triggered whenever your site starts returning an unexpected number of 301s, regardless of cause.
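If you'd rather not eyeball the logs, a rough Python sketch like this pulls out the Googlebot requests and the status each one received. It assumes the common "combined" access log format, and the two sample lines are invented for the example, not taken from a real log.

```python
import re

# Combined Log Format: we only need the request path, status and user agent.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) .*"(?P<agent>[^"]*)"$'
)


def googlebot_hits(lines):
    """Yield (path, status) for every log line whose user agent names Googlebot."""
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match and "Googlebot" in match.group("agent"):
            yield match.group("path"), int(match.group("status"))


# Invented sample lines, for illustration only.
SAMPLE = [
    '66.249.66.1 - - [07/Jan/2015:20:42:00 +0000] "GET /nhtuhtjkyt.html HTTP/1.1" '
    '404 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [07/Jan/2015:20:43:10 +0000] "GET /index.html HTTP/1.1" '
    '200 2048 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

for path, status in googlebot_hits(SAMPLE):
    print(status, path)
```

Any garbage-looking path that shows a 200 instead of a 404 in that output is worth a closer look. Matching on the user agent string alone does not prove the request really came from Google, so treat it as a first pass.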
ergophobe
10:05 pm on Jan 7, 2015 (gmt 0)
lucy24, Google is welcome to the information. NOINDEX doesn't mean "don't save any information about this"; it means "don't put this in the search index and show it in search results".
If you do a site: search for a specific term and Google refuses to show it, they are doing what they should.
So I would just make two followup observations...
1. This is a signal the pages don't have enough content to count as pages, as I suggested in my first post.
2. You're wasting crawl time.
So if there is a generic way to block them, I would probably block them.
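For what it's worth, here is a small sketch of how you could sanity-check a generic block before deploying it, using Python's standard urllib.robotparser. The /search and /extra/ paths and the example.com URLs are my assumptions about the site layout, not something from this thread, so substitute your real paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /search and /extra/ are assumed paths for the internal
# search results and the thin "characteristics" pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /extra/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in (
    "https://www.example.com/search?q=wfcwecvwevcercvewcew",
    "https://www.example.com/extra/101",
    "https://www.example.com/product/101",
):
    verdict = "crawlable" if parser.can_fetch("Googlebot", url) else "blocked"
    print(verdict, url)
```

With those rules the first two URLs should come back blocked and the product page crawlable. Python's parser does simple prefix matching and may not agree with Googlebot on every edge case, so double-check the live file with Google's own tester too.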
phranque
12:45 am on Jan 8, 2015 (gmt 0)
there are tradeoffs regarding robot exclusion vs noindexing. if you block googlebot from crawling these urls, google will not see your noindex directive. when google indexes these urls without crawling the content, the SERP descriptions will be:
A description for this result is not available because of this site’s robots.txt – learn more [support.google.com].
this is assuming google indexes these urls and they show up in a search result.
your ultimate solution depends on your priorities:
- a "clean index"
- preserving "crawl budget" for more important content
- reducing GWT-reported "soft 404s"