Redirecting all 404s to home page - good or bad? - Webmaster General forum at WebmasterWorld

Forum Moderators: phranque

brix76

6:34 pm on Nov 28, 2011 (gmt 0)

Hi all,

I have been wondering whether redirecting all 404s directly to my homepage is a good idea,
or whether having a custom 404 page is better.
Some say that redirecting everything to my homepage could make the robots think I have a lot of duplicate content and rank me lower.
Others say it is in fact search engine friendly: fewer 404s, better ranking.

What do you guys think - redirect all to homepage or nicely done custom 404 page?

Thanks a lot.

tangor

1:53 pm on Nov 30, 2011 (gmt 0)

Where do you get this information?

It's part of Google's verification process... sites should return 404s... and when they don't, particularly for a Google-generated filename that Google knows to be bad, "it becomes problematic".

pageoneresults

2:23 pm on Nov 30, 2011 (gmt 0)

Wow, I seem to have started quite a discussion. happy!

We do this to all new WebmasterWorld Members, it's part of your initiation. :)

So, all the 404s that GWT finds are for articles or news stories that have been deleted.

If there are no suitable replacements for the articles and/or stories that have been deleted, and they are GONE forever, a 410 is suggested as it is more specific than a 404.

In your case, I'd surely look for replacements if they exist. Usually articles and/or stories have some inbound link equity attached to them. If you can, you want to take that equity and permanently redirect (301) it to a suitable replacement. If not, do the 410. Google does not like when sites return a large number of 404s, especially when there are a sufficient number of links pointing to those documents that are now gone.

In regards to the 301, you may want to think really hard about finding a suitable replacement for documents that have obtained link equity and trust. If you need to, at least put the user at the category level where the document previously existed. That is typically relevant and may help you to redistribute the link equity.

Is it better to do 410 instead of 404 for those? And how is that set up?

Yes, I'd suggest the 410 for anything that is GONE forever. It's set up much like a 404; you're just sending a different status code that is more specific.
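
As a rough illustration of the decision described above (301 if a suitable replacement exists, 410 if the page is gone for good, 404 otherwise), here is a minimal Python sketch. The `REPLACEMENTS` and `GONE` tables, the `response_for` name, and every path in them are hypothetical; a real site would fill them from its own records.

```python
# Hypothetical lookup tables; every path here is made up for illustration.
REPLACEMENTS = {
    "/news/2009/widget-recall.html": "/news/2011/widget-recall-update.html",
    # No direct successor, so fall back to the category level.
    "/articles/old-guide.html": "/articles/",
}
GONE = {"/news/2008/summer-promo.html"}  # deliberately removed, no successor


def response_for(path):
    """Return (status_code, location) for a URL that no longer serves content."""
    if path in REPLACEMENTS:
        # A relevant replacement exists: permanent redirect to it.
        return 301, REPLACEMENTS[path]
    if path in GONE:
        # Known to be removed for good, with no forwarding address.
        return 410, None
    # Nothing is known about this URL, so say only that it was not found.
    return 404, None
```

Only URLs with a genuinely relevant target end up in the 301 table; everything unknown still gets a plain 404.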

And is it possible to have it set up so they land on a 410 and then in a few seconds get redirected to my homepage?

I wouldn't suggest that. Use the custom 410 document to your advantage. Many will capture the referring information and serve a dynamic 410 based on that. They'll provide maybe 3 suggestions to the user and a search box so the user can dig further if they wish. If you can capture what brought them to the 410 document and serve dynamic content based on that, you put the user that much closer to what they were originally looking for.
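
One simple way to produce those suggestions might look like the sketch below. It is only an assumption about how such a page could work: it matches on the requested path rather than the referring information, and the `LIVE_PATHS` list and `suggest_pages` name are hypothetical.

```python
import difflib

# Hypothetical list of pages that currently exist on the site.
LIVE_PATHS = [
    "/news/widget-recall.html",
    "/news/widget-launch.html",
    "/articles/choosing-a-widget.html",
    "/contact.html",
]


def suggest_pages(requested_path, live_paths=LIVE_PATHS, limit=3):
    """Return up to `limit` live paths that look similar to the missing one."""
    # get_close_matches ranks candidates by string similarity and drops
    # anything scoring below the cutoff.
    return difflib.get_close_matches(requested_path, live_paths, n=limit, cutoff=0.5)
```

The custom 410 (or 404) document would list whatever this returns, next to the search box.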

enigma1

2:35 pm on Nov 30, 2011 (gmt 0)

Problematic for what? For the Google verification? Actually, you don't tell the whole story about it. That is one of the verification methods Google has, and one I have never used. I go for meta verification, as in my case it is configurable without adding files to the server.

And I don't see the relevance to the general 404 vs 301 question we are discussing. Do you have some documentation where Google forces webmasters to generate 404s? Or even 301s?

tangor

3:29 pm on Nov 30, 2011 (gmt 0)

WMT wants an auth code/file on your system. Probably have that installed. As for the other, my best documentation is my raw logfiles and the recurring requests for three ALWAYS DIFFERENT non-existent files each month from g's bot. These files started after setting up wmt with g some years back. All get a 404 because the requested pages do not exist on the site(s).

Seem to recall reading comments here at WW some years back, but I've slept since then and brain's fuzzy.

pageoneresults

3:35 pm on Nov 30, 2011 (gmt 0)

Seem to recall reading comments here at WW some years back, but I've slept since then and brain's fuzzy.

Google IP requesting site verification file -- in several different variations
Sep 25, 2006 - [WebmasterWorld.com...]

The check for the non-existent file is to make sure that the server returns a 404 for a file that doesn't exist (if the server returns a 200, then we have no way of knowing if the verification file actually exists or if the server just returns a 200 for everything).
Vanessa Fox

301 > 200

Same thing in this scenario.

tangor

3:54 pm on Nov 30, 2011 (gmt 0)

Thanks, pageoneresults! That's the one I was thinking of.

I actually make use of several gone files which return 404s as a scraper honeypot. Most other files removed are permanently 301 to replacement or category pages. 404s also help in refining .htaccess to ban bad actors. I find 404s useful... Relating my experience, not offering argument.

That said, on most sites I manage, the standard 404 file is returned, not a custom 404. On the few sites where I have a custom 404 with a clear link to the index page and/or a local site search box, I manage to keep about 30% of the traffic for at least one more page on site.

enigma1

3:57 pm on Nov 30, 2011 (gmt 0)

Requests from Googlebot can come from many sources. Unless GWT shows an error somewhere, I have no evidence that they make invalid requests on purpose.

Also:
301 > 200 != 200

404 and 301 in that case state the same thing: the link is no longer valid.

pageoneresults

4:27 pm on Nov 30, 2011 (gmt 0)

Unless GWT shows an error somewhere, I have no evidence that they make invalid requests on purpose.

You wouldn't see them since you're serving a 301>200 for ALL invalid requests. Also, I don't recall seeing these types of "Google generated errors" appearing in GWT. Since I don't have them, I wouldn't see them, huh?

404 and 301 in that case state the same thing: the link is no longer valid.

My understanding is this...

10.4.5 404 Not Found
[W3.org...]

The server has not found anything matching the Request-URI. No indication is given of whether the condition is temporary or permanent. The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address. This status code is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable.

10.4.11 410 Gone
[W3.org...]

The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.

The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server's site. It is not necessary to mark all permanently unavailable resources as "gone" or to keep the mark for any length of time -- that is left to the discretion of the server owner.

10.3.2 301 Moved Permanently
[W3.org...]

The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible. This response is cacheable unless indicated otherwise.

The new permanent URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s).

If the 301 status code is received in response to a request other than GET or HEAD, the user agent MUST NOT automatically redirect the request unless it can be confirmed by the user, since this might change the conditions under which the request was issued.

These are basic protocols that have been established for many years. You're welcome to choose whichever HTTP responses you want. Many will tell you that choosing the wrong response can have detrimental effects e.g. in your case where you have no 404s being reported in GWT. I have no personal experience with this type of scenario. My educated guess is that you're having indexing challenges that you may not be aware of. The best way to test this is to fix what you have set up now.

I'd also be willing to set up a test page of 1,000 URIs (301>200) pointed to your site with my choice of path names and anchor text. Are you that sure that there would be "zero effect"? ;)

tangor

4:29 pm on Nov 30, 2011 (gmt 0)

The link and quote above are from a now ex-Google employee. How g is handling all 200 (meaning no 404) in wmt I don't know, but it is clear that at one time they really wanted to know if the site/server would return 404s.

301 and 404 are not the same thing, though the requested link is certainly no longer valid. :)

pageoneresults

4:38 pm on Nov 30, 2011 (gmt 0)

The requested resource has been assigned a new permanent URI.

So, a 301 is basically saying...

"HEY! There used to be a resource here related to the link you clicked but it has been moved and assigned a new "permanent" URI. Please update your link to reflect the new destination URI."

It's a signal that there "used to be" a document at the location requested. Now, let's go back to my offering of linking to 1,000 non existent pages on your site. Are you sure you want to allow me the opportunity to associate certain anchor text (along with other signals) with your site? In theory, I should be able to totally bork certain aspects of your indexing routines. I should be able to associate specific anchor text with the destination of your 301>200 procedures e.g. your home page. I could think of all sorts of mischief that could be done. You? :(

[edited by: pageoneresults at 4:38 pm (utc) on Nov 30, 2011]

brix76

4:38 pm on Nov 30, 2011 (gmt 0)

Thanks a lot pageoneresults for the thorough comment. People were getting a bit off topic.
Do you by any chance know why in GWMT, some of the links displayed in the "Crawl errors" are in blue and most of them are in black?
Does that have to do with whether the link is internal or external?
Thanks.

pageoneresults

4:44 pm on Nov 30, 2011 (gmt 0)

Do you by any chance know why in GWMT, some of the links displayed in the "Crawl errors" are in blue and most of them are in black?

Don't smack me upside the head for this but... Are those possibly visited links? Mine are more of a purple color.

enigma1

4:53 pm on Nov 30, 2011 (gmt 0)

It's a signal that there "used to be" a document at the location requested.

That's right and if you see my comment on the first page of this thread:

The website exposes specific content, so as the owner or webmaster you expect every single request to your website to be related to the content you expose, don't you?

Following common sense and normal operations you expect the request to be for a previously existing page. So an attempt to place the user on a relevant page is your best bet. What is the real difference between the two headers other than completely losing previously existing traffic with 404 on the page?

I'd also be willing to set up a test page of 1,000 URIs (301>200)
pointed to your site with my choice of path names and anchor text. Are you that sure that there would be "zero effect"? ;)

From me you have the go-ahead; I can put it in a formal email if you wish. I get at least a thousand invalid requests daily, some of which I believe I channel to my advantage. All the figures I have access to show very low consumption of resources, mainly because of the 301s.

g1smd

7:38 pm on Nov 30, 2011 (gmt 0)

Do you by any chance know why in GWMT, some of the links displayed in the "Crawl errors" are in blue and most of them are in black?

I thought there was some significance, then realised that it appears to simply be visited/not visited.

phranque

8:19 pm on Nov 30, 2011 (gmt 0)

Where do you get this information?

the server access log file.

netmeg

8:32 pm on Nov 30, 2011 (gmt 0)

@brix76 - I *strongly* suggest you listen to pageoneresults and g1smd. They are giving you the best practice on this.

lucy24

10:41 pm on Nov 30, 2011 (gmt 0)

301 > 200 != 200

In the specific case of google's verification file, 301 > 200 means that you have a with/without www redirect in place. One of the two should get a 301. I've never got a g### request for a nonexistent file-- possibly they only do it for grownup sites ;) --but they always ask for both forms of the domain name. Bing, otoh, consistently asks for the without-www form even though they have been expressly told to use with-www.

brix76

10:19 am on Dec 1, 2011 (gmt 0)

@netmeg - I realized that yesterday. :)

@g1smd - Is that visited by me or by outside visitors?

phranque

10:47 am on Dec 1, 2011 (gmt 0)

Is that visited by me or by outside visitors?

visited by you - using that browser.
this is typically accomplished in css using anchor-pseudo-classes:
http://www.w3.org/TR/CSS1/#anchor-pseudo-classes

MichaelBluejay

8:44 pm on Dec 2, 2011 (gmt 0)

Okay, I'm not following all this, but what's wrong with using "ErrorDocument 404 siteindex.html"? That returns a proper 404 for the requested page (the one that doesn't exist), and then sends the user to the site index so they can find what they want.

aakk9999

1:42 am on Dec 9, 2011 (gmt 0)

@MichaelBluejay

Okay, I'm not following all this, but what's wrong with using "ErrorDocument 404 siteindex.html"? That returns a proper 404 for the requested page (the one that doesn't exist), and then sends the user to the site index so they can find what they want.

This is not entirely true. It does not "return a 404 and then send the visitor to siteindex.html".

What it does is return a 404 together with response content that is taken from siteindex.html.

When the server returns the response to a request, you get back headers and content.
Headers (amongst other things) specify HTTP response code, which would be 404.
The content of the response would be taken from siteindex.html

So the .htaccess directive
ErrorDocument 404 /siteindex.html

basically tells the server:
If the document does not exist, return the response code 404 and in the content part of this response code place the content found in the file siteindex.html

If you watch the address bar when such a response is returned, you will see that the URL has NOT changed to siteindex.html; it remains whatever non-existent URL you requested. So there was no redirect to siteindex.html; siteindex.html is just used as the source of the content part of the 404 response.
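
To make the headers-versus-content distinction concrete, here is a small Python sketch of the same behaviour. It is not how Apache is implemented, just a model of the response it produces; the `PAGES` and `ERROR_BODY` values and the `respond` function are made up for illustration.

```python
# Hypothetical page bodies, standing in for files on disk.
PAGES = {"/index.html": b"<h1>Home</h1>"}
ERROR_BODY = b"<h1>Site index</h1>"  # plays the role of siteindex.html


def respond(path):
    """Return (status, headers, body) for a GET request to `path`."""
    if path in PAGES:
        return 200, {"Content-Type": "text/html"}, PAGES[path]
    # No Location header is sent, so the client is not redirected and the
    # address bar keeps the URL that was requested. Only the body of the
    # response comes from the error document.
    return 404, {"Content-Type": "text/html"}, ERROR_BODY
```

A redirect, by contrast, would carry a 3xx status and a Location header, and the browser would make a second request to the new URL.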

@enigma1
With regards to you not losing the visitors and if you *really* want a visitor to land on a "home page", what you *could* do is to return 404 response in headers, with the content being the same or similar to your home page (if you are using ErrorDocument then this "home page" lookalike would have to be pure html rather than generating the page dynamically). This way you would achieve both: visitors would land on what looks like a home page and you would return 404 in one hit.

The simplest way to achieve this is to do "view source" of your home page, save it as a pure html document under some different name (e.g. showhomepagewhen404.html), trim the unnecessary bits (if any) and then place this document on the server. Then you change your .htaccess to have ErrorDocument 404 /showhomepagewhen404.html

Personally, I would still create a nice html document which would have the same navigation as the site has and that would tell the visitor "The page you searched for was not found. Use the menu navigation to navigate the site...." or similar.

phranque

5:22 am on Dec 9, 2011 (gmt 0)

(if you are using ErrorDocument then this "home page" lookalike would have to be pure html rather than generating the page dynamically)

i would agree that using a static html document for the ErrorDocument is a best practice but it is not a requirement and in fact apache's documentation specifies a perl script as the example document for the 404 error:
http://httpd.apache.org/docs/2.0/mod/core.html#errordocument

ErrorDocument 404 /cgi-bin/bad_urls.pl

enigma1

10:12 am on Dec 9, 2011 (gmt 0)

is to return 404 response in headers, with the content being the same or similar to your home page

Sorry, I am not going to return a 404 under any circumstances. The only case where a 404 is returned is because of bugs in some Apache versions that don't pass control to the application and stick a 404 out; it happens with some crafted hack attempts, although I don't care. If they manage to hack the host, that is a different issue.

I can do exactly the same thing with 301 which not only puts the visitor to the page of my choice but also transfers some ranking to the redirected page, it's good for SEs and you can't transfer rank with 404.

And if you suggest differentiating pages that need a 404 because there is no equivalent, my response is basically: why complicate the code, since the 301 serves both cases, brings the same results and is much simpler to implement? I have yet to see a valid argument.

I outlined my points earlier on, I don't know if you got a chance to read them.

aakk9999

10:23 am on Dec 9, 2011 (gmt 0)

@phranque Thanks, did not know this
@enigma1 Yes I have read all your posts.

enigma1

11:26 am on Dec 9, 2011 (gmt 0)

These are basic protocols that have been established for many years

Also about these protocols, they do not account for application specifics. We can go on with wording arguments. From the 404 description you posted:

404.

No indication is given of whether the condition is temporary or permanent.

So you signal to a search engine, could be temporary therefore feel free not to drop it, try again later.

410.

Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval.

This one signals to SEs if you approve it, you can drop it.

301.

The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible. This response is cacheable unless indicated otherwise.

That states exactly what I want.

Also, my point is not that 404 or 410 is useless; it's part of the spec and has its uses. If your site has just a few physical pages and no dynamic request processing, you don't want to flood the server scripts like htaccess with endless conditions and redirects. But if your application can process the request in a dynamic manner, there is no question in my mind what header to return.

g1smd

12:13 pm on Dec 9, 2011 (gmt 0)

I'll say it again. Using a 301 redirect for all non-valid URLs is a signal of low technical quality and essentially gives a site infinite URL space.

Google requests random URLs to test the response, and is expecting a 404 (or a www/non-www redirect to the same path and then a 404).
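
You can run a similar test against your own site. The sketch below is not Google's actual check, only an illustration of the idea: ask for a path that cannot exist and see what comes back. The function names and result labels are made up; fetching the URL is left to whatever HTTP client you prefer, as long as it does not follow redirects automatically.

```python
import uuid
from urllib.parse import urlsplit


def random_probe_path():
    """Build a path that almost certainly does not exist on any site."""
    return "/probe-" + uuid.uuid4().hex + ".html"


def classify_probe(status, probe_path, location=None):
    """Interpret the response to a request for a nonexistent path."""
    if status in (404, 410):
        return "ok"
    if status in (301, 302, 307, 308) and location:
        # A hostname redirect (www/non-www) keeps the same path; the
        # redirect target should then be probed again and return a 404.
        if urlsplit(location).path == probe_path:
            return "follow"
        # Redirecting a nonsense URL somewhere else, e.g. the home page.
        return "catch-all redirect"
    if status == 200:
        # The server claims a page exists at a URL that was just invented.
        return "soft 404"
    return "other"
```

Anything other than "ok" (possibly after one "follow") means the site answers for URLs that were never valid.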

enigma1

2:47 pm on Dec 9, 2011 (gmt 0)

OK, I can say something new then. By the same token, infinite 404s create infinite error pages, which implies very low technical quality too. And I have explained why all this, and the infinite redirects, don't apply.

Google requests random URLs to test the response, and is expecting a 404

There is no documentation or reference that backs this up. Most likely the bot follows whatever you throw at it, therefore anybody can force 404s on your domain (just because you return 404) by setting invalid links to it. And the bot may keep retrying 404s especially if you had pages in the past listed that are now gone.

And a 404 isn't necessarily permanent; that, however, is backed up by the w3 spec at least.

g1smd

3:36 pm on Dec 9, 2011 (gmt 0)

By the same token, infinite 404s create infinite error pages, which implies very low technical quality too.

No. The 404 status says it doesn't exist. The 301 status says it does exist and has moved. That's an important difference.

enigma1

4:00 pm on Dec 9, 2011 (gmt 0)

The 301 status says it does exist and has moved

No, it says it no longer exists at the specified address and has moved to a different one. The 404 can be temporary, see for yourself.

No indication is given of whether the condition is temporary or permanent.

pageoneresults

4:03 pm on Dec 9, 2011 (gmt 0)

Do 404s hurt my site?
[GoogleWebmasterCentral.BlogSpot.com...]

enigma1, the above is right from the horse's mouth. What you're doing is incorrect and has the potential to harm your site overall (and your clients if you've implemented this on their sites). Continue doing so at your own risk.

There is no documentation or reference that backs this up.

There are hundreds of references to this online. In fact, John Mu has commented and confirmed this in the Google Help Forums. Why are we even discussing this again? ;)

[edited by: pageoneresults at 4:22 pm (utc) on Dec 9, 2011]

This 110 message thread spans 4 pages.