URLs not followed
When we tested a sample of the URLs from your Sitemap, we found that some URLs were not accessible to Googlebot because they contained too many redirects. Please change the URLs in your Sitemap that redirect and replace them with the destination URL (the redirect target). All valid URLs will still be submitted.
The page in question currently returns the following:
#1 Server Response: HTTP Status Code: HTTP/1.1 301 Moved Permanently
#2 Server Response: HTTP Status Code: HTTP/1.1 200 OK
I have three questions:
1.) Should the page(s) in question be removed from the sitemap?
2.) Is this error normal following the addition of a custom 404 page?
3.) If not, how do I fix this error?
Notice that your server never actually gave a 404 response? As time goes by, every bad url that search engines find will be 301 redirected to the same custom "error" message that comes with a 200 OK header. The end result will be thousands of valid urls with duplicate content.
Thank you both - it's now fixed!
#1 Server Response: HTTP Status Code: HTTP/1.1 404 Not Found
I have one other related question. Which is the best way to deal with non-existent pages:
1.) A simple 404.
2.) A 404 followed by a redirect to the homepage.
3.) A 404 followed by a redirect to a custom 404 page.
I usually use the custom message page - that way the visitor knows they asked for a problematic url, and the home page redirect doesn't give them that kind of feedback. The custom page can give helpful choices to the visitor, and the standard error message doesn't do that.
I would never redirect to the home page. That just doesn't seem like a good idea at all. The 404 should directly show your customised page at the URL "that doesn't exist".
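To illustrate what "directly show your customised page at the URL that doesn't exist" means, here is a minimal sketch in Python rather than Apache. The names `PAGES`, `NOT_FOUND_BODY` and `app` are made up for this example, and the page contents are placeholders. The point is only that the custom body is sent in the same response as the 404 status, with no Location header and no second request.

```python
# Hypothetical site content: only the home page exists.
PAGES = {"/": b"<h1>Home</h1>"}

# Hypothetical custom error page shown to human visitors.
NOT_FOUND_BODY = (
    b"<h1>Page not found</h1>"
    b'<p>Try the <a href="/">home page</a>.</p>'
)


def app(environ, start_response):
    """Tiny WSGI app: known paths get 200, everything else gets a real 404."""
    path = environ.get("PATH_INFO", "/")
    body = PAGES.get(path)
    if body is not None:
        status = "200 OK"
    else:
        # The custom page is served at the requested URL itself.
        # No redirect is issued, so the browser URL bar does not change.
        status = "404 Not Found"
        body = NOT_FOUND_BODY
    headers = [
        ("Content-Type", "text/html"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

To try it locally, something like `wsgiref.simple_server.make_server("127.0.0.1", 8000, app).serve_forever()` should work, after which any unknown path can be requested and its headers inspected.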
To be clear, a redirect involves the browser making a new HTTP request for a different URL to the one it originally requested.
You should be looking at the HTTP status code in the HTTP header to see what is really going on.
Never rely on the fact that a user might see www.domain.com/404.html in their browser URL bar after making a request for a URL that does not exist.
In many cases, that "404" page will have returned a 200 OK status code in the HTTP Header, and you would have got there through the server previously issuing a 302 redirect when the URL you requested wasn't found.
In that case you do not have a 404 error page, you have a system for getting your "error" (sic) page indexed under an infinite number of URLs.
Only if the requested URL returns a 404 status code in the HTTP Header have you truly got yourself a proper 404 error page. What you put on that page for the human visitor to read is entirely up to you, but the bot only looks as far as the HTTP STATUS: 404 line in the HTTP Header to find out what is going on.
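The distinction above can be written down as a small rule of thumb. The following is only a sketch: the function name and the verdict strings are invented for illustration, and the input is assumed to be the list of status codes seen, in order, when a URL that does not exist is requested.

```python
# Status codes that tell the client to make a new request elsewhere.
REDIRECTS = {301, 302, 303, 307, 308}


def classify_missing_url_response(statuses):
    """Judge how a server handled a request for a URL that does not exist.

    `statuses` is the sequence of HTTP status codes observed, in order,
    for example [302, 200] for a redirect to an "error" page served as OK.
    """
    if not statuses:
        raise ValueError("expected at least one status code")
    first, last = statuses[0], statuses[-1]
    if first == 404:
        # The very first response says "not found": this is what bots need.
        return "proper 404"
    if first in REDIRECTS:
        if last == 200:
            # The "error" page is indexable under every bad URL.
            return "soft 404 via redirect"
        return "redirect before error"
    if first == 200:
        return "soft 404"
    return "other"
```

For the responses quoted at the top of this thread, `classify_missing_url_response([301, 200])` falls into the "soft 404 via redirect" case, while the fixed server's `[404]` is a "proper 404".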
[edited by: g1smd at 1:52 pm (utc) on Sep. 12, 2008]
#1 Server Response: [example...]
HTTP Status Code: HTTP/1.1 404 Not Found
Date: Fri, 12 Sep 2008 14:33:07 GMT
Server: Apache/2.2.8 (Fedora)
Accept-Ranges: bytes
Content-Length: 4200
Connection: close
Content-Type: text/html
I never like to hear the words "directed to" when we are talking about 404 errors, as to me that implies that there is an extra step happening between the request for some URL, and the response with the error message being displayed. There is no such extra step, unless the browser is being redirected, and such a redirection is unnecessary, unwanted, and will cause problems.
Can you confirm (using Live HTTP Headers or somesuch) that the very first thing that comes back from the server after your request is sent is an HTTP header that includes these words or something similar: HTTP Status: 404 Not Found?
Sorry to labour the point, but it is very misunderstood, and I am writing the extra detail for anyone else that reads this thread way into the future...
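For anyone without a header-viewer extension, the same check can be done with a few lines of Python. This is a rough sketch, not a polished tool: `first_response` is a name made up here, and it relies on the fact that the standard library's `http.client` does not follow redirects, so what comes back is the very first response from the server.

```python
import http.client
from urllib.parse import urlsplit


def first_response(url, timeout=10):
    """Return (status, reason, location) for the first response to `url`.

    Redirects are NOT followed, so a 301 or 302 shows up as such
    instead of being hidden behind the final page.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        # Location is only present when the server is redirecting.
        return resp.status, resp.reason, resp.getheader("Location")
    finally:
        conn.close()


# Example call (the URL is a placeholder for a page that does not exist):
# status, reason, location = first_response("http://www.example.com/no-such-page.html")
# print(status, reason, location)
```

If the first status is anything other than 404 for a URL that really does not exist, or if a Location header is present, the server is doing the extra step described above.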
[edited by: g1smd at 2:11 pm (utc) on Sep. 12, 2008]
One common error when using Apache is to use a full URL including the domain name to define the 404 error page (like www.domain.com/error404.html or somesuch).
If you do that, the server will send a 302 status code whenever a page is not found on the server. That configuration error will cause you a LOT of problems. This behavior is highlighted in the Apache documentation, but widely overlooked or ignored.
The correct implementation specifies only the local filepath starting with a / and counting from the root of the current domain (like ErrorDocument 404 /errors/error404.html or somesuch).
Can I clarify one further thing: When the visitor is presented with the custom 404 page, should the URL in the browser change from http://www.example.com/example-page.html to http://www.example.com/404page.html, or should the original (unavailable) URL be displayed?
[edited by: tedster at 2:40 am (utc) on Sep. 15, 2008]
As noted above by g1smd, a simple error (or misunderstanding) when defining your ErrorDocument can cause the server to generate a 302 redirect rather than the correct error response code. This behavior is documented in the Apache ErrorDocument documentation.
Wrong: ErrorDocument 404 http://www.example.com/path-to-404-error-document.html
Right: ErrorDocument 404 /path-to-404-error-document.html
The first ErrorDocument directive above, which includes "http://www.example.com," will result in a 302 redirect response to the client. This is the single most common cause of search engines seeing problems with server error response codes on Apache servers.
Jim
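As a quick way to spot the "Wrong" form in an existing .htaccess or httpd.conf, a small script along these lines may help. It is a sketch only: the function name is made up, and it assumes one directive per line. It flags any ErrorDocument whose target starts with a scheme such as http://, since that is the form that makes Apache send a redirect.

```python
import re

# Matches lines like: ErrorDocument 404 /errors/error404.html
_DIRECTIVE = re.compile(r"^\s*ErrorDocument\s+(\d{3})\s+(\S.*?)\s*$", re.IGNORECASE)

# Matches a leading scheme such as http:// or https://
_HAS_SCHEME = re.compile(r"^[a-z][a-z0-9+.-]*://", re.IGNORECASE)


def find_redirecting_error_documents(config_text):
    """Return (line_number, status_code, target) for each ErrorDocument
    directive that points at a full URL instead of a local path."""
    problems = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        match = _DIRECTIVE.match(line)
        if not match:
            continue  # comments and other directives are ignored
        code = int(match.group(1))
        target = match.group(2)
        # A target may be quoted, so ignore a leading quote when testing.
        if _HAS_SCHEME.match(target.lstrip('"')):
            problems.append((lineno, code, target))
    return problems
```

Run against the two directives shown above, only the first (the one containing http://www.example.com) would be reported; the local-path form passes.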