Forum Moderators: Robert Charlton & goodroi
Although any server can be misconfigured to return an incorrect http status in the server header, Microsoft IIS has a particular liability here. In a nutshell:
The problem here - the server never told the user agent that the status for the original url was supposed to be 404! Instead, it just sent a 302 temporary redirect. Even if the on-page message says "404", that does not mean anything to a crawler, googlebot included.
Under this incredibly common approach to a custom error page, any bad url can be indexed according to the standard handling of an internal 302 redirect:
a) the content of the redirect's target url is indexed
b) with the original url as the location
So the bad urls can start piling up as duplicate urls for the same exact content.
The Microsoft IIS server and .NET platform are particularly vulnerable to this problem. Although the default error handling does return a correct 404 status code, the problem is with the way custom error messages are often set up.
Google does try to catch this problem with test spidering. That's one reason people see googlebot requesting strange urls. If googlebot notices this error handling problem, Google generates a notice in the Webmaster Tools account. Then the webmaster can't even validate the site or access reports until it gets fixed.
However, it is not wise to hand over this responsibility to Google. Get it fixed, so a 404 actually returns a 404 http status code.
By the way, there are incorrect instructions all over the place about setting up custom redirects this way on IIS. Don't believe them, not even when they occur in hard cover books with the MS seal of approval on them, not even when they occur on Microsoft's own forums, not even when they occur on blog posts of otherwise very savvy people. If the server header isn't 404, then the potential for trouble is there.
Because the details of a fix can depend on many factors, including the version of IIS, please take up the technical how-to questions in our Microsoft IIS Server forum [webmasterworld.com]. You can use Site Search [webmasterworld.com] to find many related threads that are already published.
[edited by: tedster at 5:42 pm (utc) on April 14, 2008]
I felt it was important to focus on this issue in a dedicated thread - it's coming up much too often, both in threads here and in the sites of new clients that I evaluate.
The problem is by no means confined to IIS, either. For example, I see it on Apache servers running Tomcat for .jsp pages.
[edited by: tedster at 5:40 pm (utc) on April 14, 2008]
A troubling factor is that the web has become so easy to use, you can be a publisher of thousands of websites without understanding the machinery that makes it work. Status headers are like webmaster 101. But many publishers are driving without their WebmasterWorld diploma!
It behooves every good webmaster to make sure their headers are engineered to properly represent the content being delivered. Serving sites with muddled status headers is like driving with the parking brake on. Yeah, once in a while you'll do it. By accident. What I fear is the driver who doesn't know what that lever does, or why their car is making that squealing noise and burning smell.
WebmasterWorld offers this tool in the Control Panel:
[webmasterworld.com...]
You're correct: the problem comes in when IIS is set up for a custom 404 page. The header doesn't throw a 404 but a 302 to a 200, so all that needs to be added is the code below at the top of the page.
It works fine on all urls that are not on the server.
<%
' Send a real 404 status with the custom error page instead of a 200
Response.Status = "404 Not Found"
%>
Make the 404 page an ASP page and the problem is fixed.
Thanks for bringing this up. I got a good idea from a fellow member about adding search to the 404 page, so I will do that now.
I won't give details as to how I set it up, but tedster is correct. Follow the instructions for your IIS, point it at the custom 404.asp page, add the code to the top of the page, and check by entering a bad URL.
If the 404 page displays, then check the same url in a header checker to make sure the status code throws a 404 and nothing but a 404.
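For anyone who would rather script that header check, here is a rough sketch in Python. It is not from the original posts: `fetch_chain` and `verdict` are made-up names, the hop limit and timeout are arbitrary, and the verdict labels only reflect the opinions in this thread, not any official rule. It requests a URL without auto-following redirects, so every status in the chain is visible.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECTS = {301, 302, 303, 307, 308}


def fetch_chain(url, max_hops=5):
    """Request url without auto-following redirects.

    Returns a list of (status, url) pairs, one per hop.
    """
    chain = []
    for _ in range(max_hops):
        parts = urlsplit(url)
        if parts.scheme == "https":
            conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
        else:
            conn = http.client.HTTPConnection(parts.netloc, timeout=10)
        try:
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            # GET rather than HEAD, since that is what a crawler sends.
            conn.request("GET", path)
            resp = conn.getresponse()
            status = resp.status
            location = resp.getheader("Location")
        finally:
            conn.close()
        chain.append((status, url))
        if status in REDIRECTS and location:
            # Location may be relative, so resolve it against the current url.
            url = urljoin(url, location)
        else:
            break
    return chain


def verdict(statuses):
    """Label a chain of status codes for a URL that is known NOT to exist."""
    if not statuses:
        return "no response"
    final, hops = statuses[-1], statuses[:-1]
    if final == 404 and not hops:
        return "ok: direct 404"
    if final == 404 and all(s == 301 for s in hops):
        return "ok: 301 then 404"
    if final == 200:
        return "soft 404: bad URL ends in 200"
    return "check manually"
```

To use it, call `fetch_chain` with a deliberately bad URL on your own site and pass the status codes to `verdict`, for example `verdict([s for s, _ in fetch_chain(bad_url)])`. A chain of `[302, 200]` is the problem case this thread is about, while `[301, 404]` (the canonical www redirect followed by a real 404) is labeled as acceptable.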
It does this redirection before WordPress figures out whether the page exists or not.
thus this URL:
http://www.example.com/blog/yayayayaya.html
returns a 404 error.
this one:
http://example.com/blog/yayayayaya.html
returns 301, pointing Location to the URL above with the "www", which then returns a 404.
Perfect - no. But sufficient?
As someone who claims to be an expert on handling bad requests, I'm embarrassed not to know whether a chain of 301s ending with a 404 is as good as serving a 404 from the get-go. I have always assumed so.
I doubt I'll be able to sleep soundly until I've cracked open the HTTP spec to confirm this...
y'know, a few years ago, I had problems with IIS and ISAPIRewrite (I forget which version) not doing proper 301's - they were all 302, despite flagging rules with [RP]. It was a real problem. I wonder if it still is?
The oddest status code I've ever dealt with was "999", an unorthodox code returned by certain Yahoo services when you overflow your bandwidth limits. If the HTTP spec doesn't provide an adequate code, are we allowed to make up our own?
I am fine with the 301 header request, since it comes from the non-www with the 404 as the ending page, and I will be able to sleep tonight.
Hey you better lay off the Dr Peppers got that finger fired up today.:)
tedster fingers the 302 as the culprit in these status header crimes... what about the venerable 301?
The indexing problems do seem to be specifically tied to the 302 redirect - that's because the parent url still gets indexed with the target url's content. A 301 redirect does not cause Google to add the originating url to the index, so no such issue ever seems to show up.
In most IIS operations, 302 is the default and you need to check an extra box to make the redirect "permanent". The interface often just calls a 302 status "a redirect" and that lack of clarity is a source of trouble. Even microsoft.com had this issue in several areas, and finally added some VBScript to fix it.
As I mentioned earlier, Google has been aware of this problem for a while and has actively worked to address it. My concern is that even though the bogus urls may no longer show up in the reporting functions, such as the site: operator, they still may be mucking things up in hidden ways.
In short, I no longer trust the site: operator to be accurate as much as I used to, and I much prefer knowing that a true 404 status is actually returned for anything that isn't there.
The canonical 301 redirect (no-www to with-www or vice versa) is what Google recommends, and I've never seen a problem from going from a 301 to a 404. Since I've been having all my clients set up the canonical fix this way for years, both on IIS and Apache, I feel safe with it.
[edited by: tedster at 3:40 am (utc) on April 15, 2008]
However, with that said, I have witnessed Google, Yahoo and MSN hit my sites with a few random junk page names, just to see how the site responds to 404s and whether it redirects to a custom 404 page that doesn't return the proper code.
Therefore, I would assume they'll eventually figure out the mis-configured 404 page but it may take quite a bit of time and there's really no excuse for not sending the proper error code in the first place.
TIP: You can make a script to display your friendly 404 page that puts a 404 response code in the HTTP header which works 100% and doesn't depend on any server configuration whatsoever.
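As one illustration of that tip, here is a minimal sketch written as a Python WSGI application. This is my own assumption about how such a script might look, not incrediBILL's actual code: `KNOWN_PAGES`, `FRIENDLY_404` and `app` are hypothetical names, and a real site would look pages up on disk or in a database. The point is only that the friendly page and the 404 status go out in the same response, with no redirect involved.

```python
# Hypothetical site content; a real site would check the filesystem or a database.
KNOWN_PAGES = {"/": b"<h1>Home</h1>"}

# The friendly error page shown to visitors.
FRIENDLY_404 = b"<h1>Page not found</h1><p>Try the site search.</p>"


def app(environ, start_response):
    """WSGI app: serve known pages with 200, everything else with a real 404."""
    path = environ.get("PATH_INFO", "/")
    body = KNOWN_PAGES.get(path)
    if body is not None:
        status = "200 OK"
    else:
        # Same response carries both the 404 status and the friendly page.
        status = "404 Not Found"
        body = FRIENDLY_404
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

For a quick local trial this can be served with the standard library's `wsgiref.simple_server.make_server`, then checked with any header checker by requesting a page name that does not exist.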
[edited by: incrediBILL at 10:02 pm (utc) on April 14, 2008]
ErrorDocument 404 http://www.example.com/error404.html
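This is the documented behavior [httpd.apache.org]: when a full ErrorDocument URL is specified as above, Apache sends the client a redirect to that URL, so the original request never returns a 404 status. The correct syntax needed to avoid this problem uses only a local URL-path: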
ErrorDocument 404 /error404.html
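If you have many config or .htaccess files to review, a small script can flag the risky form. The following Python sketch is only an illustration under simple assumptions: `risky_error_documents` is a made-up helper name, and the pattern handles plain one-line directives like the two shown here, not every syntax Apache accepts.

```python
import re

# Matches lines such as: ErrorDocument 404 /error404.html
ERRDOC = re.compile(r"^\s*ErrorDocument\s+(\d{3})\s+(\S+)", re.IGNORECASE)


def risky_error_documents(config_text):
    """Return (line_number, status_code, target) for every ErrorDocument
    directive whose target is a full http(s) URL rather than a local path."""
    found = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        match = ERRDOC.match(line)
        if match and re.match(r"https?://", match.group(2), re.IGNORECASE):
            found.append((number, int(match.group(1)), match.group(2)))
    return found
```

Pass it the text of a config file; any tuple it returns points at a line worth changing to the local URL-path form.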
I set up my 404 by going into IIS for the domain, selecting the 404 page, changing it to point to the 404 page name, and checking "permanent".
But when I checked, the header was a 302 to the 404.
I had to do a bunch of looking to find the fix, and it was adding the above script to return the correct header.
My server is using IIS 6.
I am not sure you can set up a good 404 from IIS. I may be wrong, but mine was set up by the book and it didn't work correctly until I added the extra code to fix it.
A 301 to 200 custom error handling would only mean that the custom error page url gets indexed once. Add a 404 script to that error page, and then it does not even get indexed once - that's what you want.
I am not sure you can set up a good 404 from IIS. I may be wrong, but mine was set up by the book and it didn't work correctly until I added the extra code to fix it.
All my IIS sites are running .NET, and they let the IIS errors remain default, which is fine. I suspect that you're right, bwnbwn. You need to manually add a script to the custom error page to get the 404 status.