Display URI's in the SERPs

Google vs Yahoo! vs MSN

pageoneresults

4:52 pm on Jan 13, 2007 (gmt 0)

Google
http://www.example.com/sub/

Yahoo!
http://www.example.com/sub

MSN Live
http://www.example.com/sub

Anyone see a problem with the above?

inbound

5:34 pm on Jan 13, 2007 (gmt 0)

We see this all the time but, as P1R states in the title, it's the display URL rather than the click URL.

It's a problem that means we always employ rewrite rules, just to be on the safe side. We prefer the trailing slash to be the correct URL for the page, with a permanent redirect on the other.
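A minimal sketch of that rule in Python, for illustration only. The `DIRECTORIES` set and the `canonical_response` helper are hypothetical names, not anything from the post, and on a real site this logic would normally live in the server's rewrite configuration rather than in application code:

```python
# Hypothetical set of directory-style paths on the site (illustration only).
DIRECTORIES = {"/", "/sub/"}


def canonical_response(path):
    """Return (status, location) for a request path.

    The trailing-slash form is treated as the one correct URL; the
    slashless form gets a permanent redirect to it.
    """
    if path in DIRECTORIES:
        return 200, None
    if path + "/" in DIRECTORIES:
        # Slashless request for a known directory: 301 to the slash form.
        return 301, path + "/"
    return 404, None
```

Calling `canonical_response("/sub")` returns the redirect to `/sub/`, while `canonical_response("/sub/")` is served directly.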

It's annoying that search engines should choose to display your URL in a way that can cause issues.

I guess they do it for readability - not expecting people to type it in (and possibly get an error on a system that's not expecting it).

pageoneresults

7:01 pm on Jan 13, 2007 (gmt 0)

It's annoying that search engines should choose to display your URL in a way that can cause issues.

Let's discuss those issues further. What could happen due to this URI Display issue? Think scrapers, cut and paste, etc.

inbound

9:05 pm on Jan 13, 2007 (gmt 0)

Straight away you can say that scrapers could take the wrong URL and end up creating duplicate content for domains that serve both versions as the same page.

Let's remember that this discussion should probably be centred on the skills of the average site owner (let's use site owner rather than webmaster as the average person who publishes on the web is not the type of person that frequents WebmasterWorld).

The decision by BIG companies to use a system that could easily break URLs is scandalous.

The web should be a place where knowledge of a niche should be enough to rank well, not that in combination with knowledge of how to set up a server.

I know that many people here, including myself, profit from people being unaware of how to rank well. But we are talking about a basic error in the SEs' approach that WILL cause issues.

Robert Charlton

10:41 pm on Jan 13, 2007 (gmt 0)

It's a problem that means we always employ rewrite rules, just to be on the safe side. We prefer the trailing slash to be the correct URL for the page, with a permanent redirect on the other.

Just something I've wanted to note for a while...

[webmasterworld.com...]
...takes you to the WebmasterWorld Google forum.

[webmasterworld.com...]
...gives you a 404.

Straight away you can say that scrapers could take the wrong URL and end up creating duplicate content for domains that serve both versions as the same page.

I remember that I used to see a lot of URIs in the SERPs that had spaces in them that weren't in the originals. I haven't noticed these for a while and can't come up with an example now.

Assuming the engines have thought about how they're returning their URI displays, why are they doing it the way they are? And, for that matter, why isn't WebmasterWorld redirecting its forum URIs?

pageoneresults

4:18 pm on Jan 14, 2007 (gmt 0)

I've come across rewrite implementations where the trailing forward slash was not taken into consideration. I've noticed on Windows that when you implement a rewrite routine using the httpd.ini file, you break the Windows default routine of automatically appending a trailing forward slash to URIs that point to a root level page. I caught on to this years ago myself.

What some developers fail to do is hack the URI and make sure that the proper response is returned as you trim characters off the end of it. You just back your way up from the final destination.

1. http://www.example.com/sub/
2. http://www.example.com/sub
3. http://www.example.com/su
4. http://www.example.com/s

Number 1 serves a 200.
Number 2 serves a 301 to Number 1.
Number 3 serves a 404.
Number 4 serves a 404.
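The backing-up procedure above could be sketched in Python roughly as follows. The names `back_up`, `EXPECTED`, and `audit` are made up for this illustration, and `fetch_status` stands for whatever you use to request a URI and read its status code without following redirects:

```python
from urllib.parse import urlsplit, urlunsplit


def back_up(uri, steps):
    """Yield the URI, then copies with 1..steps characters cut from the end of the path."""
    parts = urlsplit(uri)
    for cut in range(steps + 1):
        path = parts.path[: len(parts.path) - cut]
        yield urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


# Expected responses, taken from the numbered list in the post.
EXPECTED = {
    "http://www.example.com/sub/": 200,
    "http://www.example.com/sub": 301,
    "http://www.example.com/su": 404,
    "http://www.example.com/s": 404,
}

candidates = list(back_up("http://www.example.com/sub/", 3))


def audit(fetch_status):
    """Return the URIs whose status differs from the expected one.

    fetch_status is any callable that takes a URI and returns an int status.
    """
    return [uri for uri in candidates if fetch_status(uri) != EXPECTED[uri]]
```

A server that answers 200 for every one of these would be flagged on Numbers 2, 3 and 4.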

I have to be careful with Number 2. If a server is set up to use Content Negotiation, then I may not take Number 2 and 301 it to Number 1. Now that I've gone extensionless, Number 2 is a valid URI.

So, this is where a potential issue comes in. Google has it right. Yahoo! and MSN, I believe, may be causing harm to site owners by displaying incomplete URIs.

Did I wake up with a tin-foil hat on this morning or what?

[edited by: pageoneresults at 9:05 pm (utc) on Jan. 15, 2007]

coopster

4:01 pm on Jan 15, 2007 (gmt 0)

I have to be careful with Number 2. If a server is set up to use Content Negotiation, then I may not take Number 2 and 301 it to Number 1. Now that I've gone extensionless, Number 2 is a valid URI.

Then Content Negotiation has not been set up properly. Content Negotiation, when properly configured, will return a 404 for Number 2. That is, if there really is not a resource at that URI.

And, for that matter, why isn't WebmasterWorld redirecting its forum URIs?

Why? [webmasterworld.com]

mcavic

8:51 pm on Jan 15, 2007 (gmt 0)

The web server should always do a 301 redirect if an ending slash is proper. Apache does this automatically. I think IIS doesn't, but it should, because it's wrong for a missing ending slash to break the url.

Also, a search engine should always obey a redirect by listing the corrected URL. But Yahoo has never been good with redirects anyway, and I wouldn't expect MSN to be much better than Yahoo.

So in summary, Apache and Google have it right, which is what I always expect.

g1smd

10:08 pm on Jan 15, 2007 (gmt 0)

The click URL is the most important, but I do see problems with the "incorrect" "slashless" URL being displayed. I always 301 redirect from URL to URL/ for the domain root and folders.

coopster

10:41 pm on Jan 15, 2007 (gmt 0)

Yes, I'm sorry, I dropped the wrong status code there. It is a 301, not a 404.

<added>
Best that I mention that I'm referring specifically to Apache here, configured with Content Negotiation and redirection as mcavic mentioned.
</added>

macdave

1:41 am on Jan 16, 2007 (gmt 0)

For quite some time (and as recently as a year or so ago) Yahoo was also stripping the trailing slash from the click URL. They don't seem to be doing it anymore, but for a long time it served only to inflate their referrer stats in my logs...

ronburk

9:18 pm on Jan 16, 2007 (gmt 0)

because it's wrong for a missing ending slash to break the url.

This makes it sound like you believe that some specification requires that:

http://www.example.com/resource/

and

http://www.example.com/resource

have to refer to the same resource.

The RFC for URIs [gbiv.com] clearly says that's not a requirement (and specifically says it's not appropriate for a web spider to assume the two refer to the same resource unless it actually gets told that by the given web server).

Since I'm free to decide that, on my web server, http://www.example.com/resource/ is a valid resource and that http://www.example.com/resource is not, I can't see in what sense the word wrong applies. The "missing" slash in this case does not "break the URL", it simply changes the resource specifier into one that does not refer to a valid resource.
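One practical way to see that the two forms are distinct identifiers is to resolve a relative link against each. This is only a small sketch using Python's standard `urllib.parse`; the relative reference `page.html` is a made-up example:

```python
from urllib.parse import urljoin, urlsplit

with_slash = "http://www.example.com/resource/"
without_slash = "http://www.example.com/resource"

# The path components differ, so the two URIs are different identifiers.
paths = (urlsplit(with_slash).path, urlsplit(without_slash).path)

# A relative link resolves differently depending on the trailing slash.
# "page.html" is a hypothetical relative reference.
resolved_with = urljoin(with_slash, "page.html")
resolved_without = urljoin(without_slash, "page.html")

print(paths)             # ('/resource/', '/resource')
print(resolved_with)     # http://www.example.com/resource/page.html
print(resolved_without)  # http://www.example.com/page.html
```

So even a browser following the generic resolution rules treats the slashless form as a name inside the parent directory, not as the directory itself.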

[edited by: pageoneresults at 9:25 pm (utc) on Jan. 16, 2007]
[edit reason] Examplified URI References [/edit]

pageoneresults

9:50 pm on Jan 16, 2007 (gmt 0)

For quite some time (and as recently as a year or so ago) Yahoo was also stripping the trailing slash from the click URL.

I just did a few searches to see what I could uncover in this instance. I found a few Click URIs that were without a trailing forward slash. Fortunately the server where those sites reside handled it correctly and redirected to the trailing forward slash.

There are way too many sites out there, particularly on Windows servers, that don't handle this correctly, and methinks it could be a potential issue.

mcavic

12:47 am on Jan 17, 2007 (gmt 0)

This makes it sound like you believe ... have to refer to the same resource.

Logically, they should. Why would you want the two to point to different valid content?

it's not appropriate for a web spider to assume the two refer to the same resource

Ideally I agree, except that if the spider didn't assume that, and if the Web server returned success for both urls, the resulting duplicate content would probably amount to at least 20% of the SE's index.

I think it's valid in this case to say that over 99% of users and Webmasters expect "content" to be the same as "content/", and that Web servers and SEs can handle that so as to cause no problems for those 99%.

pageoneresults

12:51 am on Jan 17, 2007 (gmt 0)

www.example.com/sub and www.example.com/sub/ are two distinct URIs. The first is an extensionless page at the root and the second is the root level page of a sub directory.

I think it's valid in this case to say that over 99% of users and Webmasters expect "content" to be the same as "content/",

I hope not as that would be a mistake.

And that Web servers and SEs can handle that so as to cause no problems for those 99%.

Hmmm, well there goes all the work I'm getting ready to do in moving to an extensionless environment, also referred to as Content Negotiation. ;)

mcavic

1:13 am on Jan 17, 2007 (gmt 0)

I can see wanting an extensionless environment. But the problem is that most people, when typing a url, will drop the ending slash.

And even if the slashes are always correct and always obeyed, urls can still be ambiguous if you allow the ending slash to have meaning.

For example, what if you have a url like:
example.com/content/parameter
Where "content" is a script. Then suppose someone wants to pass a blank parameter. What's the correct url?
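To make the ambiguity concrete, here is a rough Python sketch of how such a script might read its parameter. The function name `split_script_path` and the `/content` script path are assumptions for illustration only:

```python
def split_script_path(path, script="/content"):
    """Split a request path into (script, parameter).

    parameter is None when no slash follows the script name, and ""
    when a slash is present but nothing follows it.
    """
    if path == script:
        return script, None
    if path.startswith(script + "/"):
        return script, path[len(script) + 1:]
    raise ValueError("path does not belong to this script")
```

Here `/content/` carries a blank parameter and `/content` carries none, so the ending slash has meaning. If the server 301s `/content` to `/content/`, the script can no longer tell "no parameter" from "blank parameter".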

coopster

8:42 pm on Jan 17, 2007 (gmt 0)

You can still have the extensionless environment, you can still use Content Negotiation. How you handle your links is another matter. I would concur that the average practice is to 301 redirect a resource without a trailing slash as opposed to having a resource for each distinctly. Apache does this on their site and Microsoft does it on their site too. Visit either site and find a resource link that is apparently a directory. Visit the link. Clear the cache in your browser and watch the headers returned when you visit the same exact link minus the trailing slash. It is a 301 redirect to the resource with the trailing slash now appended.
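If you would rather check the headers from a script than from a browser, something along these lines should work. It is a sketch using Python's standard `http.client`, which does not follow redirects on its own; `head_status` is a made-up helper name, and it assumes plain HTTP on the given port:

```python
import http.client


def head_status(host, path, port=80):
    """Send a HEAD request without following redirects.

    Returns (status, Location header or None).
    """
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

For example, compare `head_status("www.example.com", "/sub")` with `head_status("www.example.com", "/sub/")` and see whether the first comes back as a 301 pointing at the second.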

When a resource is requested without a trailing slash, and that resource is a content-negotiated script that would normally return a 200 OK response, you might want to check for that and handle accordingly. I tend to follow the default action of my http server and 301 redirect to the resource with a trailing slash first. If you choose not to, that is your prerogative.

Although the "Why?" link that I referred to earlier discusses case-sensitivity, the same discussion could be applied here with regard to whether the requested resource has a trailing slash or not. If WebmasterWorld decides that they don't need to worry about the trailing slash and return a 404, so be it. If there is a fear that type-in traffic, inbound links, etc. are going to cause some form of loss, then a decision must be made and the server configured accordingly. Either return a 200 OK and a new resource or 301 redirect to the intended location.