Forum Moderators: Robert Charlton & goodroi

removing non-www to www redirect, bad idea?

         

JS_Harris

12:45 am on Mar 12, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm continually seeing redirects in my logs that are really rather pointless. I'd estimate that 99% of the redirecting from non-www to www is done for bots and not for humans. I set up the redirect a long time ago and it's never been perfect; it occasionally does things like 301 to a 404 because of it. I'm starting to think it would have been better to straight 404 the wrong version from the start.

- Google/Bing/Yahoo all index the site properly with www
- 99% of backlinks contain the www
- Virtually all real traffic comes in on the www
- But those bots, always checking and looking for stuff without it first...

It would simplify my logs and my life a good deal to simply remove the redirect and ensure 404 is returned. Is there really a valid reason given the above to keep the redirect going? If not, should it be kept in place for just the index page? It's no longer about giving my visitors a better experience, this is only affecting bots at this point.

tangor

4:08 am on Mar 12, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Which bots? In most cases bad bots get blocked and get a 404 anyway if you're tracking and nuking them. Your log won't change; you'll still have an entry. If it bothers you (I don't recommend this), just 404 all non-www and you'll only miss 1% (your numbers).

not2easy

5:11 am on Mar 12, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



It sounds like it's the order that is less than perfect. The non-www to www should be after all other rewrite rules:
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]



(Exception is for Wordpress sites - that WP snippet is last if it exists)

lucy24

5:55 am on Mar 12, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



it occasionally does things like 301 to a 404 because of it

How often do robots request nonexistent pages? If you've intentionally removed stuff, you'd be returning the response (whether 410 or 404) manually-- and that should happen independent of the form of the hostname.

Make sure you have a rule something like this at the beginning of all RewriteRules (here assuming Apache):
RewriteRule ^(missing|forbidden)\.html - [L]
listing all your error documents by name, with the appropriate path. (I once forgot this line on a test site, and was baffled at the number of malign agents requesting the 403 page by name. Oops. Some of them still come back periodically and ask for it.) Hostname then doesn't matter, because the request will never go any further.
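
For context, a minimal sketch of how that exemption might sit alongside the ErrorDocument lines. It assumes the two error pages live in the site root under the names used above; adjust the paths if yours differ.

```apache
ErrorDocument 403 /forbidden.html
ErrorDocument 404 /missing.html

RewriteEngine On

# Requests for the error documents themselves stop here, untouched,
# so no later redirect or block can interfere with serving them.
RewriteRule ^(missing|forbidden)\.html - [L]

# ... all other RewriteRules follow ...
```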

I've taken to returning a manual 404 for some types of malign request-- files that genuinely don't exist, but why should the server even waste time looking when it's none of the requester's business in any case? That means they'll never get to a 301, even if they started out requesting ExAmPlE.COM for a www.example.com site, because they'll hit the 404 first. And the 404, unlike a 403, leaves them in doubt about whether I'm onto them ;)
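
A manual 404 of that kind might look like the sketch below. The file names are hypothetical stand-ins for probe targets that do not exist on the site; the rule would sit near the top, ahead of any redirect.

```apache
RewriteEngine On

# Hypothetical probe targets that do not exist on this site.
# With a non-3xx status in the R flag, the substitution is dropped
# and rewriting stops, so no later 301 is ever reached.
RewriteRule ^(phpmyadmin|setup\.php|admin\.php) - [R=404,L]
```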

In general, 301-to-404 means you've done something wrong.

Now, just because this week the two or three or five biggest search engines respect your express wish with regard to domain name, that doesn't necessarily mean you can rely on them to take the same policy forever. If you continue redirecting, you can be absolutely certain everyone gets it right. And what about human links that got your name wrong? Don't you want everyone to be on the same page?

Edit:
The non-www to www should be after all other rewrite rules:

It should be after all your external redirects. Internal rewrites-- including CMS business-- always come last.
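
A rough sketch of that ordering in a root .htaccess, assuming Apache with mod_rewrite and a WordPress install in the document root. The old-page and new-page names are made up for illustration; the last block is the stock snippet that WordPress generates.

```apache
RewriteEngine On

# 1. External redirects for specific URLs come first.
#    (old-page and new-page are hypothetical names.)
RewriteRule ^old-page/?$ http://www.example.com/new-page [R=301,L]

# 2. Hostname canonicalization is the last external redirect.
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

# 3. Internal rewrites, including the WordPress block, come last.
# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress
```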

JS_Harris

10:20 am on Mar 12, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The site has...

code to redirect any uri containing index.php
code to redirect any url missing the www
code to remove any url ending in .html
code to 404 all valid URLs where numbers are attached, ie: example.com/legit-page/1231231 should be example.com/legit-page
There are other redirect rules to take care of various wordpress created issues like best guessing URLs. ie: /some-pag redirecting to /some-page or /some-page/234234234 or /some-page.. etc.

Too many redirects is leading to redirect chains. If I delete a page the result is a redirect chain leading to a 404, and it's getting old.

I'm considering returning 404 for any and all URLs that are not wanted. This is in part to prepare for a switch to static on this domain to remove wordpress completely. When I do that it will be easier to simply not add re-write rules to take care of problems that no longer exist, they'll be 404 already without wordpress redirecting stuff.

I'm wondering if it would be bad to leave out the non-www to www. Would it be a problem if search engines look for example.com and get a 404 even though www.example.com exists ?

not2easy

3:36 pm on Mar 12, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If you can reach the same page with multiple URLs, that is not ideal; it is duplicate content. If existing pages return a 404, they can fall from the index.

I once wondered why one new topic folder that started out so well just sort of slacked off to nothing. In a routine check I saw that that folder did not have the www rewrite which is not inherited from root. I fixed that and suddenly within about 10 days, it became quite active again. Coincidence? It is one of the things I check first now when performance is unexpectedly poor.
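
The underlying behavior, as far as I know, is that a subfolder .htaccess with its own rewrite directives replaces the parent's rules rather than extending them, unless inheritance is switched on. A minimal sketch of a subfolder file that repeats the hostname redirect, assuming a hypothetical /topic-folder/ directory:

```apache
# /topic-folder/.htaccess  (folder name is hypothetical)
RewriteEngine On

# Parent rules are not inherited by default, so repeat the redirect here.
# REQUEST_URI holds the full path, including /topic-folder/.
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule ^ http://www.example.com%{REQUEST_URI} [R=301,L]

# ... folder-specific rules follow ...
```

An alternative is RewriteOptions Inherit in the subfolder file, which pulls in the parent's rules; whether they still behave as intended depends on how their patterns were written.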

WP rewrites can be tricky and are best handled from within Wordpress (imo) to prevent chains and unintended results. If you would like to have your htaccess file examined for best practices, the Apache Forum here: [webmasterworld.com...] has excellent help.

lucy24

7:48 pm on Mar 12, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Too many redirects is leading to redirect chains.

Well, that's not www's fault. It may be time to overhaul the order of your RewriteRules; everything can and should happen in a single step.

I'm considering returning 404 for any and all URLs that are not wanted.

A 410 is better, if you're talking about URLs that used to exist and no longer do. You can show humans the same ErrorDocument that you use for 404 if you like, but 410 makes the googlebot go away faster.
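
A small sketch of that setup, assuming the 404 page is /missing.html and using defunct-page as a placeholder for a URL that has been removed:

```apache
# Show humans the same page for both statuses (page name is an assumption).
ErrorDocument 404 /missing.html
ErrorDocument 410 /missing.html

RewriteEngine On

# [G] returns 410 Gone; no substitution is needed, hence the dash.
RewriteRule ^defunct-page/?$ - [G]
```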

code to redirect any uri containing index.php
code to redirect any url missing the www
code to remove any url ending in .html
code to 404 all valid URLs where numbers are attached, ie: example.com/legit-page/1231231 should be example.com/legit-page

If the legit-page URL is deducible from the request, why the 404? That's a simple redirect. You don't need to test whether /legit-page itself is a valid URL, because if someone asks for a wholly bogus page, they deserve a 301-to-404 sequence (the only time this would happen). If there are specific URLs that you've deleted, omit the closing anchor from the rule, so the same 410 is returned whether or not there's trailing doodah with numbers.

The optimal order is:

/defunct-page(/123123)? >> 410
/legit-page/1231231 >> www.example.com/legit-page
/blahblah/index.php >> www.example.com/blahblah/
/blahblah.html >> www.example.com/blahblah
example.com/blahblah >> www.example.com/blahblah

Each redirect target includes the full protocol-plus-domain, so the only time the final www redirect will deploy is when the request was correct in all other ways. Some rules will require one or more RewriteCond and/or tweaking of the pattern. But that's for the apache subforum.
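
As a starting point only, the order above could be sketched like this for a root .htaccess. The page names are the placeholders from the list, the error-document names are assumed, and the numeric-suffix rule assumes no legitimate URL ends in an all-digit segment (date archives such as /2016/03 would need an exclusion). The THE_REQUEST conditions limit those rules to what the client literally asked for, so internal rewrites to index.php and DirectoryIndex lookups are not redirected.

```apache
RewriteEngine On

# 0. Error documents pass straight through (names are assumptions).
RewriteRule ^(missing|forbidden)\.html - [L]

# 1. Deleted pages: 410, with or without trailing doodah.
RewriteRule ^defunct-page(/|$) - [G]

# 2. Numbers appended to a page URL: redirect to the bare page.
RewriteRule ^([^/.]+)/\d+/?$ http://www.example.com/$1 [R=301,L]

# 3. Explicit index.php requests: redirect to the directory URL.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/([^?\s]*/)?index\.php[?\s]
RewriteRule ^(.*/)?index\.php$ http://www.example.com/$1 [R=301,L]

# 4. Explicit .html requests: redirect to the extensionless URL.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/[^?\s]+\.html[?\s]
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]

# 5. Wrong hostname, request otherwise correct.
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

Because every target already carries http://www.example.com, a request that is wrong in two ways (say, example.com/blahblah.html) is fixed by the first matching rule in one hop and never reaches rule 5.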

Incidentally, you should also redirect requests for /index.html even if html doesn't exist on your site. Search engines will ask for them as a matter of routine, under the general head of Entrapment, and the server shouldn't have to waste time looking for them.
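
One way to write that, as a sketch: it would go ahead of any generic .html rule, and the condition keeps it from firing on the server's own DirectoryIndex lookups.

```apache
RewriteEngine On

# Only act when the client itself asked for index.html or index.htm.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/([^?\s]*/)?index\.html?[?\s]
RewriteRule ^(.*/)?index\.html?$ http://www.example.com/$1 [R=301,L]
```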

keyplyr

10:43 am on Mar 13, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@JS_Harris - you may want to contact your server admin to see if they offer an account setting that forwards. That way you could remove all the 301s from htaccess.

JS_Harris

4:06 pm on Mar 13, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I did, they don't.

Thanks for that Lucy but this is wordpress. Simply adding a period after a URL is enough to create a new page with duplicate content. The htaccess file would be absolutely huge to work out all real URIs from all of the wordpress mistakes.

The latest version of wordpress not only has a ton of less than ideal redirects happening (/sample-page resolves with /sam....ge !?!) but it also has a LOT of header junk most webmasters do not want or need. I'm removing wordpress from sites with excellent results lately (speed and this other stuff).

To be honest I do not miss wordpress's issues that have existed for many years without being fixed (duplicate URLs appended with replytocom etc).

I swear that blackhat SEOs love seeing competitors using wordpress lately; they know they can knock off top ranked pages that are wordpress based with a little negative SEO to other URLs that generate the same content. Canonical catches only a fraction of the possible URLs that resolve to the same content. Heck, add a number to the end of any url (i.e. /sample-page/12345) and the canonical built into wordpress makes that the canonical url. ?!?

My proposed fix is to stop the behavior of redirecting willy nilly completely and let it all 404 moving forward. Again, my biggest concern is doing this with the index page. Is there a problem with www.example.com working and example.com having a 404 ? It seems like it may be problematic on the base URL.

not2easy

5:07 pm on Mar 13, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



To iron out duplicate content in WordPress it can be as simple as using a good plugin. I have long recommended Yoast's SEO plugin because it is fairly simple to use. I would never try to deal with WP URL issues in htaccess.

Google understands that WP has various ways to find content. The key is to choose which URLs you want in the index and noindex the rest of the variations. Only serve the sitemaps you want indexed (also part of Yoast's plugin). Use the Settings to achieve the URL syntax/structure you want because whatever you set in there is what WP will create. If you have URLs like example.com/legit-page/1231231 that should be example.com/legit-page it is because that is what is in the Settings. You can't control the output in htaccess if the Settings are configured to use example.com/legit-page/1231231

Fix the Settings, then use a plugin to noindex the /category/legit-page/, /archives/legit-page/, /tags/legit-page/ and other versions of the same URLs. If you have entered the www version in the WP settings, you only need it in your htaccess file if/when you have other content in addition to WP, assuming that WP is installed in the root/public directory. Just in case some of those 301s were added to the WP htaccess snippet: the piece of htaccess code generated by installing WP should never be edited to correct problems configured in your Settings.

lucy24

12:12 am on Mar 14, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



not2easy, does there exist a WP plugin that handles domain-name canonicalization?

If not, I (belatedly) understand the problem: whether this redirect is handled by your host or in your own htaccess doesn't matter; either way it will happen before the WP business, and therefore it will happen twice if the request ends up being anything other than a 200. :(

If you have any pages that exist as physical files-- say, a directory of archived material that you don't build on the fly-- you could make a rule listing them by name:
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(real-name|other-real-name|third-real-name) http://www.example.com/$1 [R=301,L]

And then for non-page files-- which should never be allowed as far as the WP envelope anyway, unless you've got something unusual going on-- you can say
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteCond %{REQUEST_URI} ^/(.+)
RewriteRule \.(css|js|png|gif|jpg) http://www.example.com/%1 [R=301,L]
But how often do people request images by the wrong hostname? Do search engines make a habit of it?

Officially it should be (www\.example\.com)? in option-parentheses both places, but I don't believe there exist legitimate visitors who don't send the Host: header. I've never even seen an illegitimate one fail to do so.

not2easy

1:00 am on Mar 14, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



It isn't a plugin, the domain URL is part of the configuration settings. If you configure the WP install with settings for www.example.com, that is where it will claim to be installed, even if it is installed in some other directory (such as www.example.com/blog/).

An example: If you add https: to a site with WP and don't change the Settings file to show https: it will go to a 500 after attempting to find where it is configured to be while the htaccess tries to use https. The settings determine the URL syntax, you can't force it with rewrites if the settings say something different because those /legit-page/ URLs don't exist.

The WordPress Codex has complete details with full explanations for all settings. There is a screenshot of the Admin Settings panel here: [codex.wordpress.org...] and it explains each setting in detail.

The URL syntax is entered in the Permalinks panel which is shown there after other settings to handle media, discussions and users.

Ralph_Slate

4:15 pm on Mar 15, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



but 410 makes the googlebot go away faster.


I set a number of pages to 410 in 2013. I still see these pages in my WMT 404 screen even though the only links to them are from a page that is also 410'd (I had a "calendar" issue with a lot of incorrect dynamic pages being "created" inadvertently via bad recursive date links).

JS_Harris

12:03 am on Mar 16, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I see pages that haven't existed since 2006 appearing in WMT. I went ahead and let all non-www pages return 404. Same with any request with index.php in the mix or ending in .html.

Zero redirects, life is good (and bots are bouncing already).
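
For anyone wanting to try the same thing, a rough sketch of what such a setup might look like. This is not my exact file; it assumes a static site (nothing internally rewritten to index.php) and an error page named /missing.html.

```apache
ErrorDocument 404 /missing.html

RewriteEngine On

# Let the error page itself through (page name is an assumption).
RewriteRule ^missing\.html$ - [L]

# Any named host other than www.example.com gets a 404.
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule ^ - [R=404,L]

# Client requests containing index.php in the path, or ending in .html, get a 404.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/[^?\s]*(index\.php|\.html[?\s])
RewriteRule ^ - [R=404,L]
```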