I have a large e-commerce site built on PHP Codeigniter framework. I have the following in my current .htaccess file:
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php/$1 [L]
#several lines of standard 301 redirects
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www.example.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R,L]
There are some more issues I would like to resolve with .htaccess. Unfortunately, some of it becomes tricky because Codeigniter has its own set of routing rules, and mod_rewrite has some quirks. To start with:
1.) If you exclude "www." from the request, it redirects to the domain followed by index.php, which I do not want. For example:
- If you request example.com/category/gifts
You get redirected to: http://www.example.com/index.php/category/gifts
How can I get it to redirect to:
http://www.example.com/category/gifts ?
2.) What about enforcing lower case?
3.) Is it important to force a trailing slash at the end of the URL?
4.) Is there a standard set of htaccess rules anywhere that helps to prevent duplicate canonical url problems, e.g. when the same page can be shown with differences in the url?
Any help would be greatly appreciated!
2) Enforcing lowercase is the server default behaviour. More accurately, the requested URL-path must exactly match the casing of the filename, otherwise you'll get a 404 error. You can 'correct' casing errors, but it is slow and *very* inefficient in .htaccess, since each incorrect-case character must be corrected one-at-a-time.
3) A trailing slash on the end of a URL means that the requested resource is a directory, and not a file or a "page."
4) A standard set of rules is impossible, because of differences in the "correct" URL-set used by each site.
Some common URL errors often corrected using 301 redirects are:
5) Literal periods in regular-expression patterns must be escaped. A period standing alone in a regex pattern is a token meaning "one of any character." Therefore, a pattern of ^www.example.com$ will match "wwwxexample.com", "www3example.com", and many others -- probably not what you want.
6) You should remove the [NC] from your domain canonicalization rule, so that it will not "approve" incorrectly-cased domains, and will instead redirect them.
7) If your domain has a unique (non-shared) IP address, you should add a RewriteCond to your hostname canonicalization rule so that it will accept a blank HTTP_HOST header. These are sent by true HTTP/1.0 clients, and though rare, should be accommodated. If you do not accommodate them, the result will be an infinite redirection loop, since HTTP/1.0 clients cannot send the correct HTTP_HOST header. Add this to prevent the problem:
RewriteCond %{HTTP_HOST} .
Sites hosted on shared name-based virtual servers need not implement this work-around; name-based virtual hosts are unreachable via HTTP/1.0.
Jim
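Taken together, points 5 to 7 suggest a hostname rule along these lines. This is only a sketch: it assumes www.example.com is the single canonical hostname and that a permanent (301) redirect is wanted.

```apache
RewriteEngine On

# Skip the redirect when no Host header was sent (true HTTP/1.0 clients)
RewriteCond %{HTTP_HOST} .
# Periods escaped and no [NC]: anything but the exact lowercase hostname is redirected
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```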
Another problem is that the search engines will "pick" one URL -- whichever one they perceive as having the most merit based on their algorithm, and this may or may not be the URL that you'd prefer.
I've left the word "penalty" in the title of this thread, since that is what the problem is commonly called. But as stated, duplicate content "penalties" are usually self-inflicted ranking-dilution problems; sloppy terminology, hyperbole, and uninformed fear have promoted a technical problem to dark and scary "penalty" status.
Jim
As far as the case-sensitivity goes, here is one of those situations where the Codeigniter framework comes into play. The following url is actually dynamic, and does not point to a file name:
http://www.example.com/category/gift-baskets
The last segment in the url is actually a unique identifier which is used in a SQL query, so it will provide the same page if it is upper, lower, or mixed case. So, Google shows some sites linking back to:
http://www.example.com/category/Gift-Baskets
The worrisome thing is, Google separately shows the lower case URL, and a different number of sites linking back to the capitalized URL...
I use .htaccess to pre-process URL requests and reject those that do not conform to the correct naming conventions for the site.
Mostly that is in checking query-string parameter names and their allowed range of values, but also includes checking the domain name .co.uk vs. .com, correct "www" sub-domain, and so on.
It would be fairly easy to set up your system so that it simply rejected any URL with an upper-case path or filename within (be aware that domain names are NOT case-sensitive) using a few lines of code in the .htaccess file.
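A minimal sketch of such a rejection rule, assuming a 403 Forbidden response is acceptable and that real files whose names contain capitals (images, for example) should still be served. It would need to sit above the rule that rewrites requests to index.php.

```apache
RewriteEngine On

# Leave requests for existing files alone
RewriteCond %{REQUEST_FILENAME} !-f
# Any upper-case letter in the requested path: refuse the request
RewriteRule [A-Z] - [F]
```

The [G] flag (410 Gone) could be used in place of [F] if that response suits the site better.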
It would be a LOT more difficult to correct those requests to have the correct case using .htaccess, but it would be fairly easy to have this check at the beginning of your script. The script would check the requested URL and issue a redirect for anything wrongly cased. Only correctly-cased requests would be dealt with by the database-facing code. There must be a section of code that evaluates requests for validity and strips out requests that could be security breaches or hacks, so the code for this would be added to that section of the script.
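A different route from the script check, if you control the main server configuration and not just .htaccess, is mod_rewrite's built-in tolower map, which corrects the case in one step. RewriteMap cannot be declared in .htaccess, so this sketch assumes access to httpd.conf or the virtual host block. The map name lc is arbitrary, and the rule assumes every URL on the site, including static files, is meant to be lower case.

```apache
# Server or <VirtualHost> context only; RewriteMap is not allowed in .htaccess
RewriteEngine On
RewriteMap lc int:tolower

# Redirect any path containing an upper-case letter to its lower-case form
RewriteCond %{REQUEST_URI} [A-Z]
RewriteRule ^/?(.*)$ http://www.example.com/${lc:$1} [R=301,L]
```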
RewriteCond %{HTTP_HOST} .
There is a single period in that pattern -- don't omit it.
To clarify, are you saying it should be this?
RewriteCond %{HTTP_HOST} . !^www.example.com$
I can only get the canonical redirect to work without the period.
RewriteCond %{HTTP_HOST} !^www.example.com$
It works, and I am not experiencing any problems. What could be the cause?
Entire rewrite section of my .htaccess. Works fine as posted, but removing the comment causes a server error. I have this on multiple servers, different configs and ISPs.
#RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ /cgi-bin/product-search.cgi [L]
http://www.example.com/index.php/category/gifts
to
http://www.example.com/category/gifts
So that the string "index.php" is removed from the url, but everything before and after it is the same...
The URL in the link defines the URL of the page it links to.
Is the content on the server stored at:
/index.php/category/gifts ?
Do you want your URLs to be indexed like this:
www.example.com/category/gifts ?
If so, then you will need to have:
- links on your pages that point to URLs like:
www.example.com/category/gifts
- a 301 redirect:
from: www.example.com/index.php/category/gifts
to: www.example.com/category/gifts
in order to stop the Duplicate Content URLs being indexed.
- an internal rewrite so that URL requests for:
www.example.com/category/gifts
pull the content from this internal filepath:
/index.php/category/gifts or whatever internal filepath with parameters it is really located at.
You might also need another 301 redirect from a URL something like:
from: www.example.com/index.php?category=gifts
to: www.example.com/category/gifts
in order to stop direct accesses to the script itself.
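That last redirect might look something like the following. It is a sketch that takes the parameter name category from the example above as given; a real application's parameter names and URL layout will differ.

```apache
RewriteEngine On

# Act only on what the client actually requested, not on internally rewritten URLs
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.php\?
# Capture the value of a lone category parameter
RewriteCond %{QUERY_STRING} ^category=([^&]+)$
# %1 is the captured value; the trailing ? drops the original query string
RewriteRule ^index\.php$ http://www.example.com/category/%1? [R=301,L]
```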
[edit]Fixed typos in examples[/edit]
[edited by: g1smd at 5:02 pm (utc) on July 9, 2008]
Because of the Codeigniter framework, there is no such directory as index.php/category, etc. Codeigniter has an SEO-friendly URL system which is translated into query strings. So this URL:
/category/gift-baskets
is interpreted by the web application as:
index.php?category=gift-baskets
There is already one rewrite rule which helps to have shorter url's which omit index.php:
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php/$1 [L]
Which is standard practice in all Codeigniter applications. My problem is, the URL still works when index.php is included, which is a risk because there are two different url's which can produce the same content...so, if somehow a URL is accidentally indexed with "index.php" in there, I would like it to force the URL without it...
Oddly enough, we have not published any URL's that contain index.php, but somehow Google's bot has found a way to index a few of them that way!
So, I am looking for a few lines to put into htaccess that will accomplish this, if anyone knows it would be a huge help...
[edited by: codeman at 4:55 pm (utc) on July 9, 2008]
Fix the links on your pages, then follow g1smd's procedure.
Jim
I think something was miscommunicated - all of the links on our site correctly do NOT include index.php, and appear how we wish them to be indexed.
Somehow, at some point, probably during the site's development, the mysterious Googlebot picked up a few that contain index.php. There is already the Rewrite in place above so that the web application will use index.php even if it is not in the URL. What I need to do is redirect any URLs containing index.php to the same URL without it. I'm already up to that point; I just don't know how to write this in mod_rewrite...
Yes, it looks like a simple redirect will fix the last of your issues, but make sure that it is using a RewriteRule and not Redirect otherwise you cannot guarantee the order in which it runs.
I think you might be overestimating me! Based on my beginner-level knowledge of regular expressions and htaccess syntax, I came up with this, but it does not work (although at least it is not harming anything, which is more than what I usually accomplish):
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www.example.com$ [NC]
RewriteRule /index.php/^(.*)$ /$1 [R,L]
RewriteCond %{HTTP_HOST} ^(www.)?example\.com/index\.php/$ [NC]
RewriteRule ^(.+)/$ [%{HTTP_HOST}...] [R=301,L]
Nearly!
^ goes at the front, the rule cannot see the leading slash, and [R] gives a 302 redirect. The redirect should also force the domain.
This is closer:
RewriteRule ^index.php/(.*) http://www.domain.com/$1? [R=301,L]
I also like to add a ? to the end of the target, so that unnecessary parameters are removed from the final URL.
*** RewriteCond %{HTTP_HOST} ^www.example.com$ [NC] ***
Why have you got a RewriteCond that only runs the rule if a www URL was requested?
Surely you want it to run for both www and non-www and force both over to www at the same time.
That line can be omitted.
I tried this line:
RewriteRule ^index.php/(.*) [domain.com...] [R=301,L]
Unfortunately, it's conflicting with something else...it causes an error on the site that the server is redirecting in a way that will never complete...
It could also be something specific to CodeIgniter's url re-routing...
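A likely cause of the loop: in .htaccess, mod_rewrite runs the rules again after the internal rewrite to index.php/$1, so the new redirect also fires on the rewritten path and sends the client round again. One common way out is to test %{THE_REQUEST}, which holds the request line exactly as the client sent it. The sketch below is untested against this site and assumes the rules live in the .htaccess file in the document root. It also places both redirects ahead of the internal rewrite, which should keep index.php out of the non-www redirect as well.

```apache
RewriteEngine On
RewriteBase /

# 1. Redirect only when the client itself asked for /index.php/...
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.php/
RewriteRule ^index\.php/(.*)$ http://www.example.com/$1 [R=301,L]

# 2. Canonical hostname, checked before the path is rewritten
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# 3. Internal rewrite to the Codeigniter front controller comes last
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php/$1 [L]
```

With this ordering, a request for example.com/index.php/category/gifts should reach http://www.example.com/category/gifts in a single redirect.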