I've been searching for a decent solution to this and haven't come up with anything conclusive. I'd be grateful if you could help me out.
I have a pretty standard htaccess with:
# with category
RewriteRule ^([^/]+)/([^/]+)\.html$ index.php?cat_name=$1&page_name=$2 [L]
# without category
RewriteRule ^([^/]+)\.html$ index.php?page_name=$1 [L]
This bit works fine, except when there are ampersands in the URL (a common issue, I know). I found some solutions and managed to put this together for dealing with up to 3 ampersands in the page name. It basically re-encodes ampersands as %26:
#3 ampersands
RewriteRule ^([^&/]+)&([^&/]+)&([^&/]+)&([^&/]+)\.html$ $1\%26$2\%26$3\%26$4\.html
#2 ampersands
RewriteRule ^([^&/]+)&([^&/]+)&([^&/]+)\.html$ $1\%26$2\%26$3\.html
#1 ampersand
RewriteRule ^([^&/]+)&([^&/]+)\.html$ $1\%26$2\.html
This all seems to be working well. Now I want to escape ampersands in the category name. I've got this so far for 1 ampersand in the category:
#1 ampersand
RewriteRule ^([^&/]+)&([^&/]+)/([^/]+)\.html$ $1\%26$2/$3\.html
With this I get a '500 Internal Server Error'. I've tried checking the server logs and couldn't find anything relating to that query. I'm not very experienced with mod_rewrite.
Does anyone know what's up with my code or whether there's a better way to deal with ampersands? If I could find a single rewrite string that would just change all ampersands in the url to %26, I'd be happy!
I notice that you have not used the [L] flag on any of your rules. In general, best practice is to use [L] on *all* rules unless you know a good reason why you cannot do so. However, since mod_rewrite in .htaccess is recursive, even that won't help with this problem, which is that you've probably got an 'infinite' loop -- check your server error log to confirm, as it will likely tell you exactly what is wrong.
What's probably happening is that the URL arrives from the client with encoded ampersands. The RewriteRule un-encodes these and compares to your [^&]&[^&] pattern, which matches. So the ampersands are replaced with %26. If any rewrite is executed in .htaccess, mod_rewrite restarts, in order to check, for example, if the new URL is subject to any access restrictions or further rewriting. So now the RewriteRule un-encodes the %26 tokens to ampersands, replaces them, and the whole thing repeats until the server gives up -- again, check your error log.
So the problem is that what you want to do --replace %26 with %26-- makes no sense, and the server is trying to do it anyway.
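One way to confirm the loop described above is to turn on rewrite logging. This is only a sketch: the log file path is a made-up example, the directives differ between Apache versions, and on Apache 2.2 they are accepted only in the main server or virtual-host configuration, not in .htaccess.

```apache
# Apache 2.2.x: main server or <VirtualHost> config only (not .htaccess).
# The log file path below is a hypothetical example.
RewriteLog "/var/log/apache2/rewrite.log"
RewriteLogLevel 3

# Apache 2.4.x removed the two directives above; the rough equivalent is:
# LogLevel warn rewrite:trace3
```

When a rewrite loops, the main error log typically reports that the request exceeded the limit of internal redirects (10 by default, adjustable with LimitInternalRecursion).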
Additionally, your URLs with embedded ampersands violate the best-practices recommendations in RFC2396 [faqs.org], and I would consider moving away from that approach and sticking to Web conventions. An ampersand is a "reserved character" according to the spec, and its use is restricted. Use a-z, A-Z, 0-9, -, _, and "." in URLs, and nothing else. This will reduce the "URL complications" in your life, and lead to inner peace and joy. ;)
Note that query strings are not part of a URL. They are data attached to a URL, to be passed to the resource at that URL. If the above did not make sense, re-read with that concept in mind; I was addressing only the URL itself, and not the attached query string. Query strings are allowed to use additional characters not allowed in URLs.
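If the site does move to URLs built only from those characters, the patterns themselves can enforce it. A minimal sketch, reusing the index.php parameters from the question; the exact character classes are an assumption about which names the site would allow:

```apache
# Category and page names limited to letters, digits, hyphen and underscore.
# Anything else (including "&") fails to match and falls through to
# whatever the server does with an unmatched URL, typically a 404.
RewriteRule ^([A-Za-z0-9_-]+)/([A-Za-z0-9_-]+)\.html$ index.php?cat_name=$1&page_name=$2 [L]
RewriteRule ^([A-Za-z0-9_-]+)\.html$ index.php?page_name=$1 [L]
```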
Jim
"What's probably happening is that the URL arrives from the client with encoded ampersands. The RewriteRule un-encodes these and compares to your [^&]&[^&] pattern, which matches. So the ampersands are replaced with %26. If any rewrite is executed in .htaccess, mod_rewrite restarts, in order to check, for example, if the new URL is subject to any access restrictions or further rewriting. So now the RewriteRule un-encodes the %26 tokens to ampersands, replaces them, and the whole thing repeats until the server gives up -- again, check your error log."
If this is the case, I don't see why the set of rules for page_name fixes the problem, as surely the same thing should happen? It's only when I try to change it on a cat_name (cat_name/page_name.html) that it causes an error. I can't find anything in the error logs relating to this either, only other errors I've had whilst trying stuff out.
"Additionally, your URLs with embedded ampersands violate the best-practices recommendations in RFC2396, and I would consider moving away from that approach and sticking to Web conventions. An ampersand is a "reserved character" according to the spec, and its use is restricted. Use a-z, A-Z, 0-9, -, _, and "." in URLs, and nothing else. This will reduce the "URL complications" in your life, and lead to inner peace and joy. ;)"
This definitely does make sense though. I was looking for a quick fix, but maybe I should go back and look at this. The only problem is that if I don't have ampersands, I'll end up with duplicate URLs for 'bob&smith' and 'bob smith' if I happened to have two pages named similarly like this. Unlikely, but I don't like to leave these things to chance. I could ban ampersands from page and category names altogether, but they're very popular for short names, so this is certainly not preferable.
# Rewrite to fix multiply-encoded ampersands in URLs
#
# Three ampersands
RewriteRule ^([^&]+)&([^&]+)&([^&]+)&([^&.]+)\.html$ $1\%26$2\%26$3\%26$4.html [NE,S=2]
#
# Two ampersands
RewriteRule ^([^&]+)&([^&]+)&([^&.]+)\.html$ $1\%26$2\%26$3.html [NE,S=1]
#
# One ampersand
RewriteRule ^([^&]+)&([^&.]+)\.html$ $1\%26$2.html [NE]
#
# Now rewrite to index.php script
#
# Rewrite to index.php script with category
RewriteRule ^([^/]+)/([^/.]+)\.html$ index.php?cat_name=$1&page_name=$2 [L]
#
# Rewrite to index.php script without category
RewriteRule ^([^/.]+)\.html$ index.php?page_name=$1 [L]
Jim
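Not something proposed in this thread, but worth knowing: later Apache releases have a B flag on RewriteRule that escapes the back-references before they are substituted, which is close to the "single rewrite" asked for in the first post. A sketch, assuming an Apache version that supports the flag (check the RewriteRule documentation for your release before relying on it):

```apache
# B = escape non-alphanumeric characters in $1/$2 before substitution,
# so a "&" captured from the path reaches the query string as %26.
RewriteRule ^([^/]+)/([^/]+)\.html$ index.php?cat_name=$1&page_name=$2 [B,L]
RewriteRule ^([^/]+)\.html$ index.php?page_name=$1 [B,L]
```

If the flag is available, the separate one-, two- and three-ampersand rules should not be needed, since the escaping happens inside the rule that builds the query string.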
I might go with your first suggestion and remove all reserved chars from the URL and modify the search to the database. I can't think of any occasions where I'd have got duplicate URLs in the past, so that would be pretty safe.
I'll see if your modified htaccess code works too, because I can use that in unusual cases where duplicate URLs would be an issue.