This, being:
We have a website. Right now it's in html files. In the future, it could be in php files, it may be in asp files. Indeed, in the near future we may include php files into the site as well...
Currently we have URLs like this:
www.example.com/about-us.html
www.example.com/contact-us.html
www.example.com/news.html
You get the idea. In the future, those files may become contact-us.php
They may become news.php, as we may create a dynamic news instead of constantly updating html.
So, using rewrites - I'd like to make it so that ALL URLs are:
www.example.com/news
www.example.com/about-us
www.example.com/contact-us
Now that should be fairly simple, although it was a pain to find the info out.
My problem is that you can simply stick the file extension on the end and view the file (all html at the moment, so not tested on php yet).
type in www.example.com/news.html - and it shows you the news file.
The same as if you had gone to www.example.com/news
I'd like it if when you added the extension, you got an error - therefore completely masking (supposedly) what the underlying file system is from the user.
Is this possible?
Furthermore, I've found reading the rules on mod rewriting to be damned hard work. I am not even positive what we currently have is correct, although it seems to work (so you can enter the url with, or without, the file extension to get to see the required file).
I am asking for some help please, as after spending nearly 2 hours searching through groups on google, various websites proclaiming to have the answer, and other locations where everyone's solution is different... I've started to "lose the plot" as they say. Surely it's not that complicated that 2 hours should be spent trying to understand! I must be doing something wrong.
Sorry for the lengthy post, hopefully it hasn't put folks off from helping if they can!
Here is the current .htaccess (no access to the conf file at this time), as I say - I don't know if this is right or wrong at this time - it's currently supposed to only be for html files..
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*)\.html$ $1 [L,QSA]
I don't actually understand what's going on there - as far as I can see it's any filename.html to just filename?
Thanks for your time.
Working through it as a functional process rather than as an implementation plan, here's what's needed:
First, remove the ".html" extensions from all of the links on all of your pages -- Get incoming links from other sites updated too, if possible.
Then add some mod_rewrite code to internally rewrite extensionless *URL* requests to .html *files*, if those files exist.
Finally, add some mod_rewrite code to externally redirect any direct client (user or robot) requests for .html-extension URLs to the corresponding extensionless URLs.
In actual implementation, you'd reverse the first two steps to avoid any downtime on your site, but it helps conceptually to start at your pages --which is where your URLs are defined for the world-- and work through an imaginary click on a link, a URL request to the server, translation of the URL to a filename through mod_rewrite, and then into the server filesystem.
So the code would look something like this:
RewriteEngine on
#
## Internally rewrite extensionless file requests to .html files ##
#
# If the requested URI does not contain a period in the final path-part
RewriteCond %{REQUEST_URI} !(\.[^./]+)$
# and if it does not exist as a directory
RewriteCond %{REQUEST_FILENAME} !-d
# and if it does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# then add .html to get the actual filename
RewriteRule (.*) /$1.html [L]
#
#
## Externally redirect clients directly requesting .html page URIs to extensionless URIs
#
# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]
To clarify the redirect section of the code, THE_REQUEST contains the request header sent by the client, and might typically look like this:
GET /page.html HTTP/1.0
or
GET /shop.php?item=4329 HTTP/1.1
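As a rough sketch of why the redirect tests THE_REQUEST rather than the current URL-path: THE_REQUEST keeps the request line exactly as the client sent it, even after an internal rewrite, so a request that was rewritten from /page to /page.html is not redirected a second time. The name /page is only an assumed example.

```apache
# Client sends:       GET /page.html HTTP/1.1
#   THE_REQUEST  =    "GET /page.html HTTP/1.1"  -> condition matches, 301 to /page
#
# Client then sends:  GET /page HTTP/1.1
#   The internal rewrite maps /page to /page.html and the .htaccess rules run again,
#   but THE_REQUEST is still "GET /page HTTP/1.1" -> condition fails, no redirect loop
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]
```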
Things get more complicated when you have multiple types of real files, such as html *and* php. In that case, you must check to see whether each kind of real file exists before rewriting the extensionless URL request to it. And if both an html and a php file exist, the order of tests in your code will determine which filetype will always take priority.
Jim
[edited] See discussion below [/edited]
[edited by: jdMorgan at 5:16 am (utc) on June 20, 2007]
Without file extensions, your server won't have any way to determine the proper MIME-type header to send with its responses to client requests, and this will lead to all sorts of secondary problems.
Jim
I'd like it if when you added the extension, you got an error - therefore completely masking (supposedly) what the underlying file system is from the user.
Is this possible?
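One way this could be done, as a minimal sketch rather than a tested answer: it assumes a reasonably recent Apache 2.x, where a non-3xx status in the R flag drops the substitution and returns that status. On older servers the [G] flag (410 Gone) would be a possible substitute. The internal rewrite from the extensionless URL to the .html file has to stay in place; because the condition looks at THE_REQUEST, only requests where the client itself typed the extension are refused.

```apache
# If the client's own request line asks for a .html URL
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
# answer 404 Not Found instead of redirecting ("-" means no substitution)
RewriteRule \.html$ - [R=404,L]
```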
Surely it's not that complicated that 2 hours should be spent trying to understand! I must be doing something wrong.
"The great thing about mod_rewrite is it gives you all the configurability and flexibility of Sendmail. The downside to mod_rewrite is that it gives you all the configurability and flexibility of Sendmail."
-- Brian Behlendorf
Apache Group
Re the above rules,
# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(([^.]+\.)+)html$ http://www.example.com/$1
I am thinking the rewritten uri will have a trailing dot because the dot before "html" will be included in the pattern, right?
So can we do this instead:
# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.+)\.html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.html$ http://www.example.com/$1
I am still trying to learn this stuff.
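The difference between the two rule patterns can be traced by hand. This is an annotated sketch, with /page.html as an assumed example request and the [R=301,L] flags taken from the earlier post.

```apache
# Request: /page.html  (in .htaccess the RewriteRule sees "page.html")
#
# Pattern ^(([^.]+\.)+)html$  -> $1 = "page."  redirect target /page.  (trailing dot)
# Pattern ^(.+)\.html$        -> $1 = "page"   redirect target /page
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.+)\.html\ HTTP
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]
```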
Sometimes I try too hard to write "generalized" code... A good compromise might be:
# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]
# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+)\.html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^([^.]+)\.html$ http://www.example.com/$1 [R=301,L]
I will correct my post above so that my error doesn't propagate through "impatient copying" and lead to subsequent "support" requests here.
Jim
You were right, I did have the concept backwards. To add to the confusion, I hadn't removed extensions from the files. I understood that the files will still have extensions (especially php, since the source is not readable while the server interprets the file, but it is only interpreted because the extension identifies that it should be!), and all my links did have extensions removed from them. But when it came to applying logic to realise that what actually happens is Apache taking an extensionless URL and looking for a matching filename with a given extension (html in this case) to serve up in response, my mind and logic switched off... probably why I found it so damned confusing to find what I wanted to achieve.
Of course, that's if I have got it right now... otherwise, grr!
Thank you a lot for the help, it's very much appreciated. It also explains why I had so much trouble finding answers: asking the wrong question - typical :)
Thanks to everyone's contributions too, big help :)
I am trying to understand why we have to test for a dot in the file path at all. All we care about is that the path ends in ".html". What goes before that is immaterial. Is that so?
if you are asking why this
^([^.]+)\.html$
rather than this
^(.*)\.html$
the answer is that [^.]+ is far more efficient than the "ambiguous, greedy and promiscuous" .*
However, like many things with mod_rewrite, it often comes down to style. I intend to go dig back through the source code and check to make sure that there actually *is* an efficiency improvement when single forward-looking negative-match end-anchored patterns are used. I'm sure there is an advantage in patterns with multiple subpatterns -- an example of a very-inefficient pattern being "^(.*)/(.*)/(.*)$" -- but having been questioned about this particular case, I'm not sure there's a big advantage (or even any) in a single-subpattern end-anchored pattern.
In some cases, the performance advantage of one pattern against another depends on whether the "head" of the matched URL-path is longer than the "tail," and in the case of single-pattern end-anchored short-tailed matches, the advantage may actually go to the "^(.+)\.html$" style of pattern.
I'm struggling to find descriptive terms here; In this case, the "head" is the part of the URL-path matched to "(.+)", while the "tail" is ".html". These are not "official terms" -- As far as I know, I just invented them (but probably not).
Basically, you have to look at the regex source code and follow the matching process step-by-step, character-by-character while counting matching-attempt passes in order to make this determination.
Other than diverting the original poster's thread, these are good questions and help everybody learn -- including your humble moderator... :)
Jim
Well, I'm glad this helped. Your question and situation are actually one or two steps more complex than the average "point of confusion", so don't feel bad -- It's far better to be confused by a complex problem than by a simple one, and rewriting/redirecting to attain extensionless URLs is not a basic task.
Jim
But my question had not really been answered. Jim said above:
The code is slightly complicated to allow for URIs which contain dots in any but the final path-part. For example, /my.stuff/page will be rewritten to /my.stuff/page.html.
So is it necessary to check for dots in the file path other than the last one?
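To make the quoted explanation concrete, here is the internal-rewrite block from earlier in the thread with the first condition traced against a few URL-paths. The paths other than /my.stuff/page are assumed examples.

```apache
# The URL-path must NOT end in ".something" within its final path-part:
#
#   /news             no dot at all                     -> true, may add .html
#   /my.stuff/page    dot only in an earlier path-part  -> true, may add .html
#   /news.html        final part ends in ".html"        -> false, left alone
#   /images/logo.gif  final part ends in ".gif"         -> false, left alone
RewriteCond %{REQUEST_URI} !(\.[^./]+)$
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule (.*) /$1.html [L]
```

Only the final path-part is tested, so a dot in a directory name does not stop the rewrite.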
So
RewriteEngine on
#
## Internally rewrite extensionless file requests to .html files ##
#
# If the requested URI does not contain a period in the final path-part
RewriteCond %{REQUEST_URI} !(\.[^./]+)$
# and if it does not exist as a directory
RewriteCond %{REQUEST_FILENAME} !-d
# and if it does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# then add .html to get the actual filename
RewriteRule (.*) /$1.html [L]
#
## Externally redirect clients directly requesting .html page URIs to extensionless URIs
#
# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]
This is all working perfectly, but now the reason why I originally wanted this. The site is static html pages throughout most of it. However, turns out we now need some lovely functionality put into it - so one or two pages are going php...
If I am right in thinking - then this part:
## Externally redirect clients directly requesting .html page URIs to extensionless URIs
#
# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]
can be extended so that external requests to a php file are redirected to extensionless URI
## Externally redirect clients directly requesting .html page URIs to extensionless URIs
#
# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]
#
# If client request header contains php file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+php\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.php$ http://www.example.com/$1 [R=301,L]
But the problem comes internally afaik (assuming that's right above)
# If the requested URI does not contain a period in the final path-part
RewriteCond %{REQUEST_URI} !(\.[^./]+)$
# and if it does not exist as a directory
RewriteCond %{REQUEST_FILENAME} !-d
# and if it does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# then add .html to get the actual filename
RewriteRule (.*) /$1.html [L]
If it doesn't contain an extension, then what should happen (realistically) is that it checks for an existing file of type html, and if it can't find one, then adds .php to get the actual filename...
This is where I get stuck - again. I've looked through all the Apache documentation and cannot find any reference to what [L], [R] or any other square-bracket capitals mean! I'm assuming L is leave (i.e. stop matching) and R is redirect.
I'm sure I've got this 100% wrong, but I thought I'd put down what I tried to do.
# If the requested URI does not contain a period in the final path-part
RewriteCond %{REQUEST_URI} !(\.[^./]+)$
# and if it does not exist as a directory
RewriteCond %{REQUEST_FILENAME} !-d
# and if it does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# and if it does not exist as a html file
RewriteCond %{REQUEST_FILENAME}.html !-f
# then add .php to get the actual filename
RewriteRule (.*) /$1.php [L]
#
# Otherwise, if the requested URI does not contain a period in the final path-part
RewriteCond %{REQUEST_URI} !(\.[^./]+)$
# and if it does not exist as a directory
RewriteCond %{REQUEST_FILENAME} !-d
# and if it does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# then add .html to get the actual filename
RewriteRule (.*) /$1.html [L]
Is this even remotely right? The idea is that if the "page" they are trying to view isn't a html file, then it must be php instead. If there is a html file of that name, though, then serve up said html file instead.
I've had, in truth, ZERO luck testing it - but I'm not sure if this is to do with my host... The last htaccess file was downloaded to make the changes on, and found to contain only the bloody password-protected directory settings - however, the rules still work fine for going to (for example) http://www.example.com/information and seeing the contents of the file at http://www.example.com/information.html
Naturally, I am making it so that if you go to http://www.example.com/information and there is no information.html file there - then it should be the information.php file instead.
I hope that's clear... I'm definitely not a fan of these things yet :P
Any help/guidance, or pointers would be much appreciated.
Your solution should work, but the redundant file-exists checks and other tests are inefficient, and may lead to an early server upgrade being required. Logically, they aren't necessary.
The solution lies in finding those "mystery flags" -- They are documented in the RewriteCond and RewriteRule sections of the Apache mod_rewrite documentation [httpd.apache.org]. Do yourself a favor; Print out the mod_rewrite docs, and read them repeatedly over several weeks... Maybe just a section at a time when you're sitting in a place that echoes and is decorated with a lot of white porcelain. This will help a lot; Even if you don't remember the details, you'll remember that the docs say something about the problem at hand, and you'll know where to look...
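For quick orientation, here is an annotated sketch of the flags that appear in this thread, summarised from the mod_rewrite documentation. The rule underneath is just the redirect from earlier, repeated so the flags can be seen in place.

```apache
# Flags used in this thread:
#   [L]      last: stop processing the rule set for this pass
#   [R=301]  redirect: send an external redirect with the given status code
#   [QSA]    query string append: keep the original query string when the
#            substitution supplies one of its own
#   [S=2]    skip: if this rule matches, skip the next 2 rules
#   [OR]     (RewriteCond only) combine this condition with the next using OR
#            instead of the default AND
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]
```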
I'd suggest a reversal of the logic that will allow you to 'share' some of the code between the four possibilities of "not extensionless URL", "exists as php", "exists as html" and "doesn't exist". In addition, this approach leaves your error log showing the originally-requested URL-path in the case that the requested URL doesn't exist at all -- which may save you some major confusion in the future (and at a time when you can't afford confusion, because you'll be trying to fix some other problem). It is also easily extensible:
# If the requested URL contains a period in the final path-part
RewriteCond %{REQUEST_URI} (\.[^./]+)$ [OR]
# Or if it exists as a directory
RewriteCond %{REQUEST_FILENAME} -d [OR]
# Or if it exists as a file
RewriteCond %{REQUEST_FILENAME} -f
# Then leave URL alone and skip the next two rules
RewriteRule .* - [S=2]
#
# Extensionless URL does not resolve to an existing
# file or directory, so try it as php and html
#
# If requested extensionless URL exists as .php
RewriteCond %{REQUEST_FILENAME}.php -f
# then add .php to get the actual filename
RewriteRule (.+) /$1.php [L]
#
# Else if requested extensionless URL exists as .html
RewriteCond %{REQUEST_FILENAME}.html -f
# then add .html to get the actual filename
RewriteRule (.+) /$1.html [L]
#
# The "Skip" function from the first rule lands here; Execution
# will resume on the following line. We will also land here if
# the extensionless file does not exist as .php or as .html.
Jim
[edited by: jdMorgan at 3:12 pm (utc) on July 4, 2007]
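To illustrate the "easily extensible" point, here is a sketch of the same skip approach with a third file type added. The .htm extension and its place at the end of the priority order are assumptions for illustration only. Each added rule raises the skip count in the first rule by one.

```apache
# If the URL has an extension, or exists as a directory or a file
# (no [OR] on the final condition before the rule)
RewriteCond %{REQUEST_URI} (\.[^./]+)$ [OR]
RewriteCond %{REQUEST_FILENAME} -d [OR]
RewriteCond %{REQUEST_FILENAME} -f
# then leave it alone; three rules follow now, so skip three
RewriteRule .* - [S=3]
#
# Try .php first
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule (.+) /$1.php [L]
# then .html
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule (.+) /$1.html [L]
# Added rule (assumption): fall back to .htm
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule (.+) /$1.htm [L]
```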
I have a similar situation but the details vary and none of the rules suggested here work for me. Here is my setup:
Old site, over 1,000 pages, mostly .htm and some .html extensions. These pages are linked to, so no broken links allowed by changing the file extension from [?] to .php.
As I see it, I need (at least) three rules:
- If filename (less extension) does not exist, do nothing.
Allow default 404 page to display.
- If filename with .php extension exists, display .php file
- If filename with .htm or .html extension exists, rewrite the
extension to .php and display the .php file.
I've looked at mod_rewrite, also the alias and symlink functions but I think mod_rewrite is the way to go for my needs.
Could someone please help with the RewriteRules?
SumDumGuy
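A minimal, untested sketch of one way those three rules might be expressed. It assumes the .htaccess file sits in the document root and that DOCUMENT_ROOT reflects the real filesystem path, which is not guaranteed on every shared host.

```apache
RewriteEngine on
# If a .php file with the same base name exists
RewriteCond %{DOCUMENT_ROOT}/$1.php -f
# then serve it in place of the requested .htm or .html URL
RewriteRule ^(.+)\.html?$ /$1.php [L]
```

With this in place, old .htm and .html links keep working while the content comes from the .php file. Direct requests for existing .php files are served as usual, and anything with no matching file falls through to the default 404 page.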
I tried that, it didn't work. For much the same reason the rewrite rules I tried didn't work either: because the host apparently has some other rule in place to limit and/or ignore .htaccess files that have been arbitrarily installed by the clients. So I tried adding filetype handlers to an .htaccess file on another server I have access to, on the other server it worked like a charm.
I'll need to take the issue up with the Tech Support people at the one host. Meanwhile, I'm keeping tabs on this forum, there are lots of people to learn from here!
I've been rethinking whether to conceal that PHP is installed from users at large, there is much to consider and solid reasoning for keeping that bit of information out of the spotlight. So I may end up wanting to conceal file extensions too, time will tell.
SDG
Problem is that it's not finding that php page, in spite of the htaccess file looking/pointing correctly for it.
Currently trying to raise some support from the provider to find out why this is not working, considering the htaccess is correct!
If you add the extension, it removes it just fine (html or php).
(eg www.example.com/about-us.php becomes www.example.com/about-us in the browser nav bar).
If the file exists as a html file - it serves it up.
(eg www.example.com/about-us.html exists, then www.example.com/about-us shows the html page).
If it exists as a php file only though, it just gives you an error page - despite the extensionless url actually pointing to the php file (ie http://www.example.com/contact-us shows an error page, whilst http://www.example.com/contact-us.php is an actual file!).
Hoping to resolve it shortly, if the host/provider gets back to us that is! Thanks again for the help, and I'll have to read up more on the subject as time allows ;)
Apache 2.x uses an internal priority scheme, so this won't apply.
Jim