Trying to remove file extensions for site

mod rewrite rules, can't find how

         

mikeg

11:01 am on Jun 19, 2007 (gmt 0)

10+ Year Member



First of all, it's my first post here. I have spent quite a considerable amount of hair pulling time out on the net, searching for how to do this.

This, being:
We have a website. Right now it's in html files. In the future, it could be in php files, it may be in asp files. Indeed, in the near future we may include php files into the site as well...

At present we have URLs like this:
www.example.com/about-us.html
www.example.com/contact-us.html
www.example.com/news.html

You get the idea. In the future, those files may become contact-us.php
They may become news.php, as we may create a dynamic news instead of constantly updating html.

So, using rewrites, I'd like to make it so that ALL URLs are:
www.example.com/news
www.example.com/about-us
www.example.com/contact-us

Now that should be fairly simple, although it was a pain to find the info out.
My problem is that you can simply stick the file extension on the end and view the file (all html at the moment, so not tested on php yet).
type in www.example.com/news.html - and it shows you the news file.
The same as if you had gone to www.example.com/news

I'd like it if when you added the extension, you got an error - therefore completely masking (supposedly) what the underlying file system is from the user.

Is this possible?
Furthermore, I've found reading the rules on mod rewriting to be damned hard work. I am not even positive what we currently have is correct, although it seems to work (so you can enter the url with, or without, the file extension to get to see the required file).

I am asking for some help please, as after spending nearly 2 hours searching through groups on google, various websites proclaiming to have the answer, and other locations where everyone's solution is different... I've started to "lose the plot" as they say. Surely it's not that complicated that 2 hours should be spent trying to understand! I must be doing something wrong.

Sorry for the lengthy post, hopefully it hasn't put folks off from helping if they can!

Here is the current .htaccess (no access to the conf file at this time), as I say - I don't know if this is right or wrong at this time - it's currently supposed to only be for html files..

RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*)\.html$ $1 [L,QSA]

I don't actually understand what's going on there - as far as I can see it's any filename.html to just filename?

Thanks for your time.

jdMorgan

2:21 pm on Jun 19, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Looking at this code, I'm afraid you may have the underlying concept backwards.

Working through it as a functional process rather than as an implementation plan, here's what's needed:

First, remove the ".html" extensions from all of the links on all of your pages -- Get incoming links from other sites updated too, if possible.
Then add some mod_rewrite code to internally rewrite extensionless *URL* requests to .html *files*, if those files exist.
Finally, add some mod_rewrite code to externally redirect any direct client (user or robot) requests for .html-extension URLs to the corresponding extensionless URLs.

In actual implementation, you'd reverse the first two steps to avoid any downtime on your site, but it helps conceptually to start at your pages --which is where your URLs are defined for the world-- and work through an imaginary click on a link, a URL request to the server, translation of the URL to a filename through mod_rewrite, and then into the server filesystem.

So the code would look something like this:


RewriteEngine on
#
## Internally rewrite extensionless file requests to .html files ##
#
# If the requested URI does not contain a period in the final path-part
RewriteCond %{REQUEST_URI} !(\.[^./]+)$
# and if it does not exist as a directory
RewriteCond %{REQUEST_FILENAME} !-d
# and if it does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# then add .html to get the actual filename
RewriteRule (.*) /$1.html [L]
#
#
## Externally redirect clients directly requesting .html page URIs to extensionless URIs
#
# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]

The code is slightly complicated to allow for URIs which contain dots in any but the final path-part. For example, /my.stuff/page will be rewritten to /my.stuff/page.html.

To clarify the redirect section of the code, THE_REQUEST contains the full HTTP request line sent by the client (method, requested URL and protocol), and might typically look like this:

GET /page.html HTTP/1.0
or
GET /shop.php?item=4329 HTTP/1.1
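
One likely reason for testing THE_REQUEST rather than the current URL-path, sketched here as an illustration: in .htaccess, after the internal rewrite adds .html, the rules are processed again against the rewritten path. A redirect that looked only at the path would then also fire on the server's own rewrite and send the client back to the extensionless URL, over and over. THE_REQUEST keeps the client's original request line, so it only matches what the client actually asked for.

```apache
# Loop-prone sketch (left commented out): this would match the internally
# rewritten /page.html as well as a client request for /page.html.
# RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]

# Safer: redirect only when the client's own request line named the .html URL.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]
```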

Things get more complicated when you have multiple types of real files, such as html *and* php. In that case, you must check to see whether each kind of real file exists before rewriting the extensionless URL request to it. And if both an html and a php file exist, the order of tests in your code will determine which filetype will always take priority.

Jim

[Edited: see discussion below.]

[edited by: jdMorgan at 5:16 am (utc) on June 20, 2007]

jdMorgan

2:41 pm on Jun 19, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To further clarify, files should/must always have extensions, it is extensionless URLs that you are seeking.

Without file extensions, your server won't have any way to determine the proper MIME-type header to send with its responses to client requests, and this will lead to all sorts of secondary problems.

Jim

phranque

11:16 pm on Jun 19, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



welcome to WebmasterWorld, mikeg!

I'd like it if when you added the extension, you got an error - therefore completely masking (supposedly) what the underlying file system is from the user.

Is this possible?


possible but not necessary or friendly.
as in jim's solution, you externally redirect if necessary to the extensionless url and internally rewrite if necessary to the correct extension.
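
For completeness, the "error instead of redirect" behaviour mikeg asked about could be sketched as follows. This is an illustration only, not something posted in the thread: it reuses the THE_REQUEST test from Jim's redirect rule and swaps the redirect for the [F] (forbidden) flag, which answers with 403. The [G] flag would answer with 410 Gone instead.

```apache
# If the client's own request line asks for a .html URL
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
# then refuse the request with 403 Forbidden instead of redirecting
RewriteRule \.html$ - [F]
```

Old bookmarks and inbound links to the .html URLs would then fail, which is one reason the 301 redirect is the friendlier choice.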

Surely it's not that complicated that 2 hours should be spent trying to understand! I must be doing something wrong.

actually it is that complicated.
you could read for 2 days and not get it all.
the problem is:
"The great thing about mod_rewrite is it gives you all the configurability and flexibility of Sendmail. The downside to mod_rewrite is that it gives you all the configurability and flexibility of Sendmail."
-- Brian Behlendorf, Apache Group

sc112

11:57 pm on Jun 19, 2007 (gmt 0)

10+ Year Member



Hi, Jim. First time posting in this forum. I have learnt so much from your helpful posts and tutorials. Much thanks.

Re the above rules,

# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(([^.]+\.)+)html$ http://www.example.com/$1

I am thinking the rewritten uri will have a trailing dot because the dot before "html" will be included in the pattern, right?

So can we do this instead:

# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.+)\.html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.html$ http://www.example.com/$1

I am still trying to learn this stuff.

jdMorgan

5:19 am on Jun 20, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well-spotted! You're quite right -- I dashed that off and messed it up. I also omitted the [R=301,L] on the end!

Sometimes I try too hard to write "generalized" code... A good compromise might be:


# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]

But if there are never any periods in the site's directory names, the following would be faster:

# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+)\.html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^([^.]+)\.html$ http://www.example.com/$1 [R=301,L]

I hope I haven't injected another error... :o

I will correct my post above so that my error doesn't propagate through "impatient copying" and lead to subsequent "support" requests here.

Jim

sc112

12:14 pm on Jun 20, 2007 (gmt 0)

10+ Year Member



Thanks Jim for the clarification.

I am trying to understand why we have to test for a dot in the file path at all. All we care about is that the path ends in ".html". What goes before that is immaterial. Is that so?

mikeg

2:34 pm on Jun 20, 2007 (gmt 0)

10+ Year Member



jdMorgan, thank you very much for the informative post. Very helpful, and miles clearer than anything I have found so far!

You were right, I did have the concept backwards. To add to the confusion, I hadn't removed extensions from the files. I understood that the files will still have extensions (especially php, since the server only interprets the file, rather than exposing its source, when the extension identifies it as php), and all my links did have the extensions removed from them. But when it came to applying logic and realising that what actually happens is Apache taking an extensionless URL and looking for a matching filename with a given extension (html in this case) to serve up in response, my mind switched off... probably why I found it so damned confusing to work out what I wanted to achieve.

Of course, that's if I have got it right now... otherwise, grr!

Thank you a lot for the help, it's very much appreciated. It also explains why I had so much trouble finding answers: I was asking the wrong question - typical :)
Thanks to everyone's contributions too, big help :)

phranque

8:33 pm on Jun 20, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I am trying to understand why we have to test for a dot in the file path at all. All we care about is that the path ends in ".html". What goes before that is immaterial. Is that so?

if you are asking why this

^([^.]+)\.html$
rather than this
^(.*)\.html$
the answer is that [^.]+ is far more efficient than the "ambiguous, greedy and promiscuous" .*

jdMorgan

9:26 pm on Jun 20, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, the idea is to allow a straight left-to-right evaluation of the URL-path against the pattern. We're telling the regex matching engine, "Just match everything into the first subpattern until you find a period, then start matching against the rest of the pattern." It avoids "back-off-from-the-end-and-retry" matching attempts.

However, like many things with mod_rewrite, it often comes down to style. I intend to go dig back through the source code and check to make sure that there actually *is* an efficiency improvement when single forward-looking negative-match end-anchored patterns are used. I'm sure there is an advantage in patterns with multiple subpatterns -- an example of a very-inefficient pattern being "^(.*)/(.*)/(.*)$" -- but having been questioned about this particular case, I'm not sure there's a big advantage (or even any) in a single-subpattern end-anchored pattern.
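
As a sketch of the multiple-subpattern case mentioned above, here is the same three-part match written with negated character classes. The target /index.php and its parameter names are hypothetical, used only to give the rule something to rewrite to. The two patterns behave the same only for paths such as a/b/c with exactly two slashes and non-empty parts; with extra slashes the greedy version still matches (with slashes inside the first group), while the second does not match at all.

```apache
# Slower style: each .* first grabs everything to the end of the path,
# then backs off one character at a time to find the slashes.
# RewriteRule ^(.*)/(.*)/(.*)$ /index.php?a=$1&b=$2&c=$3 [L]

# Left-to-right style: each group stops at the next slash.
RewriteRule ^([^/]+)/([^/]+)/([^/]+)$ /index.php?a=$1&b=$2&c=$3 [L]
```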

In some cases, the performance advantage of one pattern against another depends on whether the "head" of the matched URL-path is longer than the "tail," and in the case of single-pattern end-anchored short-tailed matches, the advantage may actually go to the "^(.+)\.html$" style of pattern.

I'm struggling to find descriptive terms here; In this case, the "head" is the part of the URL-path matched to "(.+)", while the "tail" is ".html". These are not "official terms" -- As far as I know, I just invented them (but probably not).

Basically, you have to look at the regex source code and follow the matching process step-by-step, character-by-character while counting matching-attempt passes in order to make this determination.

Other than diverting the original poster's thread, these are good questions and help everybody learn -- including your humble moderator... :)

Jim

jdMorgan

9:29 pm on Jun 20, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



mikeg,

Well, I'm glad this helped. Your question and situation are actually one or two steps more complex than the average "point of confusion", so don't feel bad -- It's far better to be confused by a complex problem than by a simple one, and rewriting/redirecting to attain extensionless URLs is not a basic task.

Jim

sc112

12:38 am on Jun 21, 2007 (gmt 0)

10+ Year Member



Thanks Jim and phranque for the lesson on efficiency. Jim had talked about the importance of efficient pattern matching in his tutorial. I will remember it more now.

But my question had not really been answered. Jim said above:

The code is slightly complicated to allow for URIs which contain dots in any but the final path-part. For example, /my.stuff/page will be rewritten to /my.stuff/page.html.

So is it necessary to check for dots in the file path other than the last one?

jdMorgan

1:22 am on Jun 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



From my last post, I would say, "If the 'tail' of the URL is long, yes, and if it's short, I'm not sure." It is purely an efficiency thing, to avoid repeated "back-off-and-try-again" loops in the regex matching engine.

Jim

sc112

2:10 am on Jun 21, 2007 (gmt 0)

10+ Year Member



Got it. Thanks. Believe it or not, this rewrite stuff is beginning to look like fun!

mikeg

1:09 pm on Jul 4, 2007 (gmt 0)

10+ Year Member



Old bumpage, for newage questionage!

So

RewriteEngine on
#
## Internally rewrite extensionless file requests to .html files ##
#
# If the requested URI does not contain a period in the final path-part
RewriteCond %{REQUEST_URI} !(\.[^./]+)$
# and if it does not exist as a directory
RewriteCond %{REQUEST_FILENAME} !-d
# and if it does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# then add .html to get the actual filename
RewriteRule (.*) /$1.html [L]

## Externally redirect clients directly requesting .html page URIs to extensionless URIs
#
# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]

This is all working perfectly, but now the reason why I originally wanted this. The site is static html pages throughout most of it. However, turns out we now need some lovely functionality put into it - so one or two pages are going php...

If I am right in thinking - then this part:

## Externally redirect clients directly requesting .html page URIs to extensionless URIs
#
# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]

can be extended so that external requests to a php file are redirected to extensionless URI


## Externally redirect clients directly requesting .html page URIs to extensionless URIs
#
# If client request header contains html file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+html\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.html$ http://www.example.com/$1 [R=301,L]

# If client request header contains php file extension
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+php\ HTTP
# externally redirect to extensionless URI
RewriteRule ^(.+)\.php$ http://www.example.com/$1 [R=301,L]
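
If both redirect rules above behave as intended, they could probably be merged into one using alternation. A sketch, assuming the same www.example.com hostname and that only the html and php extensions need handling:

```apache
# If the client's request line asks for a .html or .php URL
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.)+(html|php)\ HTTP
# externally redirect to the extensionless URI
RewriteRule ^(.+)\.(html|php)$ http://www.example.com/$1 [R=301,L]
```

Like the separate rules, this condition does not match a request line that carries a query string, such as GET /shop.php?item=4329 HTTP/1.1, so those requests are left alone.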

But the problem comes internally afaik (assuming that's right above)

# If the requested URI does not contain a period in the final path-part
RewriteCond %{REQUEST_URI} !(\.[^./]+)$
# and if it does not exist as a directory
RewriteCond %{REQUEST_FILENAME} !-d
# and if it does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# then add .html to get the actual filename
RewriteRule (.*) /$1.html [L]

If it doesn't contain an extension, then what should happen (realistically) is that it checks for an existing file of type html, and if it can't find one, then add .php to get the actual filename...

This is where I get stuck - again. I'm guessing here (I looked through all the Apache documentation and cannot find any reference to what [L], [R] or any other square-bracket capitals mean!), assuming L is "leave" (stop matching) and R is "redirect".

I'm sure I've got this 100% wrong, but I thought I'd put down what I tried to do.

# If the requested URI does not contain a period in the final path-part
RewriteCond %{REQUEST_URI} !(\.[^./]+)$
# and if it does not exist as a directory
RewriteCond %{REQUEST_FILENAME} !-d
# and if it does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# and if it does not exist as a html file
RewriteCond %{REQUEST_FILENAME}.html !-f
# then add .php to get the actual filename
RewriteRule (.*) /$1.php [L]

# Otherwise, if the requested URI does not contain a period in the final path-part
RewriteCond %{REQUEST_URI} !(\.[^./]+)$
# and if it does not exist as a directory
RewriteCond %{REQUEST_FILENAME} !-d
# and if it does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# then add .html to get the actual filename
RewriteRule (.*) /$1.html [L]

Is this even remotely right? The idea is that if the "page" they are trying to view isn't an html file, then it must be php instead. If there is an html file of that name, though, then serve up said html file instead.

I've had, in truth, ZERO luck testing it - but I'm not sure if this is to do with my host... When I downloaded the last htaccess file to make the changes, I found it contained only the bloody password-protected directory settings - however, the rules still work fine for going to (for example) http://www.example.com/information and seeing the contents of the file at http://www.example.com/information.html
Naturally, I am making it so that if you go to http://www.example.com/information and there is no information.html file there - then it should serve the information.php file instead.

I hope that's clear... I'm definitely not a fan of these things yet :P
Any help/guidance, or pointers would be much appreciated.

jdMorgan

3:11 pm on Jul 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Above, sc112 said he was almost finding this to be fun... I'm afraid it's a slippery slope: Start out apprehensive, begin to enjoy it a bit, and then you find yourself preferring mod_rewrite to the newspaper crossword puzzles...

Your solution should work, but the redundant file-exists checks and other tests are inefficient, and may lead to an early server upgrade being required. Logically, they aren't necessary.

The solution lies in finding those "mystery flags" -- They are documented in the RewriteCond and RewriteRule sections of the Apache mod_rewrite documentation [httpd.apache.org]. Do yourself a favor; Print out the mod_rewrite docs, and read them repeatedly over several weeks... Maybe just a section at a time when you're sitting in a place that echoes and is decorated with a lot of white porcelain. This will help a lot; Even if you don't remember the details, you'll remember that the docs say something about the problem at hand, and you'll know where to look...

I'd suggest a reversal of the logic that will allow you to 'share' some of the code between the four possibilities of "not extensionless URL", "exists as php", "exists as html" and "doesn't exist". In addition, this approach leaves your error log showing the originally-requested URL-path in the case that the requested URL doesn't exist at all -- which may save you some major confusion in the future (and at a time when you can't afford confusion, because you'll be trying to fix some other problem). It is also easily extensible:


# If the requested URL contains a period in the final path-part
RewriteCond %{REQUEST_URI} (\.[^./]+)$ [OR]
# Or if it exists as a directory
RewriteCond %{REQUEST_FILENAME} -d [OR]
# Or if it exists as a file (no [OR] here: this is the last condition)
RewriteCond %{REQUEST_FILENAME} -f
# Then leave URL alone and skip the next two rules
RewriteRule .* - [S=2]
#
# Extensionless URL does not resolve to an existing
# file or directory, so try it as php and html
#
# If requested extensionless URL exists as .php
RewriteCond %{REQUEST_FILENAME}.php -f
# then add .php to get the actual filename
RewriteRule (.+) /$1.php [L]
#
# Else if requested extensionless URL exists as .html
RewriteCond %{REQUEST_FILENAME}.html -f
# then add .html to get the actual filename
RewriteRule (.+) /$1.html [L]
#
# The "Skip" function from the first rule lands here; Execution
# will resume on the following line. We will also land here if
# the extensionless file does not exist as .php or as .html.

Now the problem is that .html files will each have four 'exists' checks per request. Since the majority of your extensionless URLs resolve to .html files, that's not the best situation. So IF you won't have both HTML and PHP versions of the same page sitting on the server, OR if you don't care about the 'priority' of php over html if both do exist, then it would be better to swap the last two steps above, so that the .html extension is checked first, before the .php extension. This does change the 'priority' in the case that both exist, but reduces the number of file checks for the majority extensionless-URI .html files from four to three.
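
The swap described above might look like this. It is a sketch derived from the rules in this post, with only the order of the last two rules changed, and with no [OR] on the final condition of the first rule since there is no further condition for it to chain to.

```apache
# If the requested URL contains a period in the final path-part
RewriteCond %{REQUEST_URI} (\.[^./]+)$ [OR]
# or if it exists as a directory
RewriteCond %{REQUEST_FILENAME} -d [OR]
# or if it exists as a file
RewriteCond %{REQUEST_FILENAME} -f
# then leave the URL alone and skip the next two rules
RewriteRule .* - [S=2]
#
# If the requested extensionless URL exists as .html
RewriteCond %{REQUEST_FILENAME}.html -f
# then add .html to get the actual filename
RewriteRule (.+) /$1.html [L]
#
# Else if the requested extensionless URL exists as .php
RewriteCond %{REQUEST_FILENAME}.php -f
# then add .php to get the actual filename
RewriteRule (.+) /$1.php [L]
```

With this order, an extensionless URL backed by an .html file costs three filesystem checks (directory, file, .html file) instead of four.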

Jim

[edited by: jdMorgan at 3:12 pm (utc) on July 4, 2007]

SumDumGuys

3:21 pm on Jul 5, 2007 (gmt 0)

10+ Year Member



I stumbled across this board yesterday, what a lucky find!

I have a similar situation but the details vary and none of the rules suggested here, work for me. Here is my setup:

Old site, over 1,000 pages, mostly .htm and some .html extensions. These pages are linked to, so no broken links allowed by changing the file extension from [?] to .php.

As I see it, I need (at least) three rules:

- If filename (less extension) does not exist, do nothing.
  Allow default 404 page to display.
- If filename with .php extension exists, display .php file.
- If filename with .htm or .html extension exists, rewrite the
  extension to .php and display the .php file.

I've looked at mod_rewrite, also the alias and symlink functions but I think mod_rewrite is the way to go for my needs.

Could someone please help with the RewriteRules?
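
One possible shape for such rules, offered as a hedged sketch rather than a tested answer. It assumes the .htaccess file sits in the document root, that DOCUMENT_ROOT reflects the real filesystem path (not always true on shared hosts), and that each converted page keeps its base name with a .php extension.

```apache
RewriteEngine on
# If a .php file with the same base name as the requested page exists
RewriteCond %{DOCUMENT_ROOT}/$1.php -f
# then internally rewrite the old .htm or .html URL to that .php file
RewriteRule ^(.+)\.html?$ /$1.php [L]
```

Requests made directly for .php files are not touched by this rule, and a request for a page with no .php counterpart falls through to normal handling, so a missing page still gets the default 404.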

SumDumGuy

g1smd

10:39 pm on Jul 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Why don't you just use AddType to tell the server to treat files and URLs that end in .html as if they contain PHP scripts and then you have no need to change any of the files, URLs, or links on your site at all?
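
A minimal sketch of that suggestion for an .htaccess file. The MIME type shown is the one commonly used when PHP runs as an Apache module; that is an assumption here, and hosts that run PHP as CGI or FastCGI typically need a different handler setup.

```apache
# Ask Apache to hand .html and .htm files to PHP as well as .php files.
# application/x-httpd-php is assumed to be the type the host's PHP
# module is registered for; check with the host if this has no effect.
AddType application/x-httpd-php .php .html .htm
```

Every .html and .htm request then goes through the PHP interpreter, including pages that contain no PHP at all.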

SumDumGuys

2:15 am on Jul 7, 2007 (gmt 0)

10+ Year Member



Hi, thanks for the reply.

I tried that, it didn't work. For much the same reason the rewrite rules I tried didn't work either: because the host apparently has some other rule in place to limit and/or ignore .htaccess files that have been arbitrarily installed by the clients. So I tried adding filetype handlers to an .htaccess file on another server I have access to, on the other server it worked like a charm.

I'll need to take the issue up with the Tech Support people at the one host. Meanwhile, I'm keeping tabs on this forum, there are lots of people to learn from here!

I've been rethinking whether to conceal that PHP is installed from users at large, there is much to consider and solid reasoning for keeping that bit of information out of the spotlight. So I may end up wanting to conceal file extensions too, time will tell.

SDG

mikeg

10:49 am on Jul 11, 2007 (gmt 0)

10+ Year Member



Thanks a lot JD, you've been invaluable!

Problem is that it's not finding that php page, in spite of the htaccess file looking/pointing correctly for it.

Currently trying to raise some support from the provider to find out why this is not working, considering the htaccess is correct!

If you add the extension, it removes it just fine (html or php).
(eg www.example.com/about-us.php becomes www.example.com/about-us in the browser nav bar).

If the file exists as a html file - it serves it up.
(eg www.example.com/about-us.html exists, then www.example.com/about-us shows the html page).

If it exists as a php file only though, it just gives you an error page - despite the extensionless url actually pointing to the php file (ie http://www.example.com/contact-us shows an error page, whilst http://www.example.com/contact-us.php is an actual file!).

Hoping to resolve it shortly, if the host/provider gets back to us that is! Thanks again for the help, and I'll have to read up more on the subject as time allows ;)

jdMorgan

2:40 pm on Jul 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you are on Apache 1.x, ask the host to be sure that mod_rewrite is loaded *After* PHP in the LoadModule list -- Modules are executed in REVERSE order of the LoadModule list (as documented) and many hosts get this wrong. If PHP is added as a module at the end of the LoadModule list, then it will execute before mod_rewrite, and no mod_rewrite rule will have any effect on requests that would normally resolve to PHP files.
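
For illustration only, the ordering described above might look like this in an Apache 1.3 httpd.conf. The module paths and the PHP module name (php4_module here) are assumptions and vary by installation. If the configuration also uses an AddModule list, the order there may matter as well.

```apache
# PHP is listed first and mod_rewrite after it. Because Apache 1.x runs
# modules in reverse order of this list, mod_rewrite then gets to handle
# the request before PHP does.
LoadModule php4_module    libexec/libphp4.so
LoadModule rewrite_module libexec/mod_rewrite.so
```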

Apache 2.x uses an internal priority scheme, so this won't apply.

Jim