Drupal .htaccess file

Let's optimise it for speed and efficiency, and fix a few bugs

         

g1smd

8:37 pm on Oct 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



We are getting more and more enquiries in this forum about the code found in the default .htaccess files of common CMS, blog, cart and forum packages.

There have been posts about Wordpress, Joomla, ZenCart, osCommerce, phpBB, Drupal, and others in recent months. Many of those .htaccess files contain buggy and inefficient code, code that barely works, and in many cases works so inefficiently that the user will be forced into an early server upgrade once the site becomes moderately busy.

A recent post about Drupal [webmasterworld.com...] caused me to take a look at the default code in that package. From CVS, the current Drupal .htaccess file version looks like this:

#
# Apache/PHP/Drupal settings:
#

# Protect files and directories from prying eyes.
<FilesMatch "\.(engine|inc|info|install|make|module|profile|test|po|sh|.*sql|theme|tpl(\.php)?|xtmpl)$|^(\..*|Entries.*|Repository|Root|Tag|Template)$">
Order allow,deny
</FilesMatch>

# Don't show directory listings for URLs which map to a directory.
Options -Indexes

# Follow symbolic links in this directory.
Options +FollowSymLinks

# Make Drupal handle any 404 errors.
ErrorDocument 404 /index.php

# Force simple error message for requests for non-existent favicon.ico.
<Files favicon.ico>
# There is no end quote below, for compatibility with Apache 1.3.
ErrorDocument 404 "The requested file favicon.ico was not found.
</Files>

# Set the default handler.
DirectoryIndex index.php index.html index.htm

# Override PHP settings that cannot be changed at runtime. See
# sites/default/default.settings.php and drupal_initialize_variables() in
# includes/bootstrap.inc for settings that can be changed at runtime.

# PHP 5, Apache 1 and 2.
<IfModule mod_php5.c>
php_flag magic_quotes_gpc off
php_flag magic_quotes_sybase off
php_flag register_globals off
php_flag session.auto_start off
php_value mbstring.http_input pass
php_value mbstring.http_output pass
php_flag mbstring.encoding_translation off
</IfModule>

# Requires mod_expires to be enabled.
<IfModule mod_expires.c>
# Enable expirations.
ExpiresActive On

# Cache all files for 2 weeks after access (A).
ExpiresDefault A1209600

<FilesMatch \.php$>
# Do not allow PHP scripts to be cached unless they explicitly send cache
# headers themselves. Otherwise all scripts would have to overwrite the
# headers set by mod_expires if they want another caching behavior. This may
# fail if an error occurs early in the bootstrap process, and it may cause
# problems if a non-Drupal PHP file is installed in a subdirectory.
ExpiresActive Off
</FilesMatch>
</IfModule>

# Various rewrite rules.
<IfModule mod_rewrite.c>
RewriteEngine on

# Block access to "hidden" directories whose names begin with a period. This
# includes directories used by version control systems such as Subversion or
# Git to store control files. Files whose names begin with a period, as well
# as the control files used by CVS, are protected by the FilesMatch directive
# above.
#
# NOTE: This only works when mod_rewrite is loaded. Without mod_rewrite, it is
# not possible to block access to entire directories from .htaccess, because
# <DirectoryMatch> is not allowed here.
#
# If you do not have mod_rewrite installed, you should remove these
# directories from your webroot or otherwise protect them from being
# downloaded.
RewriteRule "(^|/)\." - [F]

# If your site can be accessed both with and without the 'www.' prefix, you
# can use one of the following settings to redirect users to your preferred
# URL, either WITH or WITHOUT the 'www.' prefix. Choose ONLY one option:
#
# To redirect all users to access the site WITH the 'www.' prefix,
# (http://example.com/... will be redirected to http://www.example.com/...)
# uncomment the following:
# RewriteCond %{HTTP_HOST} !^www\. [NC]
# RewriteRule ^ http://www.%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
#
# To redirect all users to access the site WITHOUT the 'www.' prefix,
# (http://www.example.com/... will be redirected to http://example.com/...)
# uncomment the following:
# RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
# RewriteRule ^ http://%1%{REQUEST_URI} [L,R=301]

# Modify the RewriteBase if you are using Drupal in a subdirectory or in a
# VirtualDocumentRoot and the rewrite rules are not working properly.
# For example if your site is at http://example.com/drupal uncomment and
# modify the following line:
# RewriteBase /drupal
#
# If your site is running in a VirtualDocumentRoot at http://example.com/,
# uncomment the following line:
# RewriteBase /

# Pass all requests not referring directly to files in the filesystem to
# index.php. Clean URLs are handled in drupal_environment_initialize().
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !=/favicon.ico
RewriteRule ^ index.php [L]

# Rules to correctly serve gzip compressed CSS and JS files.
# Requires both mod_rewrite and mod_headers to be enabled.
<IfModule mod_headers.c>
# Serve gzip compressed CSS files if they exist and the client accepts gzip.
RewriteCond %{HTTP:Accept-encoding} gzip
RewriteCond %{REQUEST_FILENAME}\.gz -s
RewriteRule ^(.*)\.css $1\.css\.gz [QSA]

# Serve gzip compressed JS files if they exist and the client accepts gzip.
RewriteCond %{HTTP:Accept-encoding} gzip
RewriteCond %{REQUEST_FILENAME}\.gz -s
RewriteRule ^(.*)\.js $1\.js\.gz [QSA]

# Serve correct content types, and prevent mod_deflate double gzip.
RewriteRule \.css\.gz$ - [T=text/css,E=no-gzip:1]
RewriteRule \.js\.gz$ - [T=text/javascript,E=no-gzip:1]

<FilesMatch "(\.js\.gz|\.css\.gz)$">
# Serve correct encoding type.
Header append Content-Encoding gzip
# Force proxies to cache gzipped & non-gzipped css/js files separately.
Header append Vary Accept-Encoding
</FilesMatch>
</IfModule>
</IfModule>

# $Id: .htaccess,v 1.110 2010/10/11 23:49:48 dries Exp $

[edited by: g1smd at 9:10 pm (utc) on Oct 16, 2010]

g1smd

8:39 pm on Oct 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There are a number of immediately apparent issues:

# Protect files and directories from prying eyes.
<FilesMatch "\.(engine|inc|info|install|make|module|profile|test|po|sh|.*sql|theme|tpl(\.php)?|xtmpl)$|^(\..*|Entries.*|Repository|Root|Tag|Template)$">
Order allow,deny
</FilesMatch>


The above code is totally ineffective. The "Deny from all" instruction that would make it work is completely missing.

There are other issues. In particular, the ".*sql" part of the pattern is an accident waiting to happen, because it matches any extension that merely ends in "sql". I'd use "(my)?sql" or similar here, assuming that's what the designer intended the pattern to match (as there's nothing in the comments, we'll never know). Something like this:

# Protect files and directories from prying eyes.
<FilesMatch "\.(engine|inc|info|install|make|module|profile|test|po|sh|(my)?sql|theme|tpl(\.php)?|xtmpl)$|^(\..*|Entries.*|Repository|Root|Tag|Template)$">
Order allow,deny
Deny from all
</FilesMatch>
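
For readers on newer servers: Apache 2.4, released after this thread, replaced the Order/Deny syntax with "Require". A hedged sketch of the same block that works under either syntax, assuming mod_authz_core is the module that marks a 2.4 server, might look like this:

```apache
# Sketch only: the same protection written for both authorization syntaxes.
<FilesMatch "\.(engine|inc|info|install|make|module|profile|test|po|sh|(my)?sql|theme|tpl(\.php)?|xtmpl)$|^(\..*|Entries.*|Repository|Root|Tag|Template)$">
  <IfModule mod_authz_core.c>
    # Apache 2.4 and later
    Require all denied
  </IfModule>
  <IfModule !mod_authz_core.c>
    # Apache 2.2 and earlier
    Order allow,deny
    Deny from all
  </IfModule>
</FilesMatch>
```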


--

The "options" part looks like this:

# Don't show directory listings for URLs which map to a directory.
Options -Indexes

# Follow symbolic links in this directory.
Options +FollowSymLinks


Common sense puts these two instructions on a single line:

Options -Indexes +FollowSymLinks


---

Here's the domain canonicalisation code:

 # If your site can be accessed both with and without the 'www.' prefix, you
# can use one of the following settings to redirect users to your preferred
# URL, either WITH or WITHOUT the 'www.' prefix. Choose ONLY one option:
#
# To redirect all users to access the site WITH the 'www.' prefix,
# (http://example.com/... will be redirected to http://www.example.com/...)
# uncomment the following:
# RewriteCond %{HTTP_HOST} !^www\. [NC]
# RewriteRule ^ http://www.%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
#
# To redirect all users to access the site WITHOUT the 'www.' prefix,
# (http://www.example.com/... will be redirected to http://example.com/...)
# uncomment the following:
# RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
# RewriteRule ^ http://%1%{REQUEST_URI} [L,R=301]


I understand the need to make the installation easy for non-technical people, but there are many non-canonical URLs that the above rules fail to redirect, including hostnames with a trailing period or an appended port number. The code is complicated enough, yet it still fails to fully do the job it needs to do.
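
As a rough illustration of a stricter approach, the rule below redirects every request whose Host header is not exactly the canonical name, which also catches the trailing-period and port-number variants. It is a sketch rather than tested Drupal code, and "www.example.com" is a placeholder that would have to be edited by hand:

```apache
# Assumption: the canonical hostname is www.example.com (placeholder).
# Redirect when the Host header is non-blank and is not exactly the
# canonical name. This covers example.com, www.example.com. (trailing
# period), www.example.com:80 (port) and mixed-case variants.
# A blank Host header (HTTP/1.0 clients) is left alone to avoid a loop.
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule ^ http://www.example.com%{REQUEST_URI} [R=301,L]
```

The trade-off is the one discussed throughout this thread: the hostname is hard-coded, so the rule is no longer a simple "uncomment this line" option.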

--

When RewriteBase is used like this it's usually a cop out:

# Modify the RewriteBase if you are using Drupal in a subdirectory or in a
# VirtualDocumentRoot and the rewrite rules are not working properly.
# For example if your site is at http://example.com/drupal uncomment and
# modify the following line:
# RewriteBase /drupal
#
# If your site is running in a VirtualDocumentRoot at http://example.com/,
# uncomment the following line:
# RewriteBase /


It is better to code the rewrites correctly than to rely on this sort of RewriteBase override. More on that below.

--

Drupal includes the "usual" dodgy rewrite code with the "-f" and "-d" checks positioned first so that your server hard drive is rapidly beaten to death:

 # Pass all requests not referring directly to files in the filesystem to
# index.php. Clean URLs are handled in drupal_environment_initialize().
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !=/favicon.ico
RewriteRule ^ index.php [L]


The oft-quoted replacement code looks something like this:

 # Pass all requests not referring directly to files in the filesystem to
# index.php. Clean URLs are handled in drupal_environment_initialize().
RewriteCond $1 !(^index\.php|\.(gif|jpe?g|png|ico|css|js))$
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php [L]


The new code is parsed many times quicker.

--

Finally, the GZIP code:

# Rules to correctly serve gzip compressed CSS and JS files.
# Requires both mod_rewrite and mod_headers to be enabled.
<IfModule mod_headers.c>
# Serve gzip compressed CSS files if they exist and the client accepts gzip.
RewriteCond %{HTTP:Accept-encoding} gzip
RewriteCond %{REQUEST_FILENAME}\.gz -s
RewriteRule ^(.*)\.css $1\.css\.gz [QSA]

# Serve gzip compressed JS files if they exist and the client accepts gzip.
RewriteCond %{HTTP:Accept-encoding} gzip
RewriteCond %{REQUEST_FILENAME}\.gz -s
RewriteRule ^(.*)\.js $1\.js\.gz [QSA]

# Serve correct content types, and prevent mod_deflate double gzip.
RewriteRule \.css\.gz$ - [T=text/css,E=no-gzip:1]
RewriteRule \.js\.gz$ - [T=text/javascript,E=no-gzip:1]

<FilesMatch "(\.js\.gz|\.css\.gz)$">
# Serve correct encoding type.
Header append Content-Encoding gzip
# Force proxies to cache gzipped & non-gzipped css/js files separately.
Header append Vary Accept-Encoding
</FilesMatch>
</IfModule>
</IfModule>


The obvious issues are the missing [L] flags and unnecessary [QSA] flags (appending the query string is the default action). The (.*) part of the pattern is suboptimal.

Additionally, using the $1 backreference in the way it is used above opens the server to an obscure type of attack. A leading slash at the beginning of the rewrite target filepath would fix that, but that would in turn cause the RewriteBase directive to be ignored, which is why RewriteBase shouldn't be used for this purpose in the first place.

Correcting this would therefore require the user to manually add the value of the Drupal installation path to the .htaccess code rather than just uncommenting the correct RewriteBase directive. So, there's a decision to be made: better security vs. ease of installation.
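
To make the trade-off concrete, here is a minimal sketch of the CSS rule with the installation path written into the substitution. The "/drupal/" path is a hypothetical example, and the sketch assumes this .htaccess file sits in that directory:

```apache
# Assumption: Drupal is installed at /drupal/ under the document root.
# The substitution starts with a slash and names the install path
# explicitly, so no RewriteBase directive is involved.
RewriteCond %{HTTP:Accept-Encoding} gzip
RewriteCond %{REQUEST_FILENAME}.gz -s
RewriteRule ^(.+)\.css$ /drupal/$1.css.gz [T=text/css,E=no-gzip:1,L]
```

For a site installed in the document root, the substitution would simply be /$1.css.gz.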


There's plenty that can be optimised in the .htaccess file.

I'd guess that jdMorgan can find much more to improve than my basic initial analysis suggests.

[edited by: jdMorgan at 6:06 pm (utc) on Jan 7, 2011]
[edit reason] Corrected code as noted below. [/edit]

g1smd

10:39 pm on Oct 16, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"The new code is parsed many times quicker."

I ran out of time to post. I was also going to mention...

The new code is parsed many times quicker. It eliminates the multiple disk reads for every URL request arriving at the server. Instead it looks to the disk to find out if the file or folder actually exists only when the extension indicates that it isn't a request for an image, or for a stylesheet or script file. Similar exclusions have also had to be added in Wordpress and in other blog, CMS, cart and forum packages. They substantially speed up the page load time, as well as staving off an early server upgrade on medium traffic sites.

jdMorgan

1:14 am on Oct 17, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, I'll take the gzip code...

Reduce the original four rules (all of which must be parsed, and for css or js requests, at least two must be executed) requiring eight directives, down to one rule with five directives:

# Rules to serve gzip-compressed CSS and JS files.
# Requires both mod_rewrite and mod_headers to be enabled.
#
# If the client accepts gzip and compressed CSS and JS files exist, serve them with
# proper Content-Type headers and set the no-gzip variable to prevent double-encoding
RewriteCond %{HTTP:Accept-Encoding} gzip
RewriteCond %{REQUEST_FILENAME}.gz -s
RewriteCond $2->text/css ^css->(.+)$ [OR]
RewriteCond $2->text/javascript ^js->(.+)$
RewriteRule ^(.+\.(css|js))$ /$1.gz [T=%1,E=no-gzip:1,L]

And that <FilesMatch> on the Content-Encoding section could also be shortened...

<FilesMatch "\.(css|js)\.gz$">
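
Put back in context, the shortened container would wrap the two Header directives from the original file unchanged. This is only a reassembly of the pieces already posted in this thread, not separately tested code:

```apache
<IfModule mod_headers.c>
  <FilesMatch "\.(css|js)\.gz$">
    # Serve correct encoding type.
    Header append Content-Encoding gzip
    # Force proxies to cache gzipped & non-gzipped css/js files separately.
    Header append Vary Accept-Encoding
  </FilesMatch>
</IfModule>
```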

Jim

rrforeman

1:22 pm on Oct 17, 2010 (gmt 0)

10+ Year Member



Thank you very much for this.. quite generous of you.

Can I ask a simple question about .htaccess (I am moderately noob at this).. is this file you posted a primary .htaccess file that goes in the root of the website, or would this file go in each directory of your site? Thanks.

jdMorgan

5:13 pm on Oct 17, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi rrforeman,

It would go wherever Drupal says it should go... Please note that this is a "theoretical" discussion at this point, and there is likely some more work to do on this code. Then we'll need experienced volunteers to test the resulting modified code on real Drupal sites, and report back with their problems and detailed analysis.

If, as you state, you're not very experienced with .htaccess files, I'd advise holding off until the post-testing phase of this "project."

Jim

wildbest

8:57 pm on Oct 17, 2010 (gmt 0)

10+ Year Member



The obvious issues are the missing [L] flags and unnecessary [QSA] flags (appending the query string is the default action). The (.*) part of the pattern is non-optimum.


About the unnecessary [QSA] flags.

Sometimes there are complex query strings that include fully qualified domain names plus query strings. Something like this:
http://example.com?ref=http://example2.com?q=123

And this needs to be redirected to:
http://example3.com?ref=http://example2.com?q=123

If we are redirecting to another domain name, it looks like the default action is to append the query string from the last "?", not the first one. What is the correct way of handling such complex query strings so that "ref=http://example2.com?q=123" is appended, not only "q=123"?

[edited by: bill at 4:09 am (utc) on Nov 26, 2010]
[edit reason] unlinked example [/edit]

jdMorgan

9:43 pm on Oct 17, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The best way is to not include illegal characters in your query strings. The question mark is reserved as the delimiter between the URL-path and the query string, and including more than one of them is an error that can lead to the trouble that you describe. Remember that Apache is intended to be small and fast above all else, and it will not "gently hold your hand and explain errors to you" the way that some high-level programming and scripting languages do. If you make a syntax error, Apache just takes its best guess at what you meant, serves the result, and then goes on to the next HTTP request... quickly.

So refer to the HTTP/1.1 specification, and percent-encode all characters which are designated as "reserved" or "unwise" for use in query strings.

So your above URL+query could be linked and transmitted as
http://example.com?ref=http%3A%2F%2Fexample2%2Ecom%3Fq%3D123
for example. While I may have encoded some characters here that aren't required to be encoded, this is not harmful -- The reverse situation of not encoding characters which should be encoded is what causes trouble.

Obviously, the script that is to receive this query string must decode the encoded characters, but this is trivial; the decoding function is built-in to many scripts and practically all scripting languages (e.g. the "URLdecode();" function).

However, none of this affects the fact that if a question mark does not appear in the RewriteRule's substitution path, then [QSA] is redundant. The [QSA] flag is only needed when it is desired to *append* additional name/value pairs to an existing query string: [QSA] is "Query String Append".
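
The three cases described above can be sketched as follows. The paths and the rules themselves are hypothetical illustrations, not part of the Drupal file, and each rule stands alone:

```apache
# 1. No "?" in the substitution: the request's query string is passed
#    through unchanged, so adding [QSA] here would change nothing.
RewriteRule ^old-page$ new-page [L]

# 2. A "?" in the substitution and no [QSA]: the original query string is
#    discarded and replaced by the one written in the rule.
RewriteRule ^node/(.*)$ index.php?q=node/$1 [L]

# 3. A "?" in the substitution plus [QSA]: the original query string is
#    appended to the new one, so /admin/reports/dblog?page=2 becomes
#    index.php?q=admin/reports/dblog&page=2
RewriteRule ^admin/(.*)$ index.php?q=admin/$1 [QSA,L]
```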

For now, I'd like to reserve this thread for contributions to improve the Drupal code posted above. We can discuss the side-topics later. Thanks!

Jim

sublime1

5:52 pm on Oct 21, 2010 (gmt 0)

10+ Year Member



g1smd, Jim --

Thanks! This is a muchly needed post -- well done indeed! I nearly barfed when I first looked at the .htaccess code that came with Drupal, and we have had some struggles with the very issues you point out and correct.

I would like to add one other point, and if I get some time, I'll see if I can do the work to do the implementation.

While most people have their Drupal instances on shared hosts and don't have access to the server-level configuration or VirtualHost config, those who do should strongly consider moving most or all of these directives out of .htaccess and into the server config instead. Unfortunately there are some minor (yet critical) differences in the way things like RewriteRule work in the two contexts, so it's not necessarily a trivial exercise.

On a site with any traffic to speak of, this kind of processing, needed on every single request, has to be pretty expensive. The sheer size of the .htaccess file alone makes for a lot of work. VirtualHost or server config processing happens once when the server starts, so there's a potentially significant benefit to moving stuff out of .htaccess, perhaps all of it.

The only thing to remember is that when you update Drupal core, you'll get a new default .htaccess file, so you have to remember to rename it.

One other point worth making is that various modules can make the rules not hit when they should. There are numerous really poorly written and buggy Drupal contrib modules out there that fail to do the whole job (yet are widely deployed).

For example if you enable some of the optimizations, CSS and JS files are minified and given a unique hash on their query string -- patterns that gzip based on filename can get confused. Jim's pattern recognition based on explicitly specified mime-types may be better ... assuming your modules properly specify the type in their response headers. I know that one (or more) of the modules we're using in a site I work on fail miserably in ways like this -- when I finally get a chance to figure out which one is which, I'll update the thread.

So the message here: if you can, move as many of the rules here into Virtual Host or Server config files, and test, test, test to make sure all the things you are doing are working.

Even if you cannot, I have found that the Google PageSpeed Firefox Plugin is invaluable for isolating and fixing performance problems.
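
A minimal sketch of the move out of .htaccess described above, assuming you control the VirtualHost definition. The ServerName and filesystem paths are placeholders, access-control lines are omitted, and the rewrite rule is the corrected one posted earlier in this thread. Keeping the rules inside a <Directory> section means mod_rewrite still runs in per-directory context, so the patterns need no changes; rules placed directly at VirtualHost level would instead see URL-paths with a leading slash.

```apache
# Sketch only: ServerName and paths are placeholders.
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/drupal

    <Directory /var/www/drupal>
        # Stop Apache from looking for and parsing .htaccess files
        # on every request.
        AllowOverride None
        Options -Indexes +FollowSymLinks

        # Per-directory context: patterns are written exactly as they
        # would be in .htaccess (no leading slash).
        RewriteEngine on
        RewriteCond $1 !(^index\.php|\.(gif|jpe?g|png|ico|css|js))$
        RewriteCond %{REQUEST_FILENAME} !-f
        RewriteCond %{REQUEST_FILENAME} !-d
        RewriteRule ^(.*)$ index.php?q=$1 [QSA,L]
    </Directory>
</VirtualHost>
```

With AllowOverride None in place, any .htaccess file in that directory is ignored, so the rest of the directives from the default file would need to be copied into the <Directory> section as well.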

Tom

sublime1

5:58 pm on Oct 21, 2010 (gmt 0)

10+ Year Member



And here's a link with a discussion of moving Drupal .htaccess to VirtualHost, including someone who did a performance test. While the test results are not necessarily conclusive, the outcome was a somewhat modest performance improvement, as I read them.

Well worth a thorough read, as there are some good points on the general topic.

[groups.drupal.org...]

jdMorgan

10:12 pm on Oct 21, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The general guidance is that *if* you can put your code in the server config files, then do so.

Server config code is interpreted into machine-executable form once at server start-up, while code in .htaccess is interpreted into machine-executable form for every single HTTP request to the machine. Therefore, performance is much better when the code is located in a config file.

Note that the code I posted above is not based on the MIME-type. Rather, it implements a RewriteCond-based "lookup table" to *specify* the correct MIME-type using the "T=<type>" flag in the RewriteRule.

Jim

jdMorgan

1:32 pm on Dec 7, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Another point here is that the difference between putting the code in a config file versus .htaccess will be reduced by the optimizations posted above. So, if you *cannot* put the Drupal mod_rewrite code into a config file, then these changes are even more likely to improve your server performance.

Jim

pendaco

3:05 pm on Jan 6, 2011 (gmt 0)

10+ Year Member



Great tips so far, but the code below doesn't seem to work properly?

The oft-quoted replacement code looks something like this:

# Pass all requests not referring directly to files in the filesystem to
# index.php. Clean URLs are handled in drupal_environment_initialize().
RewriteCond $1 !((\.(gif|jpe?g|png|ico|css|js)|^index\.php)$
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^ index.php [L]


The new code is parsed many times quicker.


Isn't there 1 bracket too many in
$1 !((\.(gif|jpe?g|png|ico|css|js)|^index\.php)$
? The current change throws me a 500 internal error. Unfortunately, if I remove the 1st/2nd bracket that condition doesn't seem to work either (seems to ignore it or something).

That last RewriteRule should also be
RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]
according to the latest .htaccess from Drupal.


jdMorgan did it slightly differently (swapping the 'if-not') for Wordpress [webmasterworld.com], and that one worked without any problems.

jdMorgan

11:08 pm on Jan 6, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, there was an extra "(" and the $1 back-reference was undefined.

I will correct the code posted above to prevent further propagation.

Jim

pendaco

8:07 am on Jan 7, 2011 (gmt 0)

10+ Year Member



Thnx a bunch for the corrections jdMorgan/Jim.
Only the last line still differs, it should be;

RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]


It's probably because of that difference that it doesn't seem to work for me yet as intended (it still seems to pass the excluded files through index.php).


What I do know/experienced is that the QSA flag is needed in the back-end (and probably front-end depending on some modules), mainly for pages with pagination like;

www.website.com/admin/reports/dblog?page=2

pendaco

8:31 am on Jan 17, 2011 (gmt 0)

10+ Year Member



Nobody?

I can't get that rewriterule to work with;
RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]

jdMorgan

10:10 pm on Jan 19, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Being very clear can save time. Is this the exact code for the rule you're using in your .htaccess file, and if so, what exactly "does not work" about it?

# If the request does not resolve to a filetype not generated by Drupal itself and
# if the request does not resolve to a physically-existing file or directory, then
# pass the request to Drupal's index.php script, appending the requested URL-path
# as the "q=" parameter value to any query string already present in the request.
RewriteCond $1 !(^index\.php|\.(gif|jpe?g|png|ico|css|js))$
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?q=$1 [QSA,L]

Jim

pendaco

9:46 am on Jan 20, 2011 (gmt 0)

10+ Year Member



Yep, should have posted my complete code, sorry for that..

It's exactly the same as the code you posted above except for 1 small change in the 1st line;
RewriteCond $1 !(^index\.php$|\.(gif|jpe?g|png|ico|css|js))$
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?q=$1 [QSA,L]


For some reason I got that dollar sign in there behind the index.php, thought I copy-pasted it from the edited post above, not sure why else it would be there..

Anyhow, there was an additional mix-up in my server/drupal config which led me to believe it wasn't working;

- We run ispconfig which uses error files from /error/*.html (the 404.html file was missing..)
- We also use an additional Drupal module (Custom Error) to handle the basic 404/403 errors

While the 403 error in the module was configured properly it was redirecting 404 errors to the home domain (index.php).

So, /error/404.html wasn't found > triggered its own 404 error > picked up by the Custom Error module > redirected to index.php > caused a hell of a confusion and some additional grey hairs ;-)


It's working now as it should, so thnx for all help + additional fixes!