HELP! Remove /index.html from paths in .htaccess

KallenWeb

5:30 am on Mar 10, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



Per this thread:
[webmasterworld.com...]

We have been using this code on many, many websites and it has been working fine:
# Redirect index in any directory to root of that directory
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.[^\ ]*\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(php|html?)$ http://www.example.com/$1? [R=301,L]

As of today it stopped working!

Redirect checker shows it is adding something to the end, resulting in a 404. For example:

https://www.example.com/kalamazoo-web-site-packages-pricing/index.htm
301 Moved Permanently
https://www.example.com/kalamazoo-web-site-packages-pricing/%3f (which results in a 404).

Help!

lucy24

5:53 am on Mar 10, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Gosh, that's an old thread.

%3f = question mark

... which is exactly what you see at the end of the target, presumably so you can concurrently strip any queries. (But why? Are you actually getting requests with spurious queries? Legitimate requests, that is; malign robots with queries should be handled in other ways.) So the output, the redirect target, is being put through something that percent-escapes its result--and this cannot be explained using the information given. You could remove the ? and instead use the flag QSD (assuming Apache 2.4 or later), though this doesn't explain why the ? is being escaped in the first place.
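
To make the difference concrete, here is a minimal Python sketch (not part of the original posts) comparing a target that ends in a bare ? with one that ends in %3f. The directory name some-dir is hypothetical, and Python's urllib is only a stand-in for how a client would read the two URLs; it says nothing about what the host actually changed.

```python
from urllib.parse import urlsplit, unquote

# Hypothetical directory name, used only for illustration.
intended = "https://www.example.com/some-dir/?"
observed = "https://www.example.com/some-dir/%3f"

for url in (intended, observed):
    parts = urlsplit(url)
    print(f"{url}: path={parts.path!r} query={parts.query!r}")

# A bare trailing "?" is a delimiter: the path ends before it and the query is empty.
intended_path = urlsplit(intended).path
intended_query = urlsplit(intended).query

# An escaped "%3f" is data: it stays inside the path.
observed_path = urlsplit(observed).path

# Decoding the escaped path shows the literal question mark the server looks for.
decoded_path = unquote(observed_path)
```

Once the ? is escaped, the server is asked for a resource whose name literally ends in a question mark, which would explain the 404.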

Various tangential stuff:

Do you really have visible URLs in .php AND .htm AND .html? If not, that's putting the server to extra work parsing Regular Expressions that it will never really need. Legitimate visitors such as search engines won't ask for index.whatever by name unless that has at some time in the past been part of a visible URL that they periodically re-visit. Same, of course, for humans: did your URLs formerly end in explicit index.whatever, which might still exist in links or bookmarks?

Alternate method (setting aside the ? issue):
RewriteCond %{REQUEST_URI} ^/((?:\w+/)*)index\.html
RewriteRule index\.html$ https://example.com/%1 [R=301,NS,L]
Doing it this way saves having to evaluate conditions every single time, and saves making a capture that 99 times out of 100 will end up being thrown away unused. The [NS] flag prevents the rule from firing on internal subrequests, specifically the ones made by mod_dir, so it obviates the need for a Condition involving THE_REQUEST.

phranque

6:15 am on Mar 10, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



"So the output, the redirect target, is being put through something that percent-escapes its result--and this cannot be explained using the information given."

fwiw i've never noticed this behavior elsewhere.
i also don't remember this problem being described in this forum previously.

KallenWeb

6:26 am on Mar 10, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



Yes. Bluehost has changed something. I have no idea what. It has worked fine for years. So frustrating! The .htaccess code is still working fine on our sites with other hosting, such as godaddy. Thanks!

KallenWeb

9:46 am on Mar 10, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



I tried this two ways but it does not seem to eliminate the /index.html at the end.

RewriteCond %{REQUEST_URI} ^/((?:\w+/)*)index\.html
RewriteRule index\.html$ https://www.example.com/%1 [R=301,NS,L]

and

RewriteCond %{REQUEST_URI} ^/((?:\w+/)*)index\.html
RewriteRule index\.html$ https://example.com/%1 [R=301,NS,L]
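
One possible reason the condition never matches, not confirmed anywhere in this thread: \w covers only letters, digits and the underscore, so (?:\w+/)* cannot step over a directory name that contains hyphens. The sketch below uses Python's re module as a stand-in for Apache's regex engine (an assumption that should hold for patterns this simple). The [\w-]+ variant and the /about/ path are hypothetical; the hyphenated directory is the one from the first post, with index.html as the filename.

```python
import re

# The condition as posted in the thread.
posted = re.compile(r"^/((?:\w+/)*)index\.html")

# Hypothetical variant that also accepts hyphens in directory names.
hyphen_ok = re.compile(r"^/((?:[\w-]+/)*)index\.html")


def captured_dir(pattern, request_uri):
    """Return what %1 would hold, or None if the condition does not match."""
    match = pattern.search(request_uri)
    return match.group(1) if match else None


plain = "/about/index.html"  # hypothetical path without hyphens
hyphenated = "/kalamazoo-web-site-packages-pricing/index.html"

print(captured_dir(posted, plain))
print(captured_dir(posted, hyphenated))
print(captured_dir(hyphen_ok, hyphenated))
```

If this is the cause, the rule would still work for directories named with plain words, which could make the failure look intermittent.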

lucy24

6:13 pm on Mar 10, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"I tried this two ways"
Careful! If you redirect to a non-canonical form of your hostname, then when the internal request meets your canonicalization redirect, the whole thing including “index.html” will get externally redirected. Based on OP, the site uses the with-www form.

It can't hurt to add the [NE] flag, though ordinarily it shouldn't be needed. Try it. Also try removing the final ? and instead using the [QSD] flag, assuming you're on an Apache version that supports it.

It really does sound as if the host has done something wonky. What is your current RewriteOptions setting? The only thing I can think of is that the main config file has added one or more RewriteRules, and something you don't know about is getting inherited down to your local htaccess. If the Apache version is 2.4.8 or later, see if anything changes if you add
RewriteOptions IgnoreInherit
somewhere in your htaccess. (It can go anywhere, but immediately before the mod_rewrite section makes most sense.)

Finally, has anything changed in the way the host handles https?

KallenWeb

9:09 am on Mar 13, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



I'm not very knowledgeable about this. Can you suggest changes to this that would work?

RewriteCond %{REQUEST_URI} ^/((?:\w+/)*)index\.html
RewriteRule index\.html$ https://www.example.com/%1 [R=301,NS,L]

or

RewriteCond %{REQUEST_URI} ^/((?:\w+/)*)index\.html
RewriteRule index\.html$ https://example.com/%1 [R=301,NS,L]

lucy24

5:53 pm on Mar 13, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The only difference between the two versions is that one redirects to with-www while the other redirects to without-www. Only you know which of the two is correct for your site.

Sgt_Kickaxe

10:46 pm on Mar 13, 2023 (gmt 0)



"Yes. Bluehost has changed something. I have no idea what."

Check to see if they added "extra security" to your hosting account, and if so, read the fine print. Several popular hosts are providing that protection on your behalf by passing it through Cloudflare, for free, of course. It might be breaking things and/or causing visitors to bounce instead of "waiting to have their browser verified".

dolcevita

12:15 pm on Mar 22, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Did you solve this issue? I found out today that I have the same issue. Not sure when it started; I'm hosted by A2 Hosting and use Cloudflare.
[webmasterworld.com...]

KallenWeb

3:52 pm on Mar 22, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



I have tentatively resolved it but see the next message.

[edited by: KallenWeb at 4:08 pm (utc) on Mar 22, 2023]

KallenWeb

4:05 pm on Mar 22, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



@lucy24, When I removed the ? here, it did start working again.

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.[^\ ]*\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(php|html?)$ http://www.example.com/$1? <-- [R=301,L]

Will removing that ? cause any potential issues?

I'm not apache/.htaccess knowledgeable enough to know. I'm scared to put it back on 50 websites to find out I caused other problems.

Oh, and adding RewriteOptions IgnoreInherit did not resolve the issue (tried that first).

Thanks.

lucy24

5:26 pm on Mar 22, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Either in this thread or the adjoining one, I explained that there are two ways to remove the query string. One way is what you were doing at first: put a ? at the end of the target. The other way is to add QSD to the existing flags, so it becomes [R=301,L,QSD]. Apache doesn't care what order the flags are in.

If you're not currently getting requests with unwanted query strings, neither one is necessary.

KallenWeb

9:02 am on Mar 23, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



@Lucy24, Thank you for your patience. If I remove the ? (which makes the rule work normally again, removing index.html from the URL), should I add QSD as described? I'm not sure what an unwanted query string is, or what the ? or QSD actually does. Thanks.

lucy24

4:18 pm on Mar 23, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A query string is the part of the URL after the question mark, like
shirt.php?size=large&color=green&style=goofy

Some sites like to convert those into something friendlier-looking, like
/shirt/large/green/goofy/

But this doesn't seem to apply to your site, so you can safely disregard the whole issue, though the QSD flag will do no harm.

You may also see queries in requests from malign robots, but those don't matter because the request will--one hopes--be blocked anyway.
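
As a rough illustration of the shirt example above, this Python sketch splits a URL into path and query string, rebuilds the friendlier path, and shows the same URL with the query dropped, which is the effect a trailing ? or the QSD flag is meant to have on a redirect target. The www.example.com host is an assumption; only the shirt.php part comes from the post.

```python
from urllib.parse import urlsplit, parse_qs

# Host is assumed; the path and query string are the example from the post.
url = "https://www.example.com/shirt.php?size=large&color=green&style=goofy"

parts = urlsplit(url)

# parse_qs maps each parameter name to a list of its values.
params = parse_qs(parts.query)

# Rebuild the friendlier-looking path from the three parameters.
friendly = "/shirt/" + "/".join(params[key][0] for key in ("size", "color", "style")) + "/"

# The same URL with the query string discarded.
without_query = parts._replace(query="").geturl()

print(parts.path, parts.query)
print(friendly)
print(without_query)
```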

KallenWeb

12:23 am on Sep 18, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



I have a site that uses .htm rather than .html for the extensions. To make this work, can I just change php to htm in the rule below? Also, on the first line, is HTTP/ correct, or should it be HTTPS/, since my sites all have SSL enabled?

# Redirect index in any directory to root of that directory
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.[^\ ]*\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(php|html?)$ https://www.example.com/$1 [R=301,L]

to

# Redirect index in any directory to root of that directory
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.[^\ ]*\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(htm|html?)$ https://www.example.com/$1 [R=301,L]



[edited by: not2easy at 3:08 am (utc) on Sep 18, 2023]
[edit reason] please use example.com for readability [/edit]

lucy24

1:02 am on Sep 18, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"on the first line, is HTTP/ correct"
Yes, because this isn't about http/https: the HTTP/ in THE_REQUEST is the protocol version at the end of the request line, not the URL scheme. Glance at your logs and you'll see that it always says HTTP/ followed by a number: currently 1.0 (only used by elderly robots), 1.1 or 2.0 (most humans now).
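
A small Python sketch of that point, running the THE_REQUEST pattern from the rule against some made-up request lines. Python's re stands in for Apache's regex engine here, and the paths are hypothetical. Note that nothing in the request line says whether the connection was http or https.

```python
import re

# The THE_REQUEST pattern exactly as used in the rule.
the_request_cond = re.compile(r"^[A-Z]{3,9}\ /([^/]+/)*index\.[^\ ]*\ HTTP/")

# Made-up request lines in the form: METHOD, path, protocol version.
request_lines = [
    "GET /index.html HTTP/1.1",
    "GET /some-dir/index.htm HTTP/2.0",
    "HEAD /a/b/index.php?x=1 HTTP/1.0",
    "GET /some-dir/ HTTP/1.1",
]

# True where the condition would let the redirect rule run.
results = {line: bool(the_request_cond.search(line)) for line in request_lines}

for line, matched in results.items():
    print(f"{line!r}: {matched}")
```

The last line does not match because the visitor asked for the directory itself, so the redirect only runs when index.something appears in the original request.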

(htm|html?) is redundant, since html? with question mark encompasses both. If your sites tend to use both forms, it may be easier to just say html? everywhere, and then you don't have to change the rules. Or was that a typo for (php|html?) ?

The RewriteCond in THE_REQUEST is common in index redirects, but let me suggest an alternative. Replace “html” in both places with whatever fits your site.
RewriteCond %{REQUEST_URI} ^/((?:\w+/)*)index\.html
RewriteRule index\.html$ https://example.com/%1 [R=301,NS,L]
The reasoning behind this version is that 99 requests out of 100 will not involve “index.xtn”--in fact, apart from malign robots, I don’t see it at all except on one older site that used to have visible URLs in “index.html”.* The act of capturing, in and of itself, creates a teeny bit of work for the server, so it makes sense to defer it for those times when the capture will actually be needed. Hence the Condition and the %1. Note the [NS] flag. This is crucial, because it prevents the rule from firing when "index.xtn" is an internal subrequest, generally from mod_dir.

Caution: You cannot do this in a CMS index.php rewrite, because an internal rewrite made by mod_rewrite doesn't count as an internal subrequest. There you need to use THE_REQUEST, but you can still capture from the Condition.


* Also when I'm using HTML View in Fetch, since this starts with clicking on a physical file that is generally named index.html.

KallenWeb

5:10 pm on Sep 26, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month




"(htm|html?) is redundant, since html? with question mark encompasses both. If your sites tend to use both forms, it may be easier to just say html? everywhere, and then you don't have to change the rules. Or was that a typo for (php|html?) ?"

- No, htm|html? is what I meant to type. But if html? covers both, I assume I can just use (html?) or even get rid of the brackets.

"The RewriteCond in THE_REQUEST is common in index redirects, but let me suggest an alternative. Replace “html” in both places with whatever fits your site."

RewriteCond %{REQUEST_URI} ^/((?:\w+/)*)index\.html
RewriteRule index\.html$ https://example.com/%1 [R=301,NS,L]

The downside to this is we have some sites that use .htm in some places and .html in other places. Sloppy I know, but it still happens.

To account for that, could you suggest a modification to this that would work for both? Would it be something like this since ? makes it work for htm and html?

RewriteCond %{REQUEST_URI} ^/((?:\w+/)*)index\.html?
RewriteRule index\.html?$ https://example.com/%1 [R=301,NS,L]

lucy24

7:07 pm on Sep 26, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, html?$ would cover both htm and html. With the ending anchor $ you have excluded wonky alternatives like “htmx” or “htmlp” and lots of other things that probably don't exist. So unless your sites have weird URLs with .html in the middle and then more stuff after the html (query strings don't count), you don't even need the closing anchor, either in the main rule or its condition.
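
For anyone who wants to check this, a short Python sketch comparing the anchored and unanchored forms, plus the (htm|html?) version from earlier in the thread. Python's re is used as a stand-in for Apache's regex engine, and the filenames are made up for illustration.

```python
import re

anchored = re.compile(r"index\.html?$")          # with the closing anchor
unanchored = re.compile(r"index\.html?")         # closing anchor removed
redundant = re.compile(r"index\.(htm|html?)$")   # the earlier (htm|html?) form

# Made-up filenames, including the wonky alternatives mentioned above.
names = ["index.htm", "index.html", "index.htmx", "index.htmlp", "index.php"]

anchored_hits = [n for n in names if anchored.search(n)]
unanchored_hits = [n for n in names if unanchored.search(n)]
redundant_hits = [n for n in names if redundant.search(n)]

print("anchored:  ", anchored_hits)
print("unanchored:", unanchored_hits)
print("redundant: ", redundant_hits)
```

The anchored and (htm|html?) forms accept exactly the same names, so the extra alternative adds nothing. The unanchored form only differs if names like index.htmx actually exist on the site.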

Psst: You can type in [ quote ] and [/ quote ] manually (omitting the spaces).