Forum Moderators: phranque

Cause a 410 when any query parameters used

         

jehoshua

7:45 am on Feb 7, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



One of the crawlers on our site keeps requesting URIs that no longer exist. The site previously ran WordPress. Some example log entries:

GET /index.php?xml_sitemap=params=pt-post-2014-10
GET /index.php?xml_sitemap=params=pt-post-2013-11
GET /?p=35
GET /?m=201312
GET /?page_id=2
GET /?paged=5&cat=1


I need to return a 410 whenever any URI contains a question mark. The current site does not use any queries or parameters at all. This solution at [stackoverflow.com] seems to be what is needed?

# Make sure an error page is defined for 410
ErrorDocument 410 /path/to/custom/410.html

# For any non-empty query string
RewriteCond %{QUERY_STRING} ^.
# Match any path and send a 410 response
# The [L] prevents further matches from being executed
RewriteRule ^ - [R=410,L]


Will that suffice please ?

w3dk

9:46 am on Feb 7, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Yes, that will do what you require.

The regex "^." (in the RewriteCond directive) could be simplified to just "." (a single dot).

You don't necessarily need a custom 410 ErrorDocument, if these URLs are just being requested by a "crawler".

Note that this needs to go near the top of the .htaccess file, before any "front-controller" section. Order matters.
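
As a rough sketch of that placement: the front-controller block here is a made-up example (routing to a hypothetical index.php), not something taken from the site in question.

```apache
RewriteEngine On

# 1. The 410 rule goes first: any non-empty query string is "Gone"
RewriteCond %{QUERY_STRING} .
RewriteRule ^ - [R=410,L]

# 2. Hypothetical front controller, placed afterwards: anything that is
#    not an existing file or directory is handed to index.php
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^ index.php [L]
```

If the site has no front controller at all, only the first block is relevant.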

lucy24

5:04 pm on Feb 7, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You don't necessarily need a custom 410 ErrorDocument, if these URLs are just being requested by a "crawler".
I always recommend having a 410 document, because the Apache default 410 message is pretty scary--not something you'd want to throw at an unsuspecting human using an old bookmark. But it doesn't necessarily have to be a separate document; on many sites you can say something like
ErrorDocument 410 /404.html
using the same physical file as you use for 404s.

Order matters.
Within any given module, that is. My general arrangement within mod_rewrite is

-- requests that should stop right here with no further handling, such as for robots.txt or error documents ([L] flag alone)
-- requests with 403 response (also some manual 404 or 302 if they're functioning as access control)
-- requests with 410 response
-- requests with 301 response, from most specific to most general
-- requests with internal rewrite alone ([L] flag)
-- requests with no [L] flag (rare, for example setting a cookie)
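
A skeleton of that arrangement might look something like the following. Every path, pattern and cookie name here is hypothetical and only there to show the ordering, not a recommendation for any particular site.

```apache
RewriteEngine On

# Stop right here, no further handling ([L] alone)
RewriteRule ^robots\.txt$ - [L]
RewriteRule ^(404|410)\.html$ - [L]

# 403 responses
RewriteRule ^private/ - [F]

# 410 responses
RewriteRule ^old-section/ - [G]

# 301 responses, most specific to most general
RewriteRule ^blog/2014/old-post\.html$ https://example.com/blog/new-post.html [R=301,L]
RewriteRule ^blog/2014/ https://example.com/blog/ [R=301,L]

# Internal rewrite alone
RewriteRule ^shop/?$ /shop/index.html [L]

# No [L] flag (rare), for example setting a cookie
RewriteRule ^ - [CO=seen:1:example.com]
```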

jehoshua

9:11 pm on Feb 7, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



The regex "^." (in the RewriteCond directive) could be simplified to just "." (a single dot).


What will be the difference between this

RewriteCond %{QUERY_STRING} ^.


and this then ?

RewriteCond %{QUERY_STRING} .


I always recommend having a 410 document, because the Apache default 410 message is pretty scary--not something you'd want to throw at an unsuspecting human using an old bookmark


Yes, I agree. Although the subject is a 410 for crawlers, we must not forget that humans can use the old URIs as well.

Within any given module, that is. My general arrangement within mod_rewrite is

-- requests that should stop right here with no further handling, such as for robots.txt or error documents ([L] flag alone)
-- requests with 403 response (also some manual 404 or 302 if they're functioning as access control)
-- requests with 410 response
-- requests with 301 response, from most specific to most general
-- requests with internal rewrite alone ([L] flag)
-- requests with no [L] flag (rare, for example setting a cookie)


Thank you, great to outline the order needed.

phranque

9:38 pm on Feb 7, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



What will be the difference between this

RewriteCond %{QUERY_STRING} ^.


and this then ?

RewriteCond %{QUERY_STRING} .

the first pattern matches any string that starts with at least one character (of any kind) and the second pattern matches any string that contains at least one character anywhere.
i.e., they both match any string except a null string.
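
To make that concrete, here is a small illustration. The sample query strings in the comments are made up; the rule itself is the one from the first post.

```apache
# Query string      ^.          .
# p=35              match       match
# fbclid=abc        match       match
# (empty)           no match    no match
#
# Both patterns behave the same here, so only one condition is needed.
RewriteCond %{QUERY_STRING} .
RewriteRule ^ - [R=410,L]
```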

jehoshua

10:16 pm on Feb 7, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



i.e., they both match any string except a null string


Okay thanks. I uploaded the new .htaccess and it works fine, and yes, it doesn't work for

http://example.com/?


This is the modified .htaccess:

Options +ExecCGI +FollowSymLinks
AddHandler cgi-script .pl

# Make sure an error page is defined for 410
ErrorDocument 410 /410.shtml

RewriteEngine on

# For any non-empty query string
RewriteCond %{QUERY_STRING} ^.
# Match any path and send a 410 response
# The [L] prevents further matches from being executed
RewriteRule ^ - [R=410,L]


Possibly the ErrorDocument line should be after the mod_rewrite ?

w3dk

10:47 pm on Feb 7, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Possibly the ErrorDocument line should be after the mod_rewrite ?


From a syntactical point of view it doesn't matter. However, for readability, the ErrorDocument directives should be defined first (as you have done).

The ErrorDocument directive is a "core" directive. Placing it before or after the mod_rewrite directives does not change the order in which the directives are processed.

Order matters within a particular module (as lucy24 pointed out above). So these mod_rewrite directives should go before other mod_rewrite directives (e.g. other redirects and rewrites).

[edited by: w3dk at 10:58 pm (utc) on Feb 7, 2021]

lucy24

10:55 pm on Feb 7, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Edit: OK, that's two days in a row I have overlapped w3dk. :(
I uploaded the new .htaccess and it works fine, and yes, it doesn't work for
http://example.com/?
Right, because the query string has no content and is therefore the same as if there were no query string. (Hence the “non-empty query string” comment.) Does the site actually receive requests in this form? If not, it's a non-issue. If you do get a lot of requests with null query, you would have to change the Condition to
RewriteCond %{THE_REQUEST} \?
meaning “The request contains a literal question mark”.
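
Put together with the 410 rule used elsewhere in this thread, that variant would look roughly like this (a sketch, only needed if bare "?" requests actually show up in the logs):

```apache
# THE_REQUEST holds the full request line as sent by the client,
# e.g. "GET /? HTTP/1.1", so a trailing "?" with nothing after it
# is still visible here even though QUERY_STRING is empty.
RewriteCond %{THE_REQUEST} \?
RewriteRule ^ - [R=410,L]
```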

Possibly the ErrorDocument line should be after the mod_rewrite?
It makes no difference whatsoever. Each module is an island. RewriteRules are mod_rewrite; ErrorDocument directives are core. Personally I like to put ErrorDocument directives near the top of htaccess because it's just a few lines.

NickMNS

12:42 am on Feb 8, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is a bad idea IMO. Some common referrers, most notably Facebook, add query params to the URLs linked from within their service, so serving a 410 would send referral traffic from these sites to the 410 page instead of the desired page. FB appends a "fbclid" parameter to the links, and some Apple devices append params from Google search. If you see the same params being used, target those specifically instead of any param.
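
A minimal sketch of the targeted version, assuming the only unwanted parameters are the old WordPress ones shown in the log entries of the first post. Grouping them into a single pattern is an assumption, not something posted in this thread.

```apache
# 410 only when the query string contains one of the old WordPress
# parameters, at the start or after an "&". Other parameters, such as
# fbclid, are left alone.
RewriteCond %{QUERY_STRING} (^|&)(p|m|page_id|paged|cat|xml_sitemap)=
RewriteRule ^ - [R=410,L]
```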

lucy24

1:11 am on Feb 8, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Luckily we are already in mod_rewrite, where you can slather on RewriteCond to your heart's content. Matter of fact

:: shuffling papers ::

Yup, I've got one myself:
RewriteCond %{QUERY_STRING} .
RewriteCond %{QUERY_STRING} !^fbclid
RewriteRule (^|\.html|/)$ - [F]

When I go over my access logs, I look at any requests for the stylesheet that belongs specifically to error documents, because that lets me know when a human has been served a 403 or 404.

jehoshua

1:28 am on Feb 8, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



From a syntactical point of view it doesn't matter. However, for readability, the ErrorDocument directives should be defined first (as you have done).


Thanks, and I see "Lucy24" has a preference for the same, up the top. So be it.

Right, because the query string has no content and is therefore the same as if there were no query string. (Hence the “non-empty query string” comment.) Does the site actually receive requests in this form? If not, it's a non-issue


Okay thanks for the explanation. I don't think there are any requests of this nature, so it's a non issue.

This is a bad idea IMO, some common referrers, most notably Facebook, add query params to the URLs linked from within their service ..(snip)


I'm aware that Facebook appends its own tracking parameters; we get NONE of these. If we do, I'll use lucy24's approach and exempt those tracking IDs.

Yup, I've got one myself:

RewriteCond %{QUERY_STRING} .
RewriteCond %{QUERY_STRING} !^fbclid
RewriteRule (^|\.html|/)$ - [F]


When I go over my access logs, I look at any requests for the stylesheet that belongs specifically to error documents, because that lets me know when a human has been served a 403 or 404.


Thanks for that, I may need it someday. :)

jehoshua

12:08 am on Feb 9, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



The mod_rewrite rules are working fine, they are in http://example.com/public_html/ path

There is another path for testing purposes and it is of course throwing a 410 now. ..lol

Would the following line added up near the top suffice ?

RewriteRule ^(testpath)($|/) - [L]


where "testpath" is the path where I don't want the 410

jehoshua

2:13 am on Feb 9, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



This is the new .htaccess, and it now works in http://example.com and forces a 410, and if in /testpath, it exits the rewriting and allows the query parameters

Options +ExecCGI +FollowSymLinks
AddHandler cgi-script .pl

# Make sure an error page is defined for 410
ErrorDocument 410 /410.shtml

RewriteEngine on

# exit the rewriting process if query is in /testpath
RewriteCond %{REQUEST_URI} ^/testpath [NC]
RewriteRule .* - [L]

# For any non-empty query string
RewriteCond %{QUERY_STRING} ^.
# Match any path and send a 410 response
# The [L] prevents further matches from being executed
RewriteRule ^ - [R=410,L]

lucy24

2:54 am on Feb 9, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Would the following line added up near the top suffice ?

RewriteRule ^(testpath)($|/) - [L]
Yes. In fact you can simplify to
RewriteRule ^testpath - [L]
unless you happen to have a publicly accessible directory called, say, /testpathogen. You don't need the parentheses around (testpath), since you're not capturing.

Edit: There's definitely no reason for the version with the REQUEST_URI condition. Whenever possible, put patterns into the body of a RewriteRule rather than in a RewriteCond; otherwise the server has to stop and evaluate the condition on every request ever.

jehoshua

3:22 am on Feb 9, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



I tried this and it didn't work

RewriteRule ^testpath - [L]


and then I tried this and it didn't work

RewriteRule ^/testpath - [L]


yet this works ..

RewriteCond %{REQUEST_URI} ^/testpath [NC]
RewriteRule .* - [L]


where "testpath" is a path not a file

jehoshua

3:30 am on Feb 9, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



What if I created a .htaccess file in the path /testpath as follows ?

# disable mod_rewrite
RewriteEngine off


..later, ..yes that did it

w3dk

11:59 am on Feb 9, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



I tried this and it didn't work
> RewriteRule ^testpath - [L]

yet this works ..
> RewriteCond %{REQUEST_URI} ^/testpath [NC]
> RewriteRule .* - [L]


The only real difference with the first rule is the absence of the NC flag. Which would imply the request is not actually "testpath", but perhaps "TestPath" or some other mixed case variant?

Just to add, you could also have put the check (to exclude "/testpath") directly in the rule that triggers the 410. For example:


RewriteCond %{QUERY_STRING} .
RewriteRule !^testpath - [NC,G]


So the rule is only triggered when the URL-path does NOT (!) start with "testpath" (slash prefix intentionally omitted) - case-insensitive.

And... the "G" flag is shorthand for "R=410" and when specifying a non-3xx status you don't actually need the "L" flag, since it is implied.
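
On the omitted slash: in a .htaccess file, the RewriteRule pattern is matched against the URL-path with the directory prefix (including the leading slash) already removed, whereas REQUEST_URI keeps the slash. A short sketch, assuming the .htaccess file sits in the document root; the three variants are alternatives, not a set to be used together.

```apache
# For a request to /testpath/page.html the RewriteRule pattern is tested
# against "testpath/page.html" (no leading slash).

# Can match: pattern has no leading slash
RewriteRule ^testpath - [L]

# Never matches in .htaccess, shown commented out for comparison
# RewriteRule ^/testpath - [L]

# REQUEST_URI still starts with "/", so a condition needs the slash
RewriteCond %{REQUEST_URI} ^/testpath
RewriteRule ^ - [L]
```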

jehoshua

9:47 pm on Feb 9, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks for your reply @w3dk

Currently it all works fine: the /public_html path allows all valid URIs, and if there is a query, it throws a 410. In the 'testpath' folder, queries are permitted. So, what you are suggesting is to ..

1. Remove the .htaccess in the 'public_html/testpath' folder
2. Modify .htaccess in /public_html as follows ..

A. Here it is now

Options +ExecCGI +FollowSymLinks
AddHandler cgi-script .pl

# Make sure an error page is defined for 410
ErrorDocument 410 /410.shtml

RewriteEngine on

# For any non-empty query string
RewriteCond %{QUERY_STRING} ^.
# Match any path and send a 410 response
# The [L] prevents further matches from being executed
RewriteRule ^ - [R=410,L]


B. Add those 2 lines ..

Options +ExecCGI +FollowSymLinks
AddHandler cgi-script .pl

# Make sure an error page is defined for 410
ErrorDocument 410 /410.shtml

RewriteEngine on

# exclude /testpath from the next set of conditions
RewriteCond %{QUERY_STRING} .
RewriteRule !^testpath - [NC,G]

# For any non-empty query string
RewriteCond %{QUERY_STRING} ^.
# Match any path and send a 410 response
# The [L] prevents further matches from being executed
RewriteRule ^ - [R=410,L]


Is that correct ? PS - I don't know why the code I add is not properly formatting in this thread ?

lucy24

10:14 pm on Feb 9, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteCond %{QUERY_STRING} .
RewriteRule !^testpath - [NC,G]
The one problem with this construction is that the server then has to evaluate conditions on every request, not just pages. Oh, and do you really need [NC] here? If it's a directory that only you use, surely you can trust yourself not to type TESTPATH by mistake? The drawback to [NC] is that the server then has to do extra work: a case-insensitive match instead of a plain comparison. Those picoseconds add up.

I don't know why the code I add is not properly formatting in this thread ?
I think the [ code ] markup assumes some language or other, so you will see unexpected highlighting. Don't know what it thinks is so special about the numerical string "410", though :)

jehoshua

10:55 pm on Feb 9, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



If it's a directory that only you use, surely you can trust yourself not to type TESTPATH by mistake?


Yes, it is a directory that only I use, so as all (valid) queries are done in only that directory and only by myself, it is best to just leave things as they are, and have the .htaccess in the /testpath state
RewriteEngine off


I think the [ code ] markup assumes some language or other, so you will see unexpected highlighting. Don't know what it thinks is so special about the numerical string "410", though :)


Maybe the "gone" throws the markup ? :)

w3dk

12:02 am on Feb 10, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



So, what you are suggesting is to ..

1. Remove the .htaccess in the 'public_html/testpath' folder
2. Modify .htaccess in /public_html as follows ..


I wasn't really suggesting that (or at least, not intending to). Keeping the "/testpath/.htaccess" file with "RewriteEngine Off" is a good solution and is indeed preferred if you want to override everything in the parent config.

I was just "fixing" the code that you said was not working - an academic exercise.

Those two lines should not be "added", they replace the existing directives. They do the same thing - block all URLs that contain a query string, except for the "/testpath" directory.
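
For illustration, the root .htaccess with the replacement made would look roughly like this. It is assembled from the directives already posted in this thread and is only relevant if the "/testpath/.htaccess" approach is not used.

```apache
Options +ExecCGI +FollowSymLinks
AddHandler cgi-script .pl

# Make sure an error page is defined for 410
ErrorDocument 410 /410.shtml

RewriteEngine on

# 410 for any non-empty query string, except under /testpath
RewriteCond %{QUERY_STRING} .
RewriteRule !^testpath - [NC,G]
```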

Oh, and do you really need [NC] here?


Well, no. Except that that was the only thing that differentiated the working rule and the non-working rule in the preceding post. Which would seem to imply that "testpath" was not all lowercase - as stated in the directive (or there was a typo somewhere along the way).

jehoshua

9:35 pm on Feb 10, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



I wasn't really suggesting that (or at least, not intending to). Keeping the "/testpath/.htaccess" file with "RewriteEngine Off" is a good solution and is indeed preferred if you want to override everything in the parent config.


Okay, thanks for the clarification