allow access to certain files despite badbot ban

coding worked pre-Apache-2.4; fails now; need update

         

stapel

12:40 pm on May 22, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Misbehaving users on my site generally get banned by IP address, with said IP address assigned to a named environment:

SetEnvIf {banned IP address} blocked

But access to the 403 ErrorDocument, a feedback form (in case the block was a mistake), a terms-of-use page, etc, was still desired. The list of files to which banned users were still allowed (that is, the list of files excepted from the ban) was set up as its own environment:

SetEnvIf Request_URI "^(/an/allowed\.file|/another/allowed\.file|/AndSoForth\.files)$" allowed
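
For anyone following along, here is a minimal sketch of what those two lines might look like with a concrete value filled in. The address 192.0.2.10 is a placeholder from the documentation range, not a real banned IP, and the two paths are the stand-ins used above.

```apache
# Hypothetical banned address; the dots are escaped because SetEnvIf takes a regex
SetEnvIf Remote_Addr "^192\.0\.2\.10$" blocked

# Pages that stay reachable even for banned visitors (placeholder paths)
SetEnvIf Request_URI "^(/an/allowed\.file|/another/allowed\.file)$" allowed
```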

Under previous versions of Apache, the following blocked banned users from everything except the list of excepted files:

<Files *>
order deny,allow
deny from env=blocked
allow from env=allowed
</Files>

With what would one replace this for Apache 2.4+?

Thank you.

Eliz.

lucy24

5:57 pm on May 22, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Huh. If you're going to say * (universal wild card) what's the point of even having an envelope? It isn't like robots.txt where you always have to name some user-agent; Allow/Deny directives can be loose in the config or htaccess, probably making fractionally less work for the server.

I use envelopes for specific filenames, such as
<Files "forbidden.html">
Order Deny,Allow
Allow from all
</Files>
Then you don't need to think about further environmental variables, because <Files> or <FilesMatch> will set its own rules. And you don't need to change anything because you're lucky enough to have moved up to 2.4 ;)

There's also the un-set option, as in
SetEnvIf Remote_Addr ^128\.30\.52 !keep_out
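
As a rough illustration of how the un-set form might be combined with an ordinary ban: SetEnvIf lines are processed in order, so a later line can remove a variable that an earlier line set. The range and address below are made-up placeholders.

```apache
# Flag every address in a hypothetical range...
SetEnvIf Remote_Addr "^192\.0\.2\." keep_out

# ...then clear the flag again for one address inside that range
SetEnvIf Remote_Addr "^192\.0\.2\.10$" !keep_out
```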

<tangent>
Using an environmental variable for IP addresses-- which can be used as-is-- seems like a roundabout way of doing things. Can you talk about how and why you use this method? Does 2.4 allow CIDR ranges in mod_setenvif? (I know it does with mod_rewrite, which makes Conditions a lot tidier.)
</tangent>
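
On the CIDR question, a hedged sketch rather than a tested answer: plain SetEnvIf still matches with a regular expression, but 2.4 adds SetEnvIfExpr, which takes an expression, and the expression syntax has an -R operator for matching the remote address against a subnet. Require ip also accepts CIDR directly, so the variable can be skipped altogether. The range below is a placeholder.

```apache
# Apache 2.4 only: set the variable for a whole CIDR range
SetEnvIfExpr "-R '192.0.2.0/24'" blocked

# Alternative with no variable at all: refuse the range directly
<RequireAll>
    Require all granted
    Require not ip 192.0.2.0/24
</RequireAll>
```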

stapel

9:51 pm on May 22, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Frequently, badbots and other scrapers come through, stealing content. Obviously, these agents do not respect robots.txt directives.

Many of the scrapers try once or twice, and then go away. As soon as the malicious behavior is detected, the source IP address is blocked, being placed into the "blocked" environment. Malicious agents are blocked from accessing any and all files: <Files *>

However, some scraping happens out of ignorance, etc, and the user would like to apologize and regain access. Said apology can be effected via the 403 ErrorDocument, which is a PHP feedback form. However, in order to use this form, the user needs to have access to the relevant files. By calling said files, the user puts himself into the "allowed" environment.

So agents who are hitting the server hard should be "blocked", but people who made a mistake need to be "allowed" to ask for restored access.

In other words, I'm trying to accomplish what is outlined in the second "code" box within this [webmasterworld.com] thread.

Thank you.

Eliz.

lucy24

2:07 am on May 23, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The part I'm not getting is
#1, why the wild-card <Files *> envelope is needed at all (I think you may have misunderstood my reference to robots.txt-- I was just talking about syntax)
and
#2, why the files that get special treatment can't simply be listed in a <Files> envelope of their own, with its own access rules. (In addition to the obvious exemption for the 403 document-- needed to prevent infinite loops-- I've also got exemptions for robots.txt, stylesheets and the favicon. The last two help me identify wrongly blocked humans.) Are there any circumstances where you would not want someone to see even the maybe-permitted files?

2002, wow, wonder what Apache version everyone was using back then? (I also wonder what jdMorgan is up to, but never mind that. He dropped from sight shortly after I started reading this forum.)

Does Apache 2.4 simply not recognize the * locution? Frankly I'm surprised it even worked in 2.2; it's so minimalist, it isn't documented. It would not be the first time someone did something in Apache that the docs claim isn't possible. Personally I'd have said something like <FilesMatch "\w\.\w"> (unanchored) all along if I absolutely had to have a catchall envelope.

<topic drift>
The reference to the contact form intrigues me because for several years I've been visited by a botnet whose normal pattern includes a 403 (bad referer) followed by a request for the contact page. It started soon after I created the page: how in the world did they all find out about it? It wasn't just guessing; requests only started after the page (in a subdirectory) actually existed.
</topic drift>

not2easy

9:29 am on May 23, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I have this identical format in some htaccess files where I am using a similar (php based) robot trap. It still works as written despite its age. To save time and have the referred lines here, this is the section of .htaccess from that thread:
# Block bad-bots using lines written by bad_bot.pl script above
SetEnvIf Request_URI "^(/403.*\.html|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>

The "|" Pipe character shown in the old posts used to be shown as a broken pipe, to be corrected by the user. I replaced it in the quote so as not to confuse the intent.
((/403.*\.html¦/robots\.txt) = (/403.*\.html|/robots\.txt))


The trap catches individual IPs and sends me an email which I use to look up the CIDR and add to the "deny from" list in that htaccess file, replacing individual IPs which can really add up fast. It probably could use some updated coding, but it "works" so it stays as is.

stapel

1:47 pm on May 23, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



not2easy said:
...this is the section of .htaccess from that thread:
order deny,allow
deny from env=getout
allow from env=allowsome

This old "order deny,allow" has been replaced with "Require" directives. I'd had to stop using the above-referenced format when my server was upgraded to Apache 2.4.

lucy24 said:
Does Apache 2.4 simply not recognize the * locution?

The server is fine with "<Files *>". When an IP address is placed in the "blocked" environment, the user from the IP address is blocked, and the server has no issues. It's just that I can't seem to find a way to let the user through to see certain select files.

lucy24 said:
I'm not getting...why the wild-card <Files *> envelope is needed at all...

I'm using "<Files *>" to block access to everything on the server, following (slavishly, in my ignorance) the rules, etc, set out in the 2002 post by jdMorgan. I have no idea whether there is, or was at the time, a different or better way to accomplish this.

lucy24 said:
I'm not getting...why the files that get special treatment can't simply be listed in a <Files> envelope of their own, with its own access rules.

Using a separate envelope is fine by me. Would the envelope be something along the lines of the following?
<Files "^(/an/allowed\.file|/another/allowed\.file|/AndSoForth\.files)$">
Require [something]
</Files>

If so, what would be a good set-up for the "[something]"? Would I need an "AllowOverride" to override the "blocked" environment? How would this work with the existing ban on access to everything else (or would this replace the existing language)?

Thank you for your patience.

Eliz.

lucy24

6:22 pm on May 23, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm using "<Files *>" to block access to everything on the server

I wish JDMorgan were here so I could ask him to explain this. It seems as if the envelope has the identical effect as leaving the whole list lying loose in your config file, only the envelope makes a teeny bit more work for the server on every request.

:: detour to check docs [httpd.apache.org] ::

each group is processed in the order that they appear in the configuration files.

So if you have more than one <Files> or <FilesMatch> matching your request, later ones override earlier ones simply because they are later. (Docs also say that #1 <Directory> is done by itself before #2 <DirectoryMatch>, while #3 <Files> and <FilesMatch> are done concurrently. This is actually relevant, because it means you can change your mind and switch between <Files> and <FilesMatch> without affecting anything.) If you've got <Location> that's #4, while the newly added <If> is #5.
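
A small sketch of the "later overrides earlier" rule, written in 2.4 syntax and assuming the default AuthMerging setting (Off), under which a later section's authorization rules replace the earlier ones instead of being combined with them. A request for robots.txt matches both sections, and the second one decides.

```apache
# Matches every file, robots.txt included
<Files "*">
    Require all denied
</Files>

# Also matches robots.txt; because it comes later, its rule is the one applied
<Files "robots.txt">
    Require all granted
</Files>
```

Swapping the two sections would leave robots.txt denied, since the catch-all would then be the later match.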

Oh, and
Nested sections are merged after non-nested sections of the same type.
...
Sections inside <VirtualHost> sections are applied after the corresponding sections outside the virtual host definition. This allows virtual hosts to override the main server configuration.

That may or may not apply to your server, depending on how much stuff you've got inside your VirtualHost envelopes. If all you've got is a few lines defining server name and so on, ignore it.

All of this appears to be identical between 2.2 and 2.4, barring the <If> bits which obviously don't apply in 2.2 and earlier. But the 2.4 docs add an explanatory paragraph:
Later sections override earlier ones, however each module is responsible for interpreting what form this override takes. A later configuration section with directives from a given module might cause a conceptual "merge" of some directives, all directives, or a complete replacement of the modules configuration with the module defaults and directives explicitly listed in the later context.

That wraps up the Apache docs. (Note for future reference that in 2.0 and 2.2 the fragment is spelled "mergin" while in 2.4 they've fixed the typo to "merging".)

Using a separate envelope is fine by me. Would the envelope be something along the lines of the following?
<Files "^(/an/allowed\.file|/another/allowed\.file|/AndSoForth\.files)$">
Require [something]
</Files>

<Files> and <FilesMatch> apply only to the filename, not its full path, and a regular expression like this one needs <FilesMatch> rather than plain <Files> (docs again, just so you know you're not taking my unsupported word on this):
Directives enclosed in a <Files> section apply to any file with the specified name, regardless of what directory it lies in.

If there is a risk of other, no-special-handling files having the same name, you'd need to put the rules inside <Directory> sections for the first part of each path. This is your own server, right?, so everything is happening in the config file? If you're certain the name will never occur anywhere else, it can be <Files> alone-- but it might still save the server a bit of work to tuck everything away in a <Directory> section.
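
Something along these lines is what tucking it into a <Directory> section could look like. This only works in the server config, not in htaccess, and the filesystem path is invented for the example.

```apache
# Server config only; /var/www/example/errors is a hypothetical path
<Directory "/var/www/example/errors">
    # Applies to 403.php in this directory tree, not to a same-named file elsewhere
    <Files "403.php">
        Require all granted
    </Files>
</Directory>
```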

Would I need an "AllowOverride" to override the "blocked" environment?

No, AllowOverride is a directive permitting .htaccess files in specified directories. (The words after "AllowOverride" determine which mods can be affected by htaccess.) Rules within <Files> envelopes supersede rules outside the envelope-- and also, as discussed earlier, each new envelope supersedes any earlier ones. So if you want to allow universal access, simply set a fresh set of Allow/Deny directives. For example, you should always have a
<Files "robots.txt">
Order Deny,Allow
Allow from all
</Files>
so no malign robot can ever say "But I tried to read robots.txt ::whine:: and they wouldn't let me."

stapel

12:22 am on May 24, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



lucy24 said:
<Files "robots.txt">
Order Deny,Allow
Allow from all
</Files>

I'm pretty sure that I cannot use this "Order Deny,Allow" syntax, since my server is using Apache 2.4.

lucy24 said:
If there is a risk of other, no-special-handling files having the same name, you'd need to put the rules inside <Directory> sections for the first part of each path.

I'm trying to accomplish this all within the .htaccess file. The <Directory> directive is not allowed in .htaccess files.

This setup seems to be working correctly:

<Files *>
<RequireAll>
Require all granted
Require not env blocked
</RequireAll>
</Files>

<Files .htaccess>
Require all denied
</Files>

<FilesMatch "403.php|403oops.htm|403thank.htm|terms.htm|linking.htm|license.htm|trynguess.php|purple.png|logo.gif|mailit.php$">
Require all granted
</FilesMatch>

The <Files *> section grants full access to everybody who hasn't been "blocked" by IP address.

The <Files .htaccess> section blocks even "good" users from viewing the .htaccess file.

The <FilesMatch> section restores access to those who have been "blocked" to the desired files, allowing them to send me a message asking to be let back in.

By putting the sections in this order, the permissions are:

* everybody gets to see everything, except those "blocked" (who can see nothing)
* but nobody gets the .htaccess file, regardless of being "blocked" or not
* but those "blocked" do get restored access to a short list of files
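
For comparison, here is an untested sketch of a single-section variant that reuses both variables from the original setup ("blocked" from the IP list, "allowed" from the SetEnvIf Request_URI line) instead of listing the exempt files a second time. The inner <RequireAll> is there because a negated Require cannot grant access on its own, so it has to sit next to a positive one.

```apache
<Files "*">
    <RequireAny>
        # Exempt pages are open to everyone, banned or not
        Require env allowed
        # Everything else is open only to visitors who are not banned
        <RequireAll>
            Require all granted
            Require not env blocked
        </RequireAll>
    </RequireAny>
</Files>

# Kept after the catch-all so that it still wins for the .htaccess file itself
<Files ".htaccess">
    Require all denied
</Files>
```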

Thank you!

Eliz.

lucy24

1:49 am on May 24, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm pretty sure that I cannot use this "Order Deny,Allow" syntax

Sorry, my bad, I forgot that Order had been dumped in favor of a greatly expanded Require. So that's
Require all granted
for universal access. Other than that, the naming and arrangement of the <Files> envelopes is the same for, afaik, every Apache version ever.
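
So the robots.txt envelope from the earlier post, rewritten for 2.4, would come out roughly as follows. It has to sit after any catch-all <Files *> section for it to take effect.

```apache
<Files "robots.txt">
    Require all granted
</Files>
```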

I'm trying to accomplish this all within the .htaccess file.

Oh, OK, I got the impression all this was happening in config. The AllowOverride directive can't be used in htaccess at all. Just use <Files>, unless you really do have multiple files with the same name; if you do, you'll need to put supplementary htaccess files in the relevant directories. (Pro tip: If you have multiple htaccess files on the same site, include #comment lines in the main htaccess, reminding yourself where the others are located. Do as I say, not as I do.)

<FilesMatch "403.php|403oops.htm|403thank.htm|terms.htm|linking.htm|license.htm|trynguess.php|purple.png|logo.gif|mailit.php$">

All those . should be escaped. (This is a good example of a non-lethal error, since the chances are pretty minute that any non-period could occur in this location.)
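
For reference, a sketch of the same list with the periods escaped. The group and the leading ^ are an addition beyond what was asked: in the pattern as posted, the closing $ binds only to the last alternative (mailit.php), and the other names can match anywhere inside a filename.

```apache
<FilesMatch "^(403\.php|403oops\.htm|403thank\.htm|terms\.htm|linking\.htm|license\.htm|trynguess\.php|purple\.png|logo\.gif|mailit\.php)$">
    Require all granted
</FilesMatch>
```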

Now, if it were me I'd make a couple of separate envelopes, like this:
<FilesMatch "(403|trynguess|mailit)\.php">
blahblah

<FilesMatch "(403\w+|terms|li(nking|cense))\.htm">
blahblah

<FilesMatch "(purple\.png|logo\.gif)$">
blahblah
but at this point we're into individual coding style. If you really do have a multitude of possible 403 documents, I do strongly recommend a locution like
<FilesMatch "403\w*\.(htm|php)">
to cover all possibilities.

If all your error documents and their supporting files are located in the same directory, you could even dispense with the whole <FilesMatch> business and just put a one-line htaccess in that directory, saying simply "Require all granted".

wilderness

6:08 pm on May 25, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<FilesMatch "403.php|403oops.htm|403thank.htm|terms.htm|linking.htm|license.htm|trynguess.php|purple.png|logo.gif|mailit.php$">


All those . should be escaped. (This is a good example of a non-lethal error, since the chances are pretty minute that any non-period could occur in this location.)


lucy,
Could you expand on this explanation?
The quotes are supposed to circumvent the use of the escape.

lucy24

11:40 pm on May 25, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How can you circumvent the need for an escape while continuing to recognize pipes? They're part of the same language.

The quotation marks appear to be part of the <Files> or <FilesMatch> syntax. (For some reason I can't find the concrete explanation in docs.) In some situations quotation marks have additional meaning: for example in mod_setenvif, quotation marks allow you to use un-escaped literal spaces, which would otherwise have syntactic meaning. But that's not about regular expressions; it's specific to Apache.
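
A quick sketch of the difference, using a made-up user-agent string and a made-up variable name. The quotation marks keep the space from ending the argument, but inside them the pattern is still a regular expression, so a period still needs its backslash to be taken literally.

```apache
# Quotes let the pattern contain a literal space
SetEnvIf User-Agent "Evil Scraper" blocked

# Quotes do not switch off regex syntax: the unescaped . here matches any character
SetEnvIf Request_URI "^/terms.htm$" loose_match

# Escaped, the period matches only a period
SetEnvIf Request_URI "^/terms\.htm$" allowed
```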

In the specific case of file extensions, escaping or not escaping is not a huge issue.

lucy24

8:43 pm on May 29, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Follow-up: In the course of off-thread discussion, I figured out the one situation where you'd want to use a wild-card <Files *>. I'm going to post it here so I'm less likely to forget it again.

A universal <Files *> will override any earlier <Files> or <FilesMatch>, where "earlier" could mean either in a higher directory (most likely the config file vs. htaccess) OR earlier in the same <Directory> section or htaccess.

Conversely, a <Files> (of any kind, including wild cards) will not be overridden by any rule lying loose in a <Directory(Match)> or htaccess (whether the same or a later one). The new rule would have to be inside a <Files> or <FilesMatch> of its own to override the earlier one.
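
To make that concrete, a hedged sketch of an htaccess file in a subdirectory holding the error documents, assuming the parent htaccess has a <Files *> section that blocks banned visitors. The directory name is hypothetical. Going by the rule just described, a bare "Require all granted" here would be overridden by the parent's <Files *>, so the rule gets an envelope of its own.

```apache
# Hypothetical /errors/.htaccess
# Wrapped in <Files> so that it overrides the <Files *> section inherited from above
<Files "*">
    Require all granted
</Files>
```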

If, for example, you were feeling suicidal, you could say
<FilesMatch "^\.ht">
Order Deny,Allow
Allow from all
</FilesMatch>
in your htaccess, and that would override the config file's bar on people looking at your htaccess and htpasswd files. (Here I've deliberately made up the most ridiculously unlikely scenario ever.)