Forum Moderators: phranque

Bot has found a way to bypass .htaccess?

Plus another Apache problem

         

thord

3:56 pm on May 3, 2025 (gmt 0)

10+ Year Member Top Contributors Of The Month



I am trying to block PetalBot by IP range, user agent and even referrer, all three methods simultaneously in .htaccess, but without the expected success. Sometimes this bot gets blocked, sometimes not, and I cannot see any pattern in the raw logs that explains it. This situation began a couple of years ago. In all other cases .htaccess still works as it should, and has for many years. Any ideas? Has Apache changed in some way?

(Blocking by referrer needs an explanation. I have a parked domain, example2.com, that is rewritten to example.com. PetalBot has somehow found out about this and often uses example2.com as a visible referrer when accessing pages of example.com. It looks like the rewrite rule is not functioning either, but again only in the case of PetalBot.)

Then the other Apache problem. A few years ago my web host changed the path to the cPanel login page. It is now simply cpanel.example.com. Thus bots can easily find and go to my cPanel's login page, but no further, of course. These hits on files and folders that sit on the server one level above the public_html folder are, however, showing up in my access log, which I find irritating. Can anything be done? The .htaccess file resides in the root of the public_html folder and does not work one level above.

not2easy

4:13 pm on May 3, 2025 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



When you look at your logs, do they show a 200 server response for the requests (hits)? It is not uncommon for bots to attempt access, but if the server response is 200, then you might want to talk with the host, as they do (or should) have further controls above your public_html folder.

Have you checked to see whether your blocking efforts are compliant with the Apache version your host has installed? It is OK to post the .htaccess lines you are using for blocking; simply replace domain names with example.com or example.net etc. It is not uncommon to have minor errors that might prevent your efforts from working, sometimes depending on their placement within the .htaccess file.

thord

5:14 pm on May 3, 2025 (gmt 0)

10+ Year Member Top Contributors Of The Month



The .htaccess file is set to serve a 4** response to all unwanted hits. Sometimes PetalBot gets a 4**, sometimes it gets a 200 to an identical request (for a different file, though). That is why this is so very strange.

It is not impossible that my .htaccess is in some way outdated. However, it is functioning perfectly all the time for all other rejected requests except PetalBot's. It is as if Huawei has invented a method to circumvent blocks.

Maybe the start has changed from:
Options +FollowSymLinks
RewriteEngine on

lucy24

10:02 pm on May 3, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The .htaccess file is set to serve a 4** response
There’s no need to be coy; this is the Apache subforum. 403? 404? 418?

an identical request (for a different file, though)
If it’s for a different file, the request is not identical. The most obvious variable is the physical filepath.

An oddity of the PetalBot--assuming it's the real thing, not a faker--is that it appears to be robots.txt compliant. In all the time I’ve disallowed it, it has never requested anything else. Now, obviously robots.txt is not a means of access control--it’s a No Admittance sign, not a deadbolt--but it should always be a first step. The only thing better than a blocked request is a request that is never made in the first place.

Which reminds me: I hope you are allowing everyone to see robots.txt. Don’t give them an excuse for “Well, I tried to ask, but they wouldn't let me see it ::whine::”

Maybe the start has changed from:
Options +FollowSymLinks
RewriteEngine on
Very, very unlikely to be the problem. But if you need advice about an htaccess, you will have to be specific: What Apache version, what is the exact text of the access-control portion of your htaccess, and so on.

And finally: No. It is not possible for even the most malign of actors to bypass a configuration file, unless you have poked a hole for them, either accidentally or by design. Look elsewhere for a solution.

tangor

6:28 am on May 4, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



PetalBot has been robots.txt compliant for me for years. As for the fakers, a simple env of "petal" for the UA has always worked. Personally, I try to avoid IP blocking if possible, to keep things more manageable.

YMMV

thord

4:03 pm on May 4, 2025 (gmt 0)

10+ Year Member Top Contributors Of The Month



There’s no need to be coy

That assumption is wrong. I wrote 4** simply because I have been experimenting with 403, 404 and 410 to see if it makes any difference to PetalBot's hits. There is no difference.

the request is not identical

Do you really mean it can make a difference to how .htaccess functions if the request is for /a.html or for /b.html? It would be impractical (and probably not useful) to search the logs for two requests for a file /a.html, one where PetalBot got a 200 and one where it got a 4**. The problem is that PetalBot gets a 200 for /a.html and the next second a 4** for /b.html without visible reason. As I said, the .htaccess works, and has for many years worked, exactly as expected for all requests other than PetalBot's.

My site has always had a robots.txt file. But it is, and has always been, empty. Malicious bots do not care, and for welcome bots my site has nothing to hide. So there can be no whining.

lucy24

7:54 pm on May 4, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If it is literally a.html vs. b.html, then clearly some more investigation is needed. But if it's /filepath/a.html vs. /different-filepath/b.html then it can make a big difference. That's physical filepath, not URLpath.

Meanwhile, I suggest you re-read all posts in this thread.
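
One way to test the filepath theory against the raw logs is to tally PetalBot's hits by directory and response code, and see whether the 200s cluster in particular folders. The following is only a minimal sketch, assuming the access log is in Apache's Combined Log Format. The function names, the sample log lines and their IP addresses are made up for illustration. Note that the log records the URL path, which only approximates the physical filepath if rewrites are involved.

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def top_directory(path):
    """Return the first URL path segment, or "/" for files in the site root."""
    parts = path.split("?", 1)[0].split("/")
    # "/a.html" -> ["", "a.html"]; "/blog/b.html" -> ["", "blog", "b.html"]
    return "/" + parts[1] + "/" if len(parts) > 2 else "/"

def tally(lines, needle="petalbot"):
    """Count hits per (directory, status) for user agents containing `needle`."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and needle in m.group("agent").lower():
            counts[(top_directory(m.group("path")), m.group("status"))] += 1
    return counts

# Hypothetical sample lines, not taken from any real log.
SAMPLE = [
    '192.0.2.10 - - [03/May/2025:15:56:00 +0000] "GET /a.html HTTP/1.1" 403 199 '
    '"-" "Mozilla/5.0 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"',
    '192.0.2.10 - - [03/May/2025:15:56:01 +0000] "GET /blog/b.html HTTP/1.1" 200 5120 '
    '"http://example2.com/" "Mozilla/5.0 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"',
    '192.0.2.20 - - [03/May/2025:15:56:02 +0000] "GET /blog/c.html HTTP/1.1" 200 4096 '
    '"-" "Mozilla/5.0 (compatible; SomeOtherBot)"',
]

if __name__ == "__main__":
    for (directory, status), hits in sorted(tally(SAMPLE).items()):
        print(directory, status, hits)
```

To run it on a real log, pass an open file instead of SAMPLE, for example tally(open("access.log")). If all the 200s turn out to sit under one or two directories, those are the places to look for a second .htaccess file or some other per-directory difference.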

tangor

2:33 am on May 5, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



robots.txt can be a best friend even if it is NOT honored by BAD BOTS. The honorable bots WILL HONOR THOSE DIRECTIVES and can save a BOATLOAD of .htaccess coding.

EVERYBODY gets robots.txt, NO EXCEPTIONS, so they can see your desires upon access and that 200 indicates they DID find it.

Meanwhile, a blank robots.txt is useless, except for returning a 200 in the log file IF REQUESTED simply because the URL exists (without passing on any directives at all).
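
The difference between a blank robots.txt and one that carries a directive can be checked with the robots.txt parser in Python's standard library. This is a rough sketch only: the helper name and the two robots.txt texts are made up for illustration, and it shows how one compliant parser reads the file, not how PetalBot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, agent, url="/a.html"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# An empty file passes on no directives, so everything is allowed.
EMPTY = ""

# A file that asks PetalBot, and only PetalBot, to stay away entirely.
DISALLOW_PETAL = "User-agent: PetalBot\nDisallow: /\n"

if __name__ == "__main__":
    print(allowed(EMPTY, "PetalBot"))
    print(allowed(DISALLOW_PETAL, "PetalBot"))
    print(allowed(DISALLOW_PETAL, "Googlebot"))
```

With the empty file the parser allows PetalBot everywhere. With the two-line file it refuses PetalBot while leaving other agents alone, which is the request a compliant bot would honor without a single line of .htaccess.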