Rate limiting bad bots and attacks

csdude55

6:33 pm on Nov 13, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have one specific page that is under constant attack by bots. I honestly can't imagine why, the page has a form but they're not submitting the form! I just see like 100 connections per second to the page.

Any suggestions on rate limiting the connections to that page? Or, if it's better, rate limiting all pages? I'm open to methods that use PHP or Apache, and I have root access so I can work with configurations. But using cookies obviously doesn't work, since bad bots wouldn't store them.

lucy24

9:56 pm on Nov 13, 2023 (gmt 0)

How do legitimate humans reach the page? If they generally get there from somewhere else, you could look for the presence or absence of a referer.

For comparison purposes:
RewriteCond %{HTTP_REFERER} !.
RewriteCond %{HTTP_USER_AGENT} Chrome/108\.0\.0\.0
RewriteRule ^ebooks/(\w+/(\w+\.html)?)$ https://example.com/boilerplate/redirect.php?newpage=/ebooks/$1 [R=302,L]
The page “redirect.php” says, in effect, “I'm awfully sorry, but you have inadvertently replicated the behavior of an unwelcome robot” (because it is theoretically possible for a human to meet these conditions) ... and then it's got a link to the originally requested page. In this particular case--I've used it for a few others, including one category of actual humans--it's a robot that always uses the same user-agent, always coming in with a null* referer.

This doesn't, of course, stop them from making the request right away. But they tend to get bored and go away after a while.

The reference to a form intrigued me because I have long been plagued by various kinds of what I know as the “Contact botnet”. So far they have failed to notice that the Contact page no longer uses a php form. But it's only been about a year and a half since I simplified it, so I have not given up hope. (The current page instead uses a mailto: link but, since I never see spam to the address in question, that doesn't seem to be what they're looking for.)

* They send a blank referer, shown as "" in logs, as opposed to the "-" when no referer is sent. Unfortunately there doesn't seem to be any way to check ahead of time for “Referer header is sent, but is empty”.

csdude55

6:43 am on Nov 14, 2023 (gmt 0)

Good thought! This specific page will ALWAYS have a referrer, so I added this to Apache CONF:

RewriteCond %{REQUEST_URI} ^/foo/bar/
RewriteCond %{HTTP_REFERER} !.
RewriteRule ^ - [F]

It's been about 30 minutes since I did that, and at 1am I'm still seeing them hit at about 100 per second and my server load is 3.5! I'm using this to track it:

tail -f -n1 /var/log/apache2/domlogs/foo/foo.com-ssl_log

The only thing that changed after making this change is that the response code is 403. At least, I'm assuming that's what this means:

3.87.229.113 - - [14/Nov/2023:01:13:48 -0500] "GET /foo/bar/ HTTP/1.1" 403 20 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html) X-Middleton/1"

I don't have a way of blocking them at the firewall, because I'm using Cloudflare and now every connection comes from one of about 12 Amazon IPs! So while CSF (the firewall) does have a connection limit option, it would end up blocking those Amazon IPs instead of the bad bot.

lucy24

5:50 pm on Nov 14, 2023 (gmt 0)

Oh, right. If the page is also visited by legitimate robots, you'd have to poke a hole for them. But if you don't want it to be an entry page, it doesn't matter.

But, remember: never put anything into a Condition that can go in the body of the rule. It's more efficient to say
RewriteCond %{HTTP_REFERER} !.
RewriteRule ^foo/bar/ - [F]
This is the syntax for htaccess or a <Directory> section; otherwise replace the ^ with whatever is appropriate for this specific config file.

csdude55

6:25 pm on Nov 14, 2023 (gmt 0)

In this case, the page is where the registered user would go if they've forgotten their password. I can't think of any reason that a legit user would start with that page. But this could potentially happen with any page, so I wish that there was a way to rate limit every page :-/

I kid you not, at least a few times a day my server load spikes to 60+! And it looks like it's always bots causing the problem.

I don't suppose there's a better option than [F], is there? I don't know if writing each request to foo.com-ssl_log is causing an increase in server load, but that's the only reason I can think of.

If not, is there a way to forbid it AND prevent it from being written to the log? That would at least let me see if anything else is going on.

lucy24

12:36 am on Nov 15, 2023 (gmt 0)

"I don't suppose there's a better option than [F], is there?"
You could return a different error, such as 503 or 418 (“teapot error”, used by my host for mod_security) or, heck, 429 (“too many requests”). If they're really invasive, they may already be getting some of those naturally. In mod_rewrite the syntax is, counterintuitively, [R=418] or [R=503] or numerical code of your choice; no [L] is needed.
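
For illustration, a minimal sketch of that syntax applied to the earlier no-referer rule. It assumes htaccess or <Directory> context and the /foo/bar/ path used in this thread; 429 is picked only as an example, and which codes Apache accepts can depend on the version.

```apache
RewriteEngine On
# True when the Referer header is absent or empty
RewriteCond %{HTTP_REFERER} !.
# "-" means no substitution; with a non-3xx R= code, rewriting stops as if [L] were given
RewriteRule ^foo/bar/ - [R=429]
```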

"If not, is there a way to forbid it AND prevent it from being written to the log?"
You did say you have access to the config file, right? If so, you could look into custom log settings (directive CustomLog under mod_log_config [httpd.apache.org] in the Apache docs). But really, the act of logging a request is probably the least of the server's problems, and it can be useful to have some kind of a record.
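
A minimal sketch of what such a custom log setting could look like, using the conditional-logging pattern from the Apache docs. The variable name dontlog, the log path, and the "combined" format name are placeholders, not anything taken from the actual server.

```apache
# Mark every request for the attacked path (path from earlier in the thread)
SetEnvIf Request_URI "^/foo/bar/" dontlog
# Write a log line only when the marker is NOT set
CustomLog "logs/access_log" combined env=!dontlog
```

Note that this suppresses all log lines for that path, including any legitimate ones, so it is probably best used only briefly while looking for other traffic.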

csdude55

1:45 am on Nov 15, 2023 (gmt 0)

I forgot about 418! That was so funny back in the day! LOL

Do you think there's any real benefit to using it, though? Is one more likely than another to get bad bots to leave me alone?

csdude55

7:05 am on Nov 15, 2023 (gmt 0)

Blah, I just learned that I can't use [R=418] :'-( I tried, but when I restarted Apache it gave me an error that it was an invalid response value. I'm using 429 now, but I'm still seeing a ton of requests so I guess it didn't really matter.

I stumbled across this 2012 article on rate limiting a page:
[johnleach.co.uk...]

He suggests using mod_security (or in my case, mod_security2), and then:

SecRuleEngine On

<LocationMatch "/foo/bar/">

# I'm using X_FORWARDED_FOR because of Cloudflare, and
# added the unique id: to each Sec line for mod_security2
SecAction initcol:ip=%{HTTP_X_FORWARDED_FOR},pass,nolog,id:11
SecAction "phase:5,deprecatevar:ip.somepathcounter=1/1,pass,nolog,id:12"
SecRule IP:SOMEPATHCOUNTER "@gt 60" "phase:2,pause:300,deny,status:429,setenv:RATELIMITED,skip:1,nolog,id:13"
SecAction "phase:2,pass,setvar:ip.somepathcounter=+1,nolog,id:14"

Header always set Retry-After "10" env=RATELIMITED
</LocationMatch>

ErrorDocument 429 "Rate Limit Exceeded"

I'm not getting any errors with this, but I didn't see any change in requests or server load, either. So I took it back down until I can test it better.

It's 2am and I'm fading fast, so tomorrow night I'll add SecRule back and then test the page to see if $_SERVER['RATELIMITED'] exists; that will confirm that LocationMatch is working, at least. If it is then I guess their HTTP_X_FORWARDED_FOR is changing on every request, too :-/
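
One thing that may be worth checking in the snippet above, offered as an assumption rather than a diagnosis: ModSecurity 2 expands %{...} macros against its own variables, where request headers are normally addressed through the REQUEST_HEADERS collection rather than the CGI-style HTTP_X_FORWARDED_FOR name. If that name does not resolve, the collection may not be keyed per client at all. A minimal sketch of the initcol line under that assumption, keeping the id from above:

```apache
<LocationMatch "/foo/bar/">
    # Assumption: key the per-client collection on the forwarded header,
    # read through ModSecurity's REQUEST_HEADERS collection
    SecAction "phase:2,initcol:ip=%{REQUEST_HEADERS:X-Forwarded-For},pass,nolog,id:11"
</LocationMatch>
```

Because X-Forwarded-For arrives with the request, a per-header counter like this is only meaningful when the header is set by a trusted proxy.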

lucy24

5:50 pm on Nov 15, 2023 (gmt 0)

"but when I restarted Apache it gave me an error that it was an invalid response value"
Hm, that's interesting, since mine is also Apache. (But I suspect they haven't updated the mod_security rules in a while, since 418s are a minute number compared to 403.) Maybe there's something you have to change elsewhere in the config file to add 418 to the list of possible responses? In any case I doubt the robot really cares which 400-class response it receives; it just makes it easier to eyeball them in logs.

"but I didn't see any change in requests"
And the bad news is ... There isn't always any relationship between the response a robot receives today, and the request it sends tomorrow. With the obvious exception of things like legitimate search engines, which will quickly learn which URLs get 301 and 404 responses, and stop requesting them unless the old URLs are continuously reinforced by outdated links. I mean, you see requests for php-admin year after year after year, don't you.

About all you can do is handle the request in the way that is least troublesome for the server, even if all that means is that your 403 page is smaller than the “real” page.

:: wandering off to pore over a random day's logs and see if I can figure out why some 403s take 3000-odd bytes while others take 8000-odd (one or the other) ::

csdude55

7:51 pm on Nov 15, 2023 (gmt 0)

I wouldn't mind if they would only send 1 request per second, but I've had tail -f -n1 running all day and it's like watching the Matrix code zoom by! Now I finally get why Cypher went crazy :-O

My server load right now is 4.8, and it looks like it's mostly from this stream of constant requests. "top" shows that I have 177 tasks, 1 running, 173 sleeping, and 3 zombie. I guess that means that the 1 running task is making a constant stream of requests?

The problem began after I started using Cloudflare, so I'm pretty sure that the issue is that the firewall isn't able to block bad IPs and connections the same way that it could before. I've gone through a million things trying to change that, but nothing seems to have any impact.

not2easy

8:27 pm on Nov 15, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Doesn't Cloudflare offer a captcha? There is a Cloudflare captcha on a service I log into nearly every day, and it is just a checkbox that 'agrees' I am not a bot. Almost no delay.

csdude55

9:17 pm on Nov 15, 2023 (gmt 0)

They're not submitting the form, though, just connecting to the page and then leaving. I suspect that they're just reading the headers, but I'm not sure.

Just an educated guess, but I think that I need to find a way to block them before they reach Apache. The only way I know to do that is at the firewall, but that's not an option since all connections now show an IP from Cloudflare. I have an environment variable of HTTP_X_FORWARDED_FOR that shows the original IP (technically, the format is "IPv6, IPv4"), but I can't find a way to make CSF use it.

I also use CSF to block non-US IPs, but I'm not sure if it's actually working now.

csdude55

9:22 pm on Dec 1, 2023 (gmt 0)

For the sake of posterity:

When I said "the problem began after I started using Cloudflare", I should clarify that I began using Ezoic, which is a Cloudflare partner.

I recently created my own Cloudflare account, then linked Ezoic to that. This gave me options in Cloudflare that I didn't have in Ezoic, one of which is a "Rate Limit rule"! And that seems to have solved my problem. It doesn't block them, it just slows them down.

This was pretty nifty, the rule looks like:

If incoming requests match...
[URI Path] [starts with] [/foo/bar]

With the same characteristics...
[IP]

When rate exceeds...
Requests [2]
Period [10 seconds]

Then take action...
[Block] with response code [429]

For duration...
[10 seconds]


I also blocked non-US IPs and common exploit attempts, so now they're dying before they ever even reach Apache :-)

I also enabled their "Bot Fight Mode", which issues challenges to requests that match patterns of known bots (including JavaScript detection). I had one person say that they were seeing this challenge on their iPhone, but when I asked for their IP, or whether they had any security apps installed that might be blocking JavaScript, they stopped responding :-/ So I don't know if that's really just one anomaly, or if it's widespread and no one is saying anything! But I checked on my girlfriend's iPhone and didn't get a challenge, so I suspect it's an isolated issue.

Before this, my server load would occasionally spike as high as 138! But now it hasn't gone past 2.0 :-)

csdude55

7:17 pm on Dec 3, 2023 (gmt 0)

I spoke too soon, I'm still seeing huge spikes :-/ Not as often as before, but they do still happen. Blerg.

thecoalman

6:42 pm on Dec 31, 2023 (gmt 0)

"I don't have a way of blocking them at the firewall, because I'm using Cloudflare and now every connection comes from one of about 12 Amazon IPs! So while CSF (the firewall) does have a connection limit option, it would end up blocking those Amazon IPs instead of the bad bot."


Assuming you are restoring the real IP with an Apache module for logs and other needs, that is not the IP the firewall or CSF operates on. If you are not restoring the real IP, it should not be appearing in your log files, because that would mean they are circumventing the proxy.
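
A minimal sketch of the kind of module setup meant here, assuming Apache 2.4 with mod_remoteip loaded and Cloudflare's CF-Connecting-IP header (a different proxy may only send X-Forwarded-For). The two address ranges are examples only; the complete, current list has to come from Cloudflare's published ranges.

```apache
# Take the client address from the header added by the proxy
RemoteIPHeader CF-Connecting-IP
# Honor that header only on connections coming from the proxy itself.
# Illustrative subset; replace with the full published list.
RemoteIPTrustedProxy 173.245.48.0/20
RemoteIPTrustedProxy 103.21.244.0/22
```

This only changes what Apache sees and logs. The firewall still sees the proxy's address on the TCP connection. Log formats that use %h may need %a to show the restored address.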

Unless you have a special case, the only IPs accessing the origin over ports 80 and 443 should be Cloudflare IPs. In CSF, for ports 80 and 443, you allow Cloudflare IPs and block everyone else. This is a crucial step for DDoS mitigation if you ever need it.

As for AWS, other than DuckDuckGo I don't know of any legitimate traffic. What I do know is that there is a tremendous amount of illegitimate traffic from their network. In the Cloudflare panel you can set up a rule to allow DuckDuckGo IPs and block the rest of them. If you are seeing illegitimate traffic from other networks <pointing my finger at OVH> you can just block by ASN.

[edited by: thecoalman at 7:21 pm (utc) on Dec 31, 2023]

thecoalman

6:54 pm on Dec 31, 2023 (gmt 0)

"Doesn't Cloudflare offer a captcha?"


They offer many different options depending on how aggressive you want to be, and they can be applied to just about any criteria imaginable. Most of the legitimate traffic coming to my site will be from the US, specifically the northeast US. I have various blocks in place for user agents of bots that ignore robots.txt, ASNs of abusive networks, etc. The very last rule issues a JS challenge to any request from outside the US or Canada. It's fairly seamless for legitimate traffic, though they might get the "Checking your connection..." page. The solve rate for this is only about 2.7% for the last 24 hours, and that is over many thousands of requests.