G'day from a new member interested in bots/crawlers/spiders

profshoelace

1:21 am on Nov 19, 2023 (gmt 0)

Top Contributors Of The Month



G'day all from Australia. I'm a new member interested in bots.

My humble website has been up since 1999. For most of the past two decades I've simply relied on the stats available in cPanel to sift the suspected bot/crawler/spider traffic from "real" traffic and thus give meaningful visitor stats. These stats were mostly fine for my needs, although there were gaps whenever AWStats crashed due to a huge traffic spike (my website is frequently posted on high-profile forums like Reddit).

More recently, my web host abandoned cPanel in favour of their own in-house hosting software. I'm now getting painfully inadequate traffic stats (top 35 pages only, with bot/crawler/spider traffic counted together with real traffic).

And so I embarked on the long, painful journey of writing my own software to read the raw Apache logs (which I've been keeping since 2011) to produce my own stats. There's undoubtedly existing software out there to do this – but I'm also a curious fellow and veteran programmer who decided it would make an interesting programming exercise.

The end result is some fairly good algorithms for distinguishing bots/crawlers/spiders from legitimate traffic. On average it seems that only around 5% of my traffic is from bots/crawlers/spiders. Not a big enough percentage to bother me greatly – except for the occasional rogue that produces thousands of hits on a single gallery script pursuing endless combinations of pages+sorts+filters+whatever.

Today, I'm more concerned with the ever growing number of hacking attempts – particularly those targeting Wordpress vulnerabilities. All I can say is that I'm thankful that my website pre-dates Wordpress!

I'm therefore surprised when I read in this forum that webmasters are going to great lengths to exclude bots/crawlers/spiders. One even mentioned that their bot/crawler/spider traffic was around 50% of their total traffic. That would certainly make blocking measures worthwhile.

Thus far I've only analyzed as far back as 2020 (when my web host switched away from cPanel), but I may run through earlier years at some stage (when I have more time). The problem is the sheer volume of data. Each day's log is typically more than half a million lines. The largest to date was a compressed file of 164 MB, which uncompressed to just under the 32-bit file size limit of 4.3 GB, and which contained over 11 million lines. One can't simply eyeball that many lines, so it's a perpetual case of program-run-repeat.
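For logs of that size, one option is to stream the compressed file rather than unpack it first. The following is only a minimal sketch in Python, assuming gzip-compressed logs; the function name `count_log_lines` is made up for illustration and is not the program described above.

```python
import gzip

def count_log_lines(path):
    """Count the lines in a gzip-compressed log without unpacking it to disk."""
    count = 0
    # Text mode streams the file, so only one line is held in memory at a time.
    # errors="replace" keeps odd bytes in user agents from aborting the run.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for _ in handle:
            count += 1
    return count
```

The same loop body is where any per-line analysis would go, so an 11-million-line day costs time but very little memory.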

My questions to the group:

1. Is anyone interested in lists of bots/crawlers/spiders that I've found over the years? Or are Lucy24's analyses sufficient?

2. Is anyone else out there with similar traffic likewise unconcerned because their bot/crawler/spider percentages are similarly insignificant?

Thanks in advance to anyone who cares to respond.

not2easy

3:59 am on Nov 19, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Hi profshoelace and welcome to WebmasterWorld [webmasterworld.com]

The reports from lucy24 are quite helpful for those who stumble across an unknown UA and wish to find out what kind of a critter it is. Those are not the only reports; it is common to ask what others have observed, because what one user is fine with, another might wish to block.

We used to ask users to search and find whether a UA had been discussed before sharing, so we wouldn't have dozens of similar topics. But since we are currently using Bing for site search here, you are as likely to be taken to some other site when you search for a UA as you are to find it here. Bing has an unusual concept of what "site search" is.

You don't need permission to post information about UAs, but in general, lists of them are not much help without the behavior they exhibit. I don't think that many of us would block based only on a UA. If we see they ignore robots.txt, that is useful. If they show signs of being a distributed UA, that might be helpful to know.

Great to know you have set up a way to analyze your logs, yes that saves a lot of time and effort once you have a routine that works for your needs.

That Welcome link opens in a new tab and helps explain some settings and features of the forums here. The Charter for each forum helps you know what is and isn't acceptable, no mysteries. The Charter is on the drop-down menu at the top of the thread labeled "Forum Options".

G'day to you profshoelace.

profshoelace

4:37 am on Nov 19, 2023 (gmt 0)

Top Contributors Of The Month



Righto, thanks not2easy. I've had a look through both of the links you provided and indeed there are some helpful guidelines there for newbies like me. As this is my first ever "reply", I'm not sure if this will "attach" itself correctly to your message, but if not, then mentioning your name should at least help.

I get where you're coming from with supplying more info than just the bare name of the bot/crawler/spider. It would be easy enough for me to add the offender's IP range to any list that I might produce. But documenting their behaviour might be more work than I'm willing to devote. Manually checking what each of around a thousand different identified offenders actually did would be somewhat labour-intensive!

I'm still curious about the effort that other people expend blocking bots/crawlers/spiders. Truly "good" bots are easy to identify, relatively easy to control and are generally welcome. Truly "evil" bots can be very difficult to identify and control and are totally unwelcome – yet in my case they create a small enough load over and above the "good" ones that it doesn't seem worth pursuing them.

As someone who is quite happy tinkering with .htaccess, I love the idea of blocking some of the worst offenders. I fear, however, that all I'd ever accomplish would be to block some of those relatively benign ones that lie somewhere in between "good" and "evil" – for little overall benefit. Or am I reading this wrong?

not2easy

1:53 pm on Nov 19, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I think most of us who use these forums rely on various parts of .htaccess to deny access. Some of it can be done outside of .htaccess, especially if you manage the server and can edit httpd.conf. Good and Evil are relative; I like to use the terms 'beneficial' or 'unwanted'. At any rate, it is good to know you are familiar with editing your .htaccess files.

Which UAs are beneficial to you (or unwanted) depends on how they interact and their purpose in visiting. When analyzing your logs you might wish to look into a visitor's activity while they are visiting. Regardless of UA, a visitor that requests a html file without its attached resources (or conversely, the resources without the html) is not likely to be of benefit to your efforts. These are usually non-human visitors even if they visit with a human browser UA. Changing UA is easy in most browsers so should not be the only factor to analyze.
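As a rough illustration of the "html without its attached resources" check, here is a minimal Python sketch. It assumes the (ip, path) pairs have already been pulled out of the log; the extension lists and the function name `visitors_without_resources` are illustrative assumptions, not anyone's actual rules.

```python
from collections import defaultdict

# Assumed extension lists; adjust to whatever the site actually serves.
PAGE_SUFFIXES = (".htm", ".html", ".php", "/")
RESOURCE_SUFFIXES = (".css", ".js", ".jpg", ".gif", ".png", ".ico")

def visitors_without_resources(requests):
    """Return the IPs that fetched pages but never any supporting resource.

    `requests` is an iterable of (ip, path) pairs taken from the access log.
    """
    pages = defaultdict(int)
    resources = defaultdict(int)
    for ip, path in requests:
        path = path.split("?", 1)[0].lower()  # ignore the query string
        if path.endswith(RESOURCE_SUFFIXES):
            resources[ip] += 1
        elif path.endswith(PAGE_SUFFIXES):
            pages[ip] += 1
    return sorted(ip for ip in pages if resources[ip] == 0)
```

A returning human whose browser has everything cached can look the same in a single day's log, so this is better treated as one signal among several than as a verdict.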

One way to 'automate' that process is to check request headers. Human visitors send headers that are different from scripted visitors. Because lucy24 started collecting this information years ago, she has her own set of requirements to determine wanted and unwanted behavior. On an ecommerce site you might develop a different set of requirements. But these are some things that are generally best developed using your traffic to determine rather than snippets to copy and paste into .htaccess files.

I'm sharing a few links to related discussions. These are not fresh (2017 - 2018) so they do not use the most recent syntax for Apache, but they do explain it well.
How to Check Header Fields (5/17) at [webmasterworld.com...]
Getting started: header-based access controls (6/18) at [webmasterworld.com...]
There are other threads, most of them in that general time frame, some older. I used site search and "check headers" to dig these up, there are more. ;)

lucy24

7:00 pm on Nov 19, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Getting started: header-based access controls
Ooh, I'd forgotten that thread--and, in particular, had forgotten how many years ago it was.

Any thread more than a few years old will probably cite the Allow/Deny syntax used in Apache 2.2 and earlier. By now, everyone should be on 2.4, using “Require” syntax. If you have a batch of existing rules, they can all be updated with a quick RegEx or two ;)
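To give an idea of what such a RegEx pass might look like, here is a hedged Python sketch that rewrites a few common 2.2-style lines into 2.4 "Require" form. The patterns and the name `convert_line` are assumptions for illustration; real rule sets will contain forms this does not cover.

```python
import re

# Each pair is (pattern for an Apache 2.2 line, replacement in 2.4 syntax).
CONVERSIONS = [
    (re.compile(r"^(\s*)Deny\s+from\s+env=(\S+)\s*$", re.IGNORECASE),
     r"\1Require not env \2"),
    (re.compile(r"^(\s*)Deny\s+from\s+(\d[\d./]*)\s*$", re.IGNORECASE),
     r"\1Require not ip \2"),
    (re.compile(r"^(\s*)Allow\s+from\s+all\s*$", re.IGNORECASE),
     r"\1Require all granted"),
]

def convert_line(line):
    """Return the 2.4 form of a known 2.2 line, or the line unchanged."""
    for pattern, replacement in CONVERSIONS:
        new_line, count = pattern.subn(replacement, line)
        if count:
            return new_line
    return line
```

The output still needs a manual pass: negated "Require not" lines only work inside a <RequireAll> block alongside a positive Require, and any "Order" lines have to be removed by hand.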

profshoelace

12:59 am on Nov 20, 2023 (gmt 0)

Top Contributors Of The Month



Thanks, not2easy, for the helpful links to relevant discussions. I've just spent most of this morning reading through these and others.

Considering my website's ancient ancestry, I only have about a dozen .php pages compared with more than 900 static .htm pages (my CMS is all custom-written off-line code). It's mostly only truly dynamic pages like image galleries that are written in .php. This does limit my ability to do logging and inspect headers – but is probably sufficient. After all, the bad actors generally try to grab everything in sight, so they'll eventually hit one of those logging-enabled pages.

The "good" robots+crawlers+spiders generally identify themselves, either with "bot", "crawl" or "+http" in their user agent string. For those that don't identify themselves so clearly, header logging will undoubtedly help build my understanding.

I'm still expanding my 2023 list of user agents. The stats I'm getting as a result are so much more meaningful. Even so, my robot+crawler+spider percentages are still only between about 3% and 8%, so eliminating them is not a huge priority. My website is mainly informational (think Wikipedia), so there's nothing critical, no e-commerce, no private data, no logins – you get the idea. The only issue to me is wasted bandwidth + polluted stats.

Thanks also, lucy24, for the .htaccess advice. Currently my .htaccess rules are mostly focussed on core stuff:
• Redirecting parked domains to main domain
• Redirecting http to https
• Handling renamed pages (using "RedirectMatch permanent")
• Handling deleted pages (using "RewriteRule")

Thus far I've only ever added one referer-based rule (many years ago) for one particularly persistent leech:
RewriteCond %{HTTP_REFERER} ^.*example-leech-domain-name.*$ [NC]
RewriteRule .*\.(jpg|gif|png)$ - [F,NC]

The newer .htaccess syntax with "SetEnvIf" sounds worthwhile, so I may look at rewriting any with older syntax, plus adding suggested stuff like:
SetEnvIf User-Agent ^$ no_agent
SetEnvIf Accept ^$ no_accept
followed by the "Require env" code within a "RequireNone" wrapper.

Which brings up another point: This won't actually eliminate the problem, only alter the response, correct?
• The recipient will get a 40x response instead of the requested resource.
• My server still serves up a (possibly smaller) file.
• Most of my website is static files, so there's actually more load on my server for it to load the php parser in order to serve the 40x.php page.
• My Apache logs still contain one line for that access.
• My stats code (which reads those Apache logs) still tallies that bad access – just in a different category (ie. "40x Return Code" vs. "Suspected Robot").
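The last dot-point can be sketched as follows. This is a minimal Python example assuming the standard Apache combined log format and no embedded double quotes in the request or user agent; the "bot" substring test is only a stand-in for a real robot check, and the category names are borrowed from the dot-point above.

```python
import re
from collections import Counter

# Apache "combined" format: ip, identity, user, [time], "request", status, size, "referer", "agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def tally_by_category(lines):
    """Count log lines per category; a blocked request lands in the 40x bucket."""
    tally = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            tally["Unparsed"] += 1
            continue
        status = int(match.group("status"))
        if 400 <= status < 500:
            tally["40x Return Code"] += 1
        elif "bot" in match.group("agent").lower():  # placeholder robot test
            tally["Suspected Robot"] += 1
        else:
            tally["Visitor"] += 1
    return tally
```

Checking the status first means a blocked robot moves from "Suspected Robot" to "40x Return Code", which is exactly the shift described above: the line is still logged and still counted, only in a different column.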

Is it possible to instead serve up nothing? Close the connection? Doing so would eliminate all of the five dot-points above. But doing so may be considered bad practice.

I'm always conscious of trying to stay focused on the big picture (adding content) rather than incidentals (back-end stuff). That's why I'm wary of venturing too far down this rabbit-hole unless – as others have indicated – the robots+crawlers+spiders potentially account for as much as 50% of traffic. I find it hard to believe that I've only ever been seeing the tip of the iceberg – but if so, I'd certainly like to be proactive in melting that particular iceberg!

lucy24

2:35 am on Nov 20, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This does limit my ability to do logging and inspect headers
It doesn't, really; you can include a php footer in an otherwise static-html page using SSI (Server-Side Include). No, you don't need to change all the page extensions to .shtml; just add a directive in htaccess.

Handling renamed pages (using "RedirectMatch permanent")
Combining mod_alias (Redirect by that name) and mod_rewrite will not break anything, but most people will strongly advise against it because things won't necessarily execute in the order you want them to. Luckily it's simple to change all your mod_alias rules to mod_rewrite syntax.

Most of my website is static files, so there's actually more load on my server for it to load the php parser in order to serve the 40x.php page.
Did you explain at some point why it has to be php? There are several alternatives to a 403 page, including returning a literal string that you specify in htaccess.

Although it's pretty uncommon for an unwanted robot to be robots.txt compliant, it never hurts to Disallow them by name. If you're concerned with server load, the only thing better than a blocked request is a request that is never made in the first place.

In any case: If you're working with htaccess it presumably means you're on shared hosting. If the host can't handle a simple php page, it may be time to go shopping. Years ago, someone hereabouts (iBill? keyplyr?) counted up the number of separate server calls required for a single WordPress page--just the page, not its supporting files--and came up with something like 30. So, with a combination of homegrown CMS and hard-coded html, you are already way ahead of the game.

profshoelace

3:38 am on Nov 20, 2023 (gmt 0)

Top Contributors Of The Month



Thanks once again, lucy24. I've never experimented with Server-Side Include. Years ago I had every .htm page also calling a .php script (a banner script). In those days it required the "include" to be on each page.

I'm also aware that I can instruct Apache to treat each .htm page as a .php page and render it accordingly. My problem, however, is the unique difficulty of having a website that occasionally gets 50,000+ visitors per day, sometimes with hundred(s) of simultaneous instances, which web hosts really don't like – particularly with shared hosting. Combine this with the fact that the website really doesn't generate any significant income – certainly not enough to pay for dedicated hosting that can handle this.

Luckily, my site's minimalist origins have kept it running when most other heavier sites would have fallen over years ago. However, this breaks down once the website also has to load a .php server for each page access. That would more than double the load that is otherwise spent simply delivering the static content.

True, my 403 page doesn't really have to be a .php, but that's just me being meticulous.

My robots.txt page does indeed disallow a whole bunch of robots by name, plus restricts all bots from accessing certain directories / parameters.

lucy24

4:54 pm on Nov 20, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In those days it required the "include" to be on each page.
My personal Final Frontier was allowing BBEdit to do an unsupervised multi-file replace to add the appropriate line to each and every page. This would have been at least ten years after I first met Regular Expressions; it takes time to work up the courage.

profshoelace

12:19 am on Nov 21, 2023 (gmt 0)

Top Contributors Of The Month



lucy24, it's interesting to hear your broad range of experience. I'm guessing that – like me – you've been a web developer for decades?

At this stage I probably won't implement any .php scripts site-wide. No matter how I look at it, having each page execute even a snippet of .php will place double the load on the server compared to simply serving up a static page. I will, however, add that snippet to the few .php pages that are already in place.

For your interest, when I first decided to create a comprehensive list of bots+crawlers+spiders, I began by writing a program to read all of the Apache logs for a whole year – a total of 128 million lines – and extract all of the unique user agent strings. You can probably guess the result – Version 1 resulted in a massive output file with hundreds of thousands of unique entries.

Version 2 replaced all embedded version digits (eg. Mozilla/5.0, Firefox/60.0) with simplified hashes (Mozilla/#, Firefox/#), while Version 3 added several further simplifications:
• More comprehensive list of prefixes with trailing version numbers;
• Strings like "Mastodon" and "Pleroma" that commonly add a random domain name as a suffix, which was replaced with "[random-domain]";
• Removed spaces before delimiters (but not between delimiters) and concatenated multiple spaces.

The output file was now a much more manageable 2.97 MB with 27,197 entries.
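The Version 2 and Version 3 simplifications might look something like this in Python. It is a sketch only: the original was written in Excel's Visual Basic, the exact rules are not given in the post, and reading "delimiters" as ";" and ")" is an assumption.

```python
import re
from collections import Counter

def simplify_user_agent(agent):
    """Collapse version numbers and stray spacing so near-duplicates merge."""
    # Version digits after a slash: Firefox/60.0 -> Firefox/#
    agent = re.sub(r"/\d+(?:\.\d+)*", "/#", agent)
    # Remove spaces before ';' and ')' (assumed to be the delimiters meant above)
    agent = re.sub(r"\s+([;)])", r"\1", agent)
    # Concatenate runs of spaces into one
    agent = re.sub(r" {2,}", " ", agent)
    return agent.strip()

def count_simplified(agents):
    """Tally how many raw strings fall under each simplified form."""
    return Counter(simplify_user_agent(a) for a in agents)
```

For example, "Firefox/60.0" and "Firefox/61.0" both become "Firefox/#", which is where most of the shrinkage from hundreds of thousands of entries would come from.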

Version 4 got more ambitious, replacing all Android device names (around 6,000 unique names) with a generic "[android-device]", while Version 5 did the same with iOS devices. These two changes reduced the output file by roughly another 70-75%, to only 0.79 MB and 8,021 entries.

It was then easy enough (although time-consuming) to eyeball the lot in a spreadsheet and figure out the best way to scan for obvious bots+crawlers+spiders.

I'm now at the stage of working on Version 6 – figuring out the not-so-obvious ones. My path has thus diverged from pure user-agent identification and onto tracking behavior. It's a challenge!

SumGuy

2:51 am on Nov 21, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



Have you ever looked at the IPs hitting your site and tried to separate the cloud / hosting / proxy IPs from the end-user (likely human) IPs? I'm assuming that, beyond useful search engines, you'd rather your site be surfed by people than scraped and probed by bots?

profshoelace

4:25 am on Nov 21, 2023 (gmt 0)

Top Contributors Of The Month



I'm sure that everyone would rather their site(s) be surfed rather than scraped. Sadly, the latter is an ever-increasing problem for all of us.

The IP route does seem promising. It's a breeze to include an ever-growing blacklist (and/or whitelist) of IP ranges into .htaccess.

But I'm curious how far one can take this before it gets out of hand. Do sites inevitably end up with (black|white)lists thousands of lines long? How much does this slow each request as Apache checks through such a lengthy list? Does the list in itself become an admin challenge to maintain? I'm pretty curious to hear the experiences of anyone who's gone down this route.

lucy24

6:04 am on Nov 21, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do sites inevitably end up with (black|white)lists thousands of lines long?
Oh, absolutely not, except in the occasional case of an unfortunate newbie who doesn't know about IP ranges and tries to block by exact-to-the-last-digit addresses.

Some years back, I switched to primarily header-based blocking. I've had to add back some IPs, but the “Require ip” section of my access controls is less than 100 lines (out of 600 or so total, I just counted), mainly devoted to colos or hosts that are especially bad about sending humanoid robots. The file would be still smaller except that I'm pretty permissive about which robots are allowed in.

Edit: Granted, too, my sites are much smaller than yours. In practical terms, my main site's Apache logs rarely get plumper than 1MB: so small, that I can process them in javascript. (Because I haven't the energy to learn another language, alas.)

SumGuy

11:53 pm on Nov 21, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



I host both a mail and web server on my company servers since about 1998. My internet connection is a single static IP.

For most of that time I've been focused on IP-blocking as a way to control (and essentially eliminate) email spam.

On the web-server side, my server handed out a few 403s based on IP on a daily basis for many years, but it's only in the last 8 years that I've really been looking at my web logs, and I only started implementing router-based IP blocking in about 2018 or 2019. In the past 4 or 5 years of looking in detail at where my web hits are coming from, I've seen (and blocked) increasing numbers of IP addresses.

I do all my IP blocking in my router, so my daily web logs are nice and clean and small.

I have 3 IP blocking lists in my router. Anything that I want my router to log goes to a syslog server (on a raspberry pi).

List 1 -> drop everything (regardless of port), and don't log

List 2 -> if the destination is port 25 (email) and the source IP is on this list, it gets dropped (and logged)

List 3 -> if the destination is port 80 or 443 and the source IP is on this list, it gets dropped (and logged)

I have no IPv6 address (that I care to use or implement), so these all pertain to IPv4. There are 3.681 billion routable IPv4 addresses that are theoretically usable today. So for my math, 100% = 3.681 billion IPs.

The size of List 1 (my really bad, block them and don't log them IP's) is 16.8% of the internet. 21.4k CIDR's.

The size of List 3 (drop these IP's, they won't see my web server, but log them) is 21.4% of the internet. 55.8k CIDR's.

List 3 does not overlap List 1. So when added together I'm preventing 1.4 billion IP's or 38% of the internet from hitting my website.

Part of what goes into List 1 are incoming unsolicited packets on ports other than 25, 80, and 443. Port scanning bots, pings and other ICMP packets, etc. I would see upwards of 30k log entries per day of the router dropping those sorts of packets, and now it's down to a few hundred.

The smallest IP block I implement is a /24 CIDR. I get pretty much all of my CIDR entries from the Hurricane Electric BGP lookup tool, and use on-line tools to combine / condense the list entries.
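The combine / condense step can also be done offline. Below is a minimal Python sketch using the standard `ipaddress` module; the function names are illustrative, the sample ranges in the usage note are documentation placeholders, and the 3.681 billion figure is simply the one quoted in this post.

```python
import ipaddress

ROUTABLE_IPV4 = 3.681e9  # usable IPv4 total as quoted in this post

def condense(cidrs):
    """Merge adjacent and overlapping IPv4 CIDRs into the shortest equivalent list."""
    networks = [ipaddress.ip_network(c, strict=False) for c in cidrs]
    return list(ipaddress.collapse_addresses(networks))

def share_of_internet(networks):
    """Fraction of the routable IPv4 space covered by non-overlapping networks."""
    return sum(n.num_addresses for n in networks) / ROUTABLE_IPV4
```

Feeding in 192.0.2.0/25, 192.0.2.128/25, 198.51.100.0/24 and 198.51.100.7/32 gives back just two /24 entries. The same share calculation reproduces the totals above: 16.8% + 21.4% = 38.2%, and 38.2% of 3.681 billion is about 1.4 billion addresses.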

profshoelace

12:45 am on Nov 23, 2023 (gmt 0)

Top Contributors Of The Month



Thanks, lucy24 and SumGuy, this feedback is invaluable! It gives me some sense of the scope of both the problem and the solution. There's no sense in tackling a problem that is too small – nor implementing a solution that is too large.

As someone who loves numbers, I'm fascinated by some of the stats you've given, SumGuy. You're blocking some 38% of the Internet? That's some serious dedication putting that many blocks in place – and, presumably, keeping those lists maintained! Unless, of course, you simply leave blocks in place indefinitely until you're notified of a problem? On a company website that's the obvious solution – and would have been my preferred route if not for my website being designed for worldwide consumption.

I'm also envious of your ability to tackle this in the router. Unfortunately, being on shared hosting, I don't have access to either the router or to httpd.conf – so any solution I implement will have to be via .htaccess. As someone who also loves efficiency, I'm pleased to hear that Lucy24 is managing this with such a relatively compact list of rules.

By the way, Lucy24, I get where you're coming from with not wanting to learn new languages all the time. I've been programming in a variety of languages all the way back to assembler in the 1970s, so I sometimes use "old" but nonetheless versatile tools. Folks may be surprised that my current analysis of user agents for 2023 was done using good old-fashioned Excel with its inbuilt Visual Basic. It took just under an hour to process almost 65 GB (200 million lines) of Apache logs.

profshoelace

1:29 am on Nov 23, 2023 (gmt 0)

Top Contributors Of The Month



Just a quick side-note: Is it just me, or is "Cubot" a really ill-conceived smartphone brand name? My own robot analysis searches for user agents containing strings ending in "bot" – but then needs to specifically exclude "Cubot". I'm seriously tempted to remove that hard-coding!
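One way to avoid scattering such hard-coding through the analysis is to keep the exceptions in a single list. This is a Python sketch under assumptions: the marker list follows the "bot", "crawl" and "+http" strings mentioned earlier in the thread, and the sample user agents in the usage note are made up.

```python
import re

# Markers that self-declared robots tend to carry in their user agent.
BOT_MARKERS = re.compile(r"bot\b|crawl|\+http", re.IGNORECASE)

# Known innocent strings that would otherwise trip the "bot" marker.
FALSE_POSITIVES = ("cubot",)

def looks_like_declared_bot(agent):
    """True if the user agent carries a robot marker after removing known exceptions."""
    cleaned = agent.lower()
    for word in FALSE_POSITIVES:
        cleaned = cleaned.replace(word, "")
    return BOT_MARKERS.search(cleaned) is not None
```

A made-up agent such as "Mozilla/5.0 (Linux; Android 9; CUBOT EXAMPLE) AppleWebKit/537.36" is then treated as a phone, while "ExampleBot/1.0 (+http://example.com/bot)" is still flagged, and the next awkward brand name is a one-word addition to the tuple.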

tangor

3:08 am on Nov 23, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm in a pretty good stage of webmaster life: I know which bridges to burn and which to cross.

For me it is the heavy hitters and the less desirable that need to be dealt a broad brush. All the rest are cyclic in nature, ever shifting in identifiers, and often useless a few months from now as that ever expanding effort by bad actors to disguise and pilfer continues.

Only a few are "etched in stone" ... and the rest is simply "noise". Headers do the heavy lifting; IP ranges are very effective. When only the best will do xxx.xxx.xxx.xxx !

lucy24

6:17 pm on Nov 23, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



When only the best will do xxx.xxx.xxx.xxx
Or possibly
xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx
(though I haven't actually seen any of these yet) if you have an IPv6 address.

I keep these in a separate block of htaccess, for the rare but annoying case where some individual computer has been compromised so you get an avalanche of robotic hits from an otherwise human IP. Re-check in a month or two and they will probably be gone.

SumGuy

12:55 am on Nov 24, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



I could approach IP-based blocking from a few different angles and post pages of experiences and numbers pertaining to trends over time. But I think it's always going to boil down to this:

There are large entities that are assigned hundreds of thousands if not millions of IPs that a website owner/operator would get no tangible benefit in getting hits from (in the best case) and would suffer some harm from (in the worst case). A "short list" of the big players that will give you the biggest bang for the buck, in terms of keeping your IP-blocking list short while covering a lot of IP space, is as follows:

Amazon, Microsoft and OVH IP's assigned in the 52.x.x.x and 54.x.x.x IP range, China (Alibaba), Colocrossing, Google Cloud (34.x.x.x, 35.x.x.x), Leaseweb, more Amazon and Microsoft (3.x.x.x, 13.x.x.x, 18.x.x.x, 20.x.x.x, 34.x.x.x, 50.x.x.x), Hetzner, Digital Ocean, more OVH, M247, Unified Layer, Tencent, Softlayer, Datacamp, White label coloco, Network solutions hosting, Internet Vikings, GleSYS, Choopa / vultr.

Then there are country-based IPs assigned to residential / retail and business end users that have a high degree of infected devices (routers mostly, I think) that are used as proxies to get to your website (or relay spam, etc). To the extent that you do (or want to, or don't care to) get useful traffic from retail ISPs in these countries, these are my picks for countries worthy of IP-blocking (because they DO generate a lot of garbage hits): China, India, Brazil (or a good chunk of Latin America and South America), South Korea, Malaysia, Philippines, Iran, Russia, Ukraine, and a handful of other Eastern European and South-East Asian countries. Oddly, I don't see a lot of bad web hits from Mexico, but I do from South Korea.

But your first line of defence would be to first block the cloud IP services from Microsoft, Amazon, Google, Hetzner, Digital Ocean, OVH, Tencent, Alibaba, M247, then the second and third tier hosting / cloud providers. But do understand that google-bot, bing-bot, yahoo-slurp, yandex, internet-archive, duck, all those search bots have full access to my site. As a side note, I block fecebook and tiktok from accessing my site.
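On the log-analysis side, the same idea (block the hosting ranges, but let the search bots through) can be sketched in Python. The ranges below are documentation placeholders, not any provider's real allocations, and the names are hypothetical.

```python
import ipaddress

# Placeholder ranges only; real lists would come from a BGP lookup.
ALLOWED_RANGES = [ipaddress.ip_network("192.0.2.0/28")]   # e.g. a welcome search bot
BLOCKED_RANGES = [ipaddress.ip_network(c) for c in ("192.0.2.0/24", "198.51.100.0/24")]

def is_blocked(ip):
    """True if the address is in a blocked range and not in an allowed one."""
    address = ipaddress.ip_address(ip)
    if any(address in net for net in ALLOWED_RANGES):
        return False  # allow list wins, so a bot's range can sit inside a cloud block
    return any(address in net for net in BLOCKED_RANGES)
```

Checking the allow list first is a design choice: it lets one broad entry cover a whole provider while a small exception keeps a wanted crawler hosted there.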

I would rather not block legitimate human visitors from browsing my company's website, regardless of where in the world they are. But what we sell is used by scientists, doctors and engineers in the biotech field and can cost upwards of $30k. We have sold to mainland China in the past, but our customers are generally in the G7 countries. We have never sold to India, and I've written off China in terms of future sales, so I block all Chinese IPs (using large CIDRs) when they pop up in the logs.

My website is not tied into any ad network, there are no links to googletag or any external domain, and I serve no cookies (so you won't see a pop-up message on my site asking if it's ok or not ok to track you or force you to agree to cookies). It's a practically static site, not much more sophisticated than a yellow-pages ad that you'd see in a phone book, but a lot of resource material (PDF files). It's designed to be discovered through a web search, browsed, and then a phone call or email contact. Not configured for e-commerce.

SumGuy

1:06 am on Nov 24, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



I should add that in terms of blocking based on user-agent, this is what I'm currently doing:

Block (give them a 410 code) if the user-agent IS:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0

or if the user-agent contains this:

LRX21T
MRA58N
OPD3.170816
59.0.3071.115
69.0.3497.81
90.0.4430.85

That's it. That's all I do.
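Expressed as code, those two rules are an exact-match set plus a substring list. The Python below is a sketch for use in log analysis; the strings are copied from the lists above, while the function name and the default status are assumptions.

```python
# Exact user agents to refuse (copied from the post above).
EXACT_AGENTS = {
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0",
}

# Fragments that trigger a refusal wherever they appear in the user agent.
SUBSTRINGS = ("LRX21T", "MRA58N", "OPD3.170816",
              "59.0.3071.115", "69.0.3497.81", "90.0.4430.85")

def response_status(agent, default=200):
    """Return 410 for a listed user agent, otherwise the default status."""
    if agent in EXACT_AGENTS or any(s in agent for s in SUBSTRINGS):
        return 410
    return default
```

Running the function over a day's user agents shows how many requests the rules would have caught before they go anywhere near the server configuration.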

blend27

4:49 pm on Nov 24, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@profshoelace - Welcome to WebmasterWorld

You mentioned:
SetEnvIf Accept ^$ no_accept

I'd assume you're talking about the absence of an Accept header here. I'd be careful with that one. From what I understand you have tons of traffic, which means people might be sharing your site/URLs via different messengers/apps as well.

Here are the headers for the latest version of WhatsApp
"Cache-Control": "no-cache",
"user-agent": "WhatsApp/2.2338.9 W",
"host": "www.example.com",
"X-REWRITE-URL": "/path-to-url-shared/",
"connection": "Keep-Alive",
"Accept-Encoding": "gzip, deflate",
"content-length": "0"

Why this might be important. Apps like that generate a preview/snippet of your site based on OG:tags if you have any:
<meta property="og:title" content="ExAmple.com: Widgets"/>
<meta property="og:url" content="https://www.example.com/"/>
<meta property="og:image" content="https://www.example.com/path_to_image_display_in_snipet.png"/>
<meta property="og:image:width" content="400"/>
<meta property="og:image:height" content="400"/>
<meta property="og:type" content="product"/>
<meta property="og:site_name" content="ExAmple.com"/>


If you block based on Lack of Accept Header, the user will see just URL in their App.

If you don't block it and basically have four OG (Open Graph) meta tags in place in the <head> section, the user will see the image and description in their app, which drives traffic to your site.

Extra tip: For requests from apps like WhatsApp, create a separate template (you can even use PHP that creates dynamic OG tags) for those requests, serving just that part (wrapped in HTML). Less data sent back to the app means less bandwidth and a greener planet for us all. It also takes care of scrapers that pretend to be WhatsApp.

lucy24

6:04 pm on Nov 24, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If the element “WhatsApp” is part of the UA string, you can easily unset the variable, in the form
BrowserMatch WhatsApp !no_accept
Any header-based access controls will have plenty of hole-poking of this kind.
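The same set-then-unset logic can be tried out against logged headers before touching htaccess. This Python sketch mirrors the no_agent / no_accept variables from earlier in the thread; the function name is hypothetical, and treating a missing header the same as an empty one is an assumption about how the ^$ test behaves.

```python
def header_flags(headers):
    """Return the set of flags a request would pick up from its headers.

    `headers` is a dict of header name to value, as logged by the site.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    agent = lowered.get("user-agent", "")
    flags = set()
    if agent == "":
        flags.add("no_agent")
    if lowered.get("accept", "") == "":
        flags.add("no_accept")
    # Hole-poking: a WhatsApp preview request sends no Accept header,
    # so clear that flag again, as the BrowserMatch line does.
    if "WhatsApp" in agent:
        flags.discard("no_accept")
    return flags
```

With the WhatsApp headers quoted above the result is an empty set, while a request with neither header collects both flags.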

Incidentally, I've never seen a WhatsApp UA requesting anything other than html--and the ones that aren't blocked get only a 206. So I'm not sure what kind of a preview they can generate.

Edit: I believe the 206 response is because the request includes
Range: bytes=0-299999
(first 300K only? how fat do they think my pages are?)

blend27

9:38 pm on Nov 24, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sí lucy, Sí...!

-- (first 300K only? how fat do they think my pages are?)

It is up to maestra to make'em as "enjuto" (lean) as possible.

But yes, somewhat: to get the OG data they need to hit the .html (whatever is on the menu - now I am thinking leftovers from yesterday, again...), but to display the image next to it, the image has to come from somewhere - ¿sí o no?

I do not see a "bytes" Range header from WhatsApp - food for thought for another thread.

profshoelace

11:01 pm on Nov 24, 2023 (gmt 0)

Top Contributors Of The Month



Just returned to this discussion to discover a bunch of new and interesting replies.

SumGuy: Great detail regarding the problematic IP ranges. Blocking cloud-based providers sounds a perfect first line of defense for my purposes because no end-user will likely be visiting from the cloud. Country-based blocking is out for me because I welcome visits from everywhere (shoelace advice is universal) even if the real people from those countries are only a small percentage compared to the bots.
Your user agent blocking is very concise. Most of your blocks appear in my logs – and look innocuous – which warrants further analysis on my end.
You also gave me my "laugh of the day" with your "fecebook".

blend27: Thanks also for your reply. Interesting point against blocking requests without an "Accept Header". My website doesn't have OG meta tags, relying instead on traditional meta tags for "thumbnail" and "description". It's annoying to discover that yet another overlapping standard exists! Either way, if a website tried to retrieve the resources linked in those meta tags, the same blocking problem would occur, so I'll definitely rethink blocking on lack of "Accept Header". I may, however, try to figure a way to flag such requests for my back-end analysis so that I can better separate "views" from "previews".

lucy24: Your solution to the above WhatsApp example is indeed simple.
As for "fat pages", my own largest .htm file is only 100k, while my largest .php page is around 400k on the server (it includes its own complete php-based database) and only around 50k when served. But I guess you and I are fairly unusual in keeping our pages so lean.
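
One possible way to flag such requests for later analysis, sketched in Python. It assumes the Accept header has been added to the Apache log as an extra quoted field at the end of the combined format (for example by appending "%{Accept}i" to the LogFormat); that log layout and the names below are assumptions, not my current setup:

```python
import re

# One quoted field, allowing backslash-escaped characters inside it
QUOTED = r'"((?:[^"\\]|\\.)*)"'

# Combined log format plus one extra quoted field holding the Accept header
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] ' + QUOTED + r' (\d{3}) (\S+) '
    + QUOTED + " " + QUOTED + " " + QUOTED + r'$'
)


def accept_flag(line: str):
    """Return 'no-accept', 'has-accept', or None if the line does not parse."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    accept = match.group(8)
    # Apache writes "-" when the requested header was not sent
    return "no-accept" if accept in ("", "-") else "has-accept"
```

Lines flagged "no-accept" are only candidates for the "previews" pile; they would still need to be checked against the user agent before being counted separately from "views".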

lucy24

5:21 am on Nov 25, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: detour to dictionary (yes, I have a whole shelf* of physical, printed dead-tree dictionaries) ::

“thin, skinny”

They’re fairly enjutas on the whole (no, I am not going back to the shelf to confirm that web pages are grammatically feminine), with the exception of ebooks which are, of necessity, as big as they happen to be, though I do chop up the fatter volumes. I get uneasy when a natural page goes over 50k. And even books don't go over 1MB. (I find one exception at a bloated 1.2MB, but it's the General Index to a six-volume set, so the weight is all in links. No illustrations worth mentioning; it all averages out.)

* Actually a shelf and a half, but the Greek, Latin and Sanskrit dictionaries have to share space with the cat combs and spray bottle. And the OED, like grandfather's clock, is too large for any shelf, so it has stood forty years on the floor.

profshoelace

11:27 pm on Nov 25, 2023 (gmt 0)

Top Contributors Of The Month



lucy24, I never thought that the word "lean" would trip up anybody – especially someone so well equipped with dictionaries! I guess I'm showing my age because when I think of the opposite of "fat", it invariably triggers the memory of an old nursery rhyme, which I learned as:

"Jack Sprat could eat no fat, his wife could eat no lean.
And so betwixt the two of them they licked the platter clean."


My fattest web page (100kb) is a text-heavy one containing feedback from hundreds of visitors who have learned how to avoid the "Granny Knot", such that their shoelace knots no longer come loose. It hardly seemed worth paginating. Otherwise, I'm fairly happy with my median web page size of around 24kb.

Back to the topic at hand: I'm making slow but steady progress working through the past couple of years' user agents. One relatively new thing is the expanding "Fediverse", with recent appearances by Akkoma, Firefish and Calckey, along with older user agents like Friendica, Mastodon, Misskey, Lemmy, Pleroma, etc. Each is sort of like a distributed variant of other social media giants like Facebook, Instagram, Twitter, or perhaps WhatsApp, making them difficult to categorize. How are others treating these user agents? Where do they stand in the continuum between "human" and "robot"? And if we do consider them bots, are we destined to add new entrants indefinitely?
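
For what it's worth, here is roughly how I'm tagging them for now, as a minimal Python sketch. The token list is simply the names mentioned above, the category label is my own invention, and matching on a case-insensitive substring is an assumption about how these agents identify themselves:

```python
# Names taken from the user agents mentioned above; extend as new ones appear
FEDIVERSE_TOKENS = (
    "akkoma", "firefish", "calckey", "friendica",
    "mastodon", "misskey", "lemmy", "pleroma",
)


def classify_user_agent(user_agent: str) -> str:
    """Tag a user agent as 'fediverse:<name>' or leave it 'unclassified'."""
    ua = user_agent.lower()
    for token in FEDIVERSE_TOKENS:
        if token in ua:
            return "fediverse:" + token
    return "unclassified"
```

Keeping them in their own category, rather than forcing them into "human" or "robot", at least lets me defer the decision, and a new entrant only costs one more entry in the list.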

lucy24

12:19 am on Nov 26, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I never thought that the word "lean" would trip up anybody
Haha, no, it was the “enjuto” in blend27’s post that tripped me up :)