G'day from a new member interested in bots/crawlers/spiders

G'day all from Australia. I'm a new member interested in bots.

My humble website has been up since 1999. For most of the past two decades I've simply relied on the stats available in cPanel to sift the suspected bot/crawler/spider traffic from "real" traffic and thus give meaningful visitor stats. These stats were mostly fine for my needs, although there were gaps whenever AWStats crashed due to a huge traffic spike (my website is frequently posted on high-profile forums like Reddit).

More recently, my web host abandoned cPanel in favour of their own in-house hosting software. I'm now getting painfully inadequate traffic stats (top 35 pages only, with bot/crawler/spider traffic counted together with real traffic).

And so I embarked on the long, painful journey of writing my own software to read the raw Apache logs (which I've been keeping since 2011) to produce my own stats. There's undoubtedly existing software out there to do this – but I'm also a curious fellow and veteran programmer who decided it would make an interesting programming exercise.

The end result is a set of fairly good algorithms for distinguishing bots/crawlers/spiders from legitimate traffic. On average it seems that only around 5% of my traffic is from bots/crawlers/spiders. That's not a big enough percentage to bother me greatly – except for the occasional rogue that produces thousands of hits on a single gallery script, pursuing endless combinations of pages+sorts+filters+whatever.
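For anyone wanting to try something similar, here is a minimal Python sketch of the general idea. It is not my actual code, just an illustration under some assumptions: the logs are in Apache's standard "combined" format, the token list in `BOT_TOKENS` is a made-up starting point rather than a vetted list, and the function names and the `threshold` parameter are hypothetical. It flags obvious bots by user-agent and finds the "rogue" pattern of one client requesting a single script with a huge number of distinct query strings.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Illustrative tokens only (an assumption); a real list needs tuning.
BOT_TOKENS = ("bot", "crawl", "spider", "slurp")


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it doesn't parse."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None


def looks_like_bot(agent):
    """Crude check: a blank user-agent or one containing a known token."""
    agent = agent.lower()
    return agent in ("", "-") or any(token in agent for token in BOT_TOKENS)


def find_query_floods(lines, threshold=1000):
    """Map (ip, script path) to its number of distinct query strings,
    keeping only pairs that reach the threshold."""
    variants = defaultdict(set)
    for line in lines:
        hit = parse_line(line)
        if hit is None:
            continue
        parts = urlsplit(hit["url"])
        if parts.query:
            variants[(hit["ip"], parts.path)].add(parts.query)
    return {key: len(qs) for key, qs in variants.items() if len(qs) >= threshold}
```

A user-agent check like this only catches bots that identify themselves; the query-flood count is what picks up the ones that pretend to be browsers.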

Today, I'm more concerned with the ever-growing number of hacking attempts – particularly those targeting WordPress vulnerabilities. All I can say is that I'm thankful that my website pre-dates WordPress!

I'm therefore surprised when I read in this forum that webmasters are going to great lengths to exclude bots/crawlers/spiders. One even mentioned that their bot/crawler/spider traffic was around 50% of their total traffic. That would certainly make blocking measures worthwhile.

Thus far I've only analyzed as far back as 2020 (when my web host switched away from cPanel), but I may run through earlier years at some stage (when I have more time). The problem is the sheer volume of data. Each day's log is typically more than half a million lines. The largest to date was a compressed file of 164MB, which uncompressed to just under the 32-bit file limit of 4.3GB, and which contained over 11 million lines. One can't simply eyeball that many lines, so it's a perpetual case of program-run-repeat.
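With files this size, the trick is to never load or uncompress a whole log at once. A rough Python sketch of that approach follows; it assumes the compressed logs are gzip files with a `.gz` extension, and the function names are just placeholders. It reads one line at a time straight from the compressed file, so memory use stays flat no matter how big the log is.

```python
import gzip


def iter_log_lines(path):
    """Yield lines from a plain or gzip-compressed log, one at a time."""
    opener = gzip.open if str(path).endswith(".gz") else open
    # errors="replace" keeps stray non-UTF-8 bytes (common in hostile
    # requests) from aborting the whole run.
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")


def count_lines(path):
    """Count lines without holding more than one in memory."""
    return sum(1 for _ in iter_log_lines(path))
```

Because `iter_log_lines` is a generator, it can feed whatever per-line analysis comes next without the uncompressed data ever touching the disk.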

My questions to the group:

1. Is anyone interested in lists of bots/crawlers/spiders that I've found over the years? Or are Lucy24's analyses sufficient?

2. Is anyone else out there with similar traffic likewise unconcerned because their bot/crawler/spider percentages are similarly insignificant?

Thanks in advance to anyone who cares to respond.

"Cache-Control": "no-cache", "user-agent": "WhatsApp/2.2338.9 W", "host": "www.example.com", "X-REWRITE-URL": "/path-to-url-shared/", "connection": "Keep-Alive", "Accept-Encoding": "gzip, deflate", "content-length": "0"

<meta property="og:title" content="ExAmple.com: Widgets"/>
<meta property="og:url" content="https://www.example.com/"/>
<meta property="og:image" content="https://www.example.com/path_to_image_display_in_snipet.png"/>
<meta property="og:image:width" content="400"/>
<meta property="og:image:height" content="400"/>
<meta property="og:type" content="product"/>
<meta property="og:site_name" content="ExAmple.com"/>


