Document scraping from unknown bot - Crawler, Spider, and User Agent ID forum at WebmasterWorld - WebmasterWorld

Document scraping from unknown bot

UA is Chrome/124, IP sources all Windstream ISP

SumGuy

1:31 am on Jan 2, 2025 (gmt 0)

Top Contributors Of The Month

Today I'm seeing a bit of a rash of scraping of my PDF files. The pattern is a HEAD (file) followed by a GET (file), with no referrer and no request for robots.txt. The user-agent is consistent, being this:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36

The language being en-US,en;q=0.9

Looking back, I see one earlier example of this, which looks to be the very first one, on Dec 28.

I first spotted this UA back in April 2024, in normal web hits. I don't know how long that particular UA was current, but I've seen legit hits from it as recently as August and none after. NOTE: There is another UA that is similar:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0

Note the Edg/124.0.0.0 on the end. That is not a suspicious UA.

The IP's I'm seeing this from are:

103.136.71.0/24
23.188.0.0/16
23.189.0.0/16
64.79.253.0/24
167.253.100.0/24

So yes, I'm seeing this from various IPs in those CIDR blocks. All of them belong to Windstream (AS7029), a US-based ISP (East coast / NY?).
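A minimal sketch of checking a client address against the ranges listed above, using Python's standard `ipaddress` module. The function name `in_suspect_range` is made up for illustration, and the ranges are only the ones quoted in this post, so they may well go stale.

```python
import ipaddress

# CIDR ranges quoted in the post above (AS7029, per the poster).
SUSPECT_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "103.136.71.0/24",
        "23.188.0.0/16",
        "23.189.0.0/16",
        "64.79.253.0/24",
        "167.253.100.0/24",
    )
]


def in_suspect_range(ip: str) -> bool:
    """Return True if the address falls inside any of the listed ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in SUSPECT_NETWORKS)
```

A range match alone says nothing about intent; in the thread it is only used together with the UA and the request pattern.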

My server is now handing out an "I think you're a bot" page when any request comes in that matches that exact UA.
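As an illustration only (the actual server rule isn't shown in this thread), an exact-match test for that UA could look like the sketch below. `is_suspect_ua` is a hypothetical helper. Comparing the whole string rather than a substring is what keeps the similar `Edg/124.0.0.0` variant from being caught.

```python
# The exact UA string quoted earlier in the post.
SUSPECT_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)


def is_suspect_ua(user_agent: str) -> bool:
    """True only when the UA equals the suspect string exactly."""
    return user_agent.strip() == SUSPECT_UA
```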

lucy24

5:12 am on Jan 2, 2025 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

:: obligatory detour to logs ::

My server is now handing out an "I think you're a bot" page when any request comes in that matches that exact UA.

Your server’s judgement is probably right. The last human I find with this exact UA is from early October; since then it’s been nothing but robots. Some requesting pages alone, some images alone, and a striking red flag with

POST /test.hello?%ADd+cgi.force_redirect%3d0+%ADd+cgi.redirect_status_env+%ADd+allow_url_include%3d1+%ADd+auto_prepend_file%3dphp://input

I don’t have enough pdfs to make anything statistically meaningful, but it is worth noting that the robots get either pages or images, never anything else.

Does your “I think you’re a bot” page offer an option for well-meaning humans who got there by mistake? I’ve got a couple of pages that do this for various reasons, though I can’t remember any human ever following up ... possibly because the apparent offenders are all from non-English-speaking countries.

Kendo

12:16 pm on Jan 2, 2025 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

It is interesting that they are scraping PDFs. A lot of authors publish in PDF. We have a client in Vietnam who publishes hard copies, and customers who purchase those books can also read them online. One customer keeps buying his new books just to get the PDFs and sell them online as their own.

SumGuy

2:29 pm on Jan 2, 2025 (gmt 0)

@lucy24, are you seeing these also from Windstream IP's, and also a HEAD then GET?

My "I think you're a robot" page tells them that they're using a proxy or VPN (so turn that off) or a suspicious or non-standard browser (so try a different browser) to access the site. I mention the proxy/VPN because I suspect that some of them might use custom UA's and that's what I'm seeing. But so far I have no idea if there have ever been human browsers behind any of these hits.

That page is not indexed or linked from any page on the site, so the legit search bots shouldn't be able to find or see it. I probably have 2-dozen UA rules now that trigger that page.

Also - I'm IP-blocking probably 1/4 to 1/3 of all IPv4 IP addresses in my router. A lot of that is country-based, and a lot is data-center based (all the usual suspects). So hits from those IP's aren't making it to my web server. It's going to be residential or business IP's from G7 countries that are getting the robots page.

lucy24

5:27 pm on Jan 2, 2025 (gmt 0)

@lucy24, are you seeing these also from Windstream IP's, and also a HEAD then GET?

No, I think there's more than one botrunner using this UA. (fwiw: At any given time, I have around half a dozen UAs flagged-and-blocked as “botnet_agent”--outdated but not absurdly antiquated humanoids--and then after some months they stop.) Conversely, on rare occasions I do see the HEAD+GET sequence--which is enough to mark a robot, because what human browser uses HEAD?--but not from this specific UA.
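One way to look for the HEAD+GET sequence in parsed log entries is sketched below. It assumes the entries are already in time order and reduced to (IP, method, path). The `Hit` type and `head_then_get` function are hypothetical names, and the addresses in the usage lines come from documentation ranges, not from real logs.

```python
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    ip: str
    method: str
    path: str


def head_then_get(hits: Iterable[Hit]) -> set[tuple[str, str]]:
    """Return (ip, path) pairs where a HEAD was later followed by a GET."""
    pending = set()  # HEAD seen for this (ip, path), no GET yet
    flagged = set()
    for hit in hits:
        key = (hit.ip, hit.path)
        if hit.method == "HEAD":
            pending.add(key)
        elif hit.method == "GET" and key in pending:
            pending.discard(key)
            flagged.add(key)
    return flagged


# Usage with made-up entries:
sample = [
    Hit("203.0.113.7", "HEAD", "/docs/a.pdf"),
    Hit("198.51.100.4", "GET", "/docs/a.pdf"),
    Hit("203.0.113.7", "GET", "/docs/a.pdf"),
]
suspects = head_then_get(sample)
```

The sketch does not limit how far apart the HEAD and GET may be; a real check would probably want a time window.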

Lesson learned: There is no such thing as a genuinely unique robot :(

SumGuy

12:05 am on Jan 3, 2025 (gmt 0)

For the past 24 hours I've had a URL rewrite rule in place covering this UA, and it kicked in 3 times (all from AS7029 Windstream IPs). Each time the bot performed a HEAD, but it got a 200 response for the "you are a bot" HTML file instead of the file it was requesting. It then went away without trying to GET the intended file.

lucy24

3:30 am on Jan 3, 2025 (gmt 0)

Did you happen to notice if it was doing this with brand-new requests, as opposed to files it had requested before? I could envision a robot being programmed to check periodically if a previously retrieved file was still there, and then if the HEAD gets a 200 it doesn’t bother to re-download.

Except, hm, that seems more like a 304.

In any case, good to know that the robot isn’t getting anything new out of you. Tralala.

SumGuy

2:44 am on Jan 4, 2025 (gmt 0)

In my logs I see a small number of GETs (usually from Googlebot, maybe also Bing) where the bytes transferred is zero and the status code is 304. I don't think I ever see a HEAD from Google or Bing.

What's currently happening with this new bot is that it asks for PDF files but my server rewrites the request to (file).html. In the logs I see the path/(file).html file with a status code of 200, but the bytes transferred is zero.

This happened 6 more times today, again from AS7029. A pattern is forming for the requesting IP's.

lucy24

6:31 am on Jan 4, 2025 (gmt 0)

In my logs I see a small number of GETs (usually from Googlebot, maybe also Bing) where the bytes transferred is zero and the status code is 304.

Yup. Those are static files when the request was

:: detour to rfc ::

Oh, right, the If-Modified-Since header. If the resource is unchanged--as is usually the case with images and, for that matter, pdfs--the server answers with a 304 and no body, so the request is effectively treated like a HEAD although it was worded as a GET. In my logs I can go back and pinpoint when I first added php content (mostly around 2012, 2013) because that's when I stop seeing 304 responses to requests for html files. It's obviously efficient for things like search engines that just need to know that they have the most recent version of a file, without doing the extra work of getting it again. But this relies on having a vast database that keeps track of when you last fetched a given file.
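The decision behind a 304 can be sketched roughly as follows. This is a simplified illustration, not how any particular server implements it: it looks only at If-Modified-Since (ignoring ETag / If-None-Match), and `status_for_conditional_get` is a made-up name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def status_for_conditional_get(
    last_modified: datetime, if_modified_since: Optional[str]
) -> int:
    """Pick 200 or 304 for a GET; last_modified must be timezone-aware."""
    if not if_modified_since:
        return 200  # plain GET: send the full file
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable date: behave as if the header were absent
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= since:
        return 304  # unchanged: headers only, zero body bytes
    return 200
```

This matches what shows up in the logs: a GET line with status 304 and zero bytes transferred.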

The ordinary robot, like your current pdf-gobbler, will instead first send a HEAD--“Does this file exist?”--and then if it does exist, follow up with a GET. I guess on the larger scale this is more efficient than racking up 404s when the file turns out not to exist. (I took a quick look at logs. The response to a HEAD is just a couple hundred bytes, as opposed to at least ten times that for anything that includes a 404 page.)

SumGuy

1:35 am on Jan 5, 2025 (gmt 0)

Another bot UA that I've just seen today that is focused on document scraping:

Mozilla/5.0 (iPhone; CPU iPhone OS 17_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Mobile/15E148 Safari/604.1

What I notice is that it was probably a legit UA back in April / May last year, and it always went first to HTTPS (never http). This bot version does try HTTP first.

FYI - the string fragment "15E148" is not, by itself, diagnostic. It first appears in Dec 2020 and continues to the present day in legit iPhone UA's.

This bot is directly requesting (GET) PDF's (on my site at least), no referer, then requests the favicon.ico file. No HEAD, no robots.txt.

One example today is from 129.222.85.194 (SpaceX). Spur id's it as a proxy.

SumGuy

2:37 pm on Jan 23, 2025 (gmt 0)

@reddighamburg -
My experience is overwhelmingly that bots (scrapers, proxy agents and other rogue actors) are incredibly inflexible and static and sadly obvious in terms of the user-agents they use.

When I see suspicious behavior (or rather, clearly bot/scraper behavior) and it's not coming from a data-center type of IP source (because otherwise I just add the entire ASN to my IP blocking list), I will look at the UA for interesting or peculiar string fragments, or check what version numbers it's using for chrome or firefox (looking for old or never-used version numbers) and I'll scan my back-logs for these fingerprints, looking for examples of false positive hits from actual humans.
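A rough sketch of that kind of back-log scan, assuming Apache-style "combined" log lines. The regex, the `hits_per_month` name, and the idea of bucketing by month are illustrative assumptions and not taken from the poster's actual setup. Counting hits per month for a UA fragment makes it easier to see when a UA stopped showing up from real browsers.

```python
import re
from collections import Counter
from datetime import datetime
from typing import Iterable

# Assumed format: host ident user [time] "request" status bytes "referer" "ua"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def hits_per_month(lines: Iterable[str], fragment: str) -> dict[str, int]:
    """Count log lines whose UA contains `fragment`, keyed by 'YYYY-MM'."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m or fragment not in m.group("ua"):
            continue  # unparseable line or a different UA
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        counts[ts.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))
```

The per-month counts still have to be read by eye against referers and request patterns to separate humans from bots; the sketch only narrows down where to look.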

In the current case that started this thread, the bot is peculiar in that it is using what I think is a residential / commercial ISP (Windstream AS7029) and about a dozen particular /24 IP ranges that I've begun blocking. This bot is the most sophisticated I've seen in terms of using legit user-agents: legit in that they were once current within the past year, but judging from my log files not very popular, which makes them easily detectable.