what do they want?

lucy24

9:23 pm on Jan 29, 2026 (gmt 0)

Granted, not so much an ID question but--my favorite subject--a robot psychology question:

For many months now, I've noticed a particular robotic behavior: a cluster of requests for some random page (html only), in the range of 5-20 in rapid succession, all from different IP and UA. I tend to doubt any kind of DDoS exploit, as there would be more of them, closer together, likely resulting in a 429* code. Best guess: infected human machines, as most come from broadband IP ranges all over the world--with a slightly higher proportion of countries that I don't ordinarily see much of--with the occasional colo/server thrown into the mix. Since there is no unifying feature, no distinctive headers, all I can do is temporarily block the IP (generally /24 for 3 months), with the happy result that at least half of any given cluster gets a 403.

Question: What the ### do they want? Why isn't it enough to request a page just once?

* The 429 response only started showing up in logs a couple of years ago, probably as a byproduct of one of the host's periodic server changes. I've never asked, but I think they are true 429, “too many requests”, as would happen if you bombarded a lot of different sites living on the same server.
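The pattern described above (several distinct IPs asking for the same page in rapid succession) can be picked out of logs mechanically. The following is a minimal sketch, assuming the access log has already been parsed into (timestamp, ip, path) tuples. The function name, the 60-second window, and the five-IP threshold are illustrative assumptions, not anything taken from the post.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_clusters(entries, window=timedelta(seconds=60), min_ips=5):
    """Return (path, first_hit, last_hit, ips) for each burst in which at
    least `min_ips` distinct IPs request the same path within `window`
    of the first hit of the burst."""
    by_path = defaultdict(list)
    for ts, ip, path in entries:
        by_path[path].append((ts, ip))

    clusters = []
    for path, hits in by_path.items():
        hits.sort()  # order by timestamp
        i = 0
        while i < len(hits):
            start = hits[i][0]
            j = i
            ips = set()
            # Collect every hit that falls inside the window opened at `start`.
            while j < len(hits) and hits[j][0] - start <= window:
                ips.add(hits[j][1])
                j += 1
            if len(ips) >= min_ips:
                clusters.append((path, start, hits[j - 1][0], sorted(ips)))
                i = j  # continue after this burst
            else:
                i += 1
    return clusters
```

The returned IP list could then feed whatever temporary block is already in use, such as the /24 block mentioned above.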

lucy24

4:40 pm on Mar 23, 2026 (gmt 0)

Is it possible we are talking at cross purposes?

Header logging, like analytics, is included in the footer of each page--including, by design, any error documents, so the only time I see headers for an image request is if it's been blocked.

Checking whether any given header exists at all, or what its value is, does not require a php script; it is easily done with SetEnvIf directives in htaccess.

The only issue is whether headers come in a particular order, or a particular combination. In order to use this information for access control, I would have to begin by rewriting all requests to a php script, which performs additional work before issuing a response and sending out the page. This would in turn make access logs useless, since all requests are seen by the server as 200 (request was successfully rewritten to gatekeeper.php).

Meanwhile--here comes the good news--I've found that in some cases robots are becoming too smart for their own good. For verisimilitude they often include a referer, most often a search engine or the site root. But it does not do much good to give suchandsuch as the referer for some specific URL when I, as the site owner, know for a fact that the putative referer does not actually link to the URL in question.
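The referer check described above can be sketched as a lookup against a map of which pages actually link to which. This is only an illustration under stated assumptions: the host name, the `INBOUND_LINKS` map, and the function name are all hypothetical, and a real site would have to generate the link map from its own pages.

```python
from urllib.parse import urlsplit

SITE_HOST = "example.com"  # assumption: stand-in for the real site

# Hypothetical link map: target path -> on-site paths that really link to it.
INBOUND_LINKS = {
    "/books/chapter3.html": {"/books/index.html", "/books/chapter2.html"},
}


def referer_is_plausible(request_path, referer,
                         inbound=INBOUND_LINKS, host=SITE_HOST):
    """Return False only when an on-site referer claims a link that the
    site owner knows does not exist."""
    if not referer:
        return True  # no referer given: nothing to contradict
    parts = urlsplit(referer)
    if parts.hostname not in (host, "www." + host):
        return True  # off-site referers cannot be verified from here
    ref_path = parts.path or "/"
    # Pages missing from the map are treated as having no known inbound links.
    return ref_path in inbound.get(request_path, set())
```

With this map, a request for /books/chapter3.html that gives the site root as referer is flagged, because the root is not listed as linking to that page.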

jmccormac

11:01 pm on Mar 23, 2026 (gmt 0)

Just going through the logs after an outage. Most of the scrapers are AI vermin from China and also from Microsoft ranges. That Cloud service should be permanently deepsixed and only genuine Bing IPs whitelisted. There seems to have been a shift from the Huawei/Aceville/Tencent ranges as they may be getting blocked globally.

That Code200 gang is still active and most of the IPs are blocked here. I think it was about 800 ranges. There were also a lot of mobile ISP IPs that got caught. These requests have a pattern in that they are all requesting old data (stats) on hosters. The fresh data (hosting history for domain names) is more rarely requested. The residential requests are always to deep pages with either no referrer or the request as referrer.

Just saw another scraper request from Chile as I was typing this. There is a lot of activity from LAC countries.

Regards...jmcc

blend27

3:18 pm on Mar 24, 2026 (gmt 0)

@lucy24, @all

let me throw this one out there:

Sites I code are custom CMS-based sites that are hosted on latest IIS servers, usually shared hosting with 100-ish or so other sites on the same box. Unless there is a DDoS against a specific site that is picked up by the hosting company, there is really no significant overhead in using dynamic language scripting that is connected to a DBMS with 10s of queries per page. The largest query to the DBMS is against an indexed view that scans ip_RANGES_table (1341613 rows) / left join / asn_table (77180 rows), and it costs me 0.021 seconds. The rest of the queries cost me almost 0s time-wise before the final page is spat out to the browser. The DBMS is located on a separate box at the hosting company with about 100-ish other databases, so no sweat there.

Now this is the way the request is processed, for efficiency's sake, in the following order:

There are 3 levels of software in this Firewall that are in place:

LEVEL I: IIS web.config
LEVEL II: .htaccess
LEVEL III: ColdFusion/DBMS

Level I - Highest level. web.config(with rules(URLRewrite2 for IIS) that point to maps).
At this level I filter out/drop/rewrite requests that can easily be spotted by some simple logic based on

a) KNOWN as AWS, Oracle, MSFT, TENCENT, Major Cloud Scraping providers etc.., etc.. IP Ranges are dropped outright, and I don't give a pack of flying Quality Diamond File Steel Quenched 6 Pcs Finishing Tools about it.... 0 bytes returned - no response code, can't connect.

b) ...previously blocked IPs caught and written to a map file by rules that catch, for example, anything requesting "wp-URIs" or .php-and-such in URIs*; those IPs stay in the map file for a month. On a first try, if caught, the request from the IP is rerouted (rewritten, not redirected) to another app that is not accessible via web from outside; this way main website application code/server load is not affected.

^* basic rule: pattern="^.*(config|wp-|rss|\.php|\.git|\.inf|\.asp|\.aspx|\.pl|\.cgi|-cgi|\.env|env|\.aws|lavarel).*" = rewrite to another app logic, many other rules there too. sql injections, super old UAs, UAs that don't make sense anymore/anyless, etc.....

Logic in that other application first writes to the main map file so the next request is dropped, then looks up the IP in DBMS to determine if it is a known Hosting Range and, if found, simply increments the counter of blocked requests from that range. If NOT found, it looks up AbuseDB via API (free when registered) to see if it is a Hosting IP; if it is, it blocks that IP Range entirely in DBMS (review range later), no play games here. Request Blocked, Range Blocked, ASN Flag is on Alert.

c) you name it....
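To see what the basic rule quoted above would and would not catch, the pattern can be tried against a few sample request paths. This is a rough sketch in Python rather than in URL Rewrite itself. The sample paths are invented, and case-insensitive matching is an assumption about how the rule is configured.

```python
import re

# Pattern copied verbatim from the rule quoted above.
BAD_URI = re.compile(
    r"^.*(config|wp-|rss|\.php|\.git|\.inf|\.asp|\.aspx|\.pl|\.cgi|-cgi"
    r"|\.env|env|\.aws|lavarel).*",
    re.IGNORECASE,  # assumption: the rule ignores case
)


def is_probe(path):
    """True if the request path would be rewritten to the trap app."""
    return BAD_URI.match(path) is not None


# Hypothetical request paths, written without the leading slash.
samples = ["wp-login.php", ".env", "about.html",
           "images/logo.png", "environment.html"]
flagged = [p for p in samples if is_probe(p)]
```

Because alternatives such as `env`, `rss` and `config` are bare substrings, a legitimate path like environment.html is flagged too. That is harmless on a site with no such URLs, but worth checking before reusing the pattern elsewhere.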

LEVEL II (if you got here - you good, but wait for lev3): No Headers are inspected here. Basic Rewrite/Point/Redirect request to the main app folder based on what is requested and some other blackmagicphuckery...

Notice that Main App has not seen a packet of traffic yet.

LEVEL III (Welcome to ColdFusion):

We Inspect the HTTP Headers with data from getHttpRequestData() : no good(many rules) = request is blocked and written to DBMS for later inspection.

We make a call to "Blocked IPs" in DBMS, see if it is there; seen you before these past 6 months = no cheese, logged into DBMS.

So at this point we still don't know where the IP came from (Country, Hosting, WhiteListed Bot). WhiteListed Bots (and only a select few get a soft pass, unless they are trying to get to robots.txt disallowed content). Others we do a lookup, same query by CF to DBMS. Hosting or manually blocked range? = less than 16 bytes sent back (403). Not in Country list = 206 with either a Captcha (self written, polite) or an "accesso negato in inglese". RU, KR, CN, IR, IL, SA... etc traffic is not recorded (blanked 403).

-------------------------------------------------------
All Requests at this point written to DBMS. Back end App has A ColdFusion lookup utility that can be used to filter by:

Date From/To (goes back to 2006), IP/RDNS/ASN all free form, UA (free form, 48260 to choose from that visited my sites) + UA Type (Mobile/Tablet/Desktop), Browser Resolution, Country List (incl/exclude), if Image/CSS/JS Enabled, URL Visited (also includes an OLD URL being 301ed to a new URI, including WhiteListed Bots), SITE/URI Referrer. That is for Allowed Traffic.

Blocked requests also include what bought them that status: Country, Bot Trap, BAD UA, Bad URL, Bad QueryString (sql injections and such), HTTP Method (no HEAD, no PUT, ok I will stop...), Guest Book Spam/Form Submission without previously established session, Self Referrer OR Referrer is from http:// version of the same site OR let's say they pretend to click on an imaginary link that is missing on a particular page (ye, that is also tracked, i keep my ship tight ;) ).

Sooooo..., IP Data, now (ever since SumGuy mentioned it, THANK YOU SumGuy) comes from IPInfo. I download the DB, parse data IPRanges by country, + separate table for ASN from the same data linking them together. Import data into DB, run a script to split it, update allowed countries, ASN Blocks (hosting ranges based on that). Before that it was a similar automated script that got the data from FTPs of Regional IP Data (ARIN and such) providers and maintained a separate list of Hosting ranges.

Now here is the Sweet spot: whenever the request comes in and whether it is blocked or allowed, ranges for that are written to a top level lookup tables, this way I ask data from those tables first, only if it is not there, or lets say ASN is coming as Blank from IPInfo tables, I would make an API Call somewhere else to determine what is what. Hundreds of a millisecond if that.
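The lookup order described above (local tables first, outside API only on a miss) can be sketched as a small cache. This is a Python illustration only, not the ColdFusion/DBMS code being described. The class name and the `slow_lookup` callable are hypothetical stand-ins for the database tables and the external API.

```python
import ipaddress


class RangeCache:
    """Answer from already-seen ranges first and fall back to a slower
    lookup (standing in for an external API call) only on a miss."""

    def __init__(self, slow_lookup):
        # slow_lookup(ip) must return (cidr_string, info) for that IP.
        self._slow_lookup = slow_lookup
        self._ranges = []  # list of (network, info) pairs seen so far

    def lookup(self, ip):
        addr = ipaddress.ip_address(ip)
        # A linear scan keeps the sketch short; a real table would be indexed.
        for net, info in self._ranges:
            if addr in net:
                return info
        cidr, info = self._slow_lookup(ip)
        self._ranges.append((ipaddress.ip_network(cidr), info))
        return info
```

Any later request from the same range is then answered locally without a second outside call.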

Database is Normalized to the Teeth. I personally hold a certification from MS for SQL Server and from Oracle from back in a day. MySQL is not my thingy.

AbuseDB FREE API calls are also full of juicy info: usageType >> Data Center/Web Hosting/Transit field info for example and Spam Confidence/Total Reports fields, I also contribute programmatically...

This is IPV4 Traffic only. Data maintained over 16 websites, a lot going on.

This code base started when Florida Update hit (hard, very) and I started slaying MFA scrapers (it is their fault, I swear on a neighbor's squeaky piglet).

I have not utilized any other statistics on applications on my sites since 2005, write it myself and do not share Visitors Data with big Boys. Customers who request outside implementation of statistics like Googs must pay more.
----------------------------------------------------------------

and then I can see it all LIVE, in the back-end App, ;)

I learned most of it in this Forum and WebmasterWorld, Thank You.

...blend27

jmccormac

11:44 am on Mar 27, 2026 (gmt 0)

Interesting setup. The RIRs are not necessarily a reliable source on the country of IPs. Some of them are redelegated and there are some operations that are apparently US owned ranges that are really Hong Kong/Chinese owned ranges. (Sheridan, Wyoming seems to be the base for a lot of them.) Also, Alibaba uses US delegated ranges. There has also been a massive uptick in vermin from Microsoft ranges in the RIPE RIR. Previously, it had been 20.bbb.ccc.ddd and 40.bbb.ccc.ddd ranges. There are also /8s like Cogentco's 38.0.0.0/8 that are leased and are widely used outside the US.

It is possible to use geofeeds for more precise geolocation. This is the URL for a composite geofeed:
[geolocatemuch.com...]

Cogentco's geofeed is on
[geofeed.cogentco.com...]

Then there are the SC and MU ranges in the RIR lists that are really used by businesses in Hong Kong and China.

I run a website IP survey for the gTLD websites each month and have it down to just 99.55% identified web hosting providers on 197.58 million websites. It is different to what some of the IP database vendors and sites do in that it correlates multinational hosters across their IP ranges and identifies the hoster (DNS) providing service for these providers.

On the blocking, I am not familiar with Windows Server and if it has a kind of iptables equivalent. Is it more efficient to use a database query to block by IP than just block at IP level?

Regards...jmcc

jmccormac

12:29 pm on Mar 27, 2026 (gmt 0)

There's also an aggressive little maggot out of Facebook on 57.141.20.0/24 and it didn't request robots.txt. It uses a meta-externalagent/1.1 UA. Looks like an AI scraper and behaves like one too.

Regards...jmcc

jmccormac

12:37 pm on Mar 27, 2026 (gmt 0)

Just on the random/pseudo-random URL requests, most large websites will only make sitemaps available to the main SEs and it could be an attempt to probe for the existence of directories and URLs. The multiple queries for the same URL suggest a botnet that is trying to evade detection. If one IP is blocked, then the others might get through. It might be interesting to see if a public sitemap with a directory blocked in robots.txt manages to detect any such bots.

Regards...jmcc

lucy24

5:35 pm on Mar 27, 2026 (gmt 0)

It might be interesting to see if a public sitemap with a directory blocked in robots.txt manages to detect any such bots.

It has occasionally occurred to me that if a given directory is disallowed in robots.txt, a certain type of robot might head straight for that directory. But this doesn't seem to be the case: malign robots don't even ask for robots.txt (except, sometimes, after a blocked request), let alone read it.

In the case of the sitemap-plus-robotstxt combo, are you thinking of some wholly imaginary directory for honeypot purposes? It wouldn't affect legitimate search engines--though they will probably start b###ing and moaning in Webmaster Tools that robots.txt prevents them from crawling suchandsuch page. In any case, I tossed my sitemap several years ago. No need for it, as everything is linked, and newest pages are linked directly from the front page.

:: detour to raw logs ::

Huh. G and B continue requesting sitemap.xml every few days, under the head of I’m always a cockeyed optimist. And then there's this, from just over a year ago*, from a blocked range:

GET /sitemap.xml HTTP/1.1" 403 6896 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36" 
GET /sitemap.xml
GET /wp-sitemap.xml
GET /wp-sitemap.xml
GET /sitemap.xml.gz
GET /sitemap.xml.gz
GET /sitemaps/sitemap.xml
GET /sitemaps/sitemap.xml

and so on, for a total of 34 requests (17 pairs). I assume the pairs are for with/without www and they didn't bother waiting for a redirect. Who knew there were so many possible places to hide a sitemap. In the time it took them to make all those requests, they could have spidered the site.

<topic drift>
After instituting a 403 involving extensionless URLs (which I don't use), I was ecstatic to see that none of these newly blocked requests was followed by a request with final /. That means they really were following redirects, rather than requesting with-and-without directory slash all at once. Trala.
</td>

* Because I can't see, I first thought it was from earlier this month, and couldn't understand why I couldn't find the requests in headers (logged via the 403 page).

blend27

5:49 pm on Mar 27, 2026 (gmt 0)

-- On the blocking, I am not familiar with Windows Server and if it has a kind of iptables equivalent. Is it more efficient to use a database query to block by IP than just block at IP level? --

The iptables equivalent I use is in web.config; this will just drop the request altogether, not even in the IIS log file, before it gets to the App or DB:

<security>
 <ipSecurity allowUnlisted="true" denyAction="AbortRequest">
  <!-- Facebook AI Scraper -->
  <add ipAddress="57.141.4.0" subnetMask="255.255.255.0" /> <!-- 57.141.4.0/24 meta-externalagent -->
  <add ipAddress="57.141.5.0" subnetMask="255.255.255.0" /> <!-- 57.141.5.0/24 - 12/08/2024 -->
  <add ipAddress="57.141.7.0" subnetMask="255.255.255.0" /> <!-- 57.141.7.0/24 - 12/08/2024 -->
 </ipSecurity>
</security>

database query: A site with 10 million views per month is probably in need of a dedicated server anyway. A site with 100,000 unique visitors will do just fine on a shared server. Again, properly optimized DB = no sweat, be it MySQL, SQL Server or any other DBMS.

lucy24

5:55 pm on Mar 27, 2026 (gmt 0)

There's also an aggressive little maggot out of Facebook on 57.141.20.0/24 and it didn't request robots.txt. It uses a meta-externalagent/1.1 UA. Looks like an AI scraper and behaves like one too.

Oh, I've seen this occasionally. So far never from 57.141.20, but from elsewhere in 57.141 with that UA. Since I've flagged the meta-thingy as bad_agent, they get the minimalist robots.txt that Disallows everyone. Or rather, they would, if their requests--whether for individual pages or robots.txt--didn't one and all receive a 429. (Mystifying. I thought this was generated by the server, meaning they would never reach my site and hence my error documents, but I do find logged headers with 429 status.) On closer inspection, looks as if they requested robots.txt (only) sporadically from April to July of last year, after which they rested for a few months and then came back with rare requests for isolated pages ... and just one robots.txt, from a few weeks ago.

SumGuy

3:41 am on Mar 29, 2026 (gmt 0)

> Huh. G and B continue requesting sitemap.xml every few days.

Google has never requested my sitemap file. If it's asking for yours, I'd like to know why.

I began to see requests for sitemap.xml in December 2020 from 13.66.139.x (search.msn.com). At the time I didn't have that file on my site. They stopped in Feb 2021. In May of 2021 I saw a handful of requests from 34.134.236.83 (googleusercontent.com) but I still didn't have that file ready to go.

In May of 2022 I began to get continuous requests for sitemap.xml from 40.77.167.x and 157.55.39.x (bingbot) and they were pretty consistent until August when I finally did create a sitemap file. By October 207.46.13.x (search.msn.com) started asking for it. At some point in 2023 52.167.144.x (also search.msn.com) started asking for it.

In June of 2025 216.73.216.x (Anthropic) started asking for it.

I get lots of hits from 74.7.175.x (Open AI) and it does ask for robots but it's never asked for sitemap.

Googlebot has never asked for sitemap.xml. If google is requesting your sitemap.xml file I'd like to hear more about that.

Yandex, Internet Archive, DuckDuckGo have never asked for sitemap.xml.

I've never seen any evidence that bot-scrapers have clued in and asked for (and then followed) sitemap.xml, which to me indicates that there really is no phenomenon of bots that just want to vacuum up a site in a systematic / complete way (at least not using the sitemap plan).

And lastly I've never seen hits I could attribute to Grok / Xai or DeepSeek at all.

SumGuy

3:56 am on Mar 29, 2026 (gmt 0)

The current header accept-language strings that I'm now rejecting:

es-419,es;q=0.9
es-ES,es;q=0.9

It's like they're playing a game.

lucy24

5:34 am on Mar 29, 2026 (gmt 0)

Google has never requested my sitemap file. If it's asking for yours, I'd like to know why.

Possibly because I used to have one, and noted its existence in robots.txt * the way they tell you to. (Hm. Wonder if it would make a difference if I returned an explicit 410, as I would with any other removed content.)

But I'm surprised they don't periodically ask for sitemap.xml in any case. They ask for ads.txt daily, even though I have never had such a file and have given them no reason to think I would.

es-419,es;q=0.9
es-ES,es;q=0.9

:: further run to logged headers ::
Gosh, what a lot of them. I'll have to take a closer look. What, if anything, does 419 mean? I don't see it with anything but Spanish.

* Here I took a quick look at robots.txt to make sure I hadn't simply forgotten to remove that line. Nope, all good.

jmccormac

11:29 pm on Apr 12, 2026 (gmt 0)

Sorry about the delay in replying. Main Windows box got banjaxed by a power outage and wouldn't boot. Spent some days recovering the HD and installing the data on another Windows box. The HD crapped out on that one after about 3 days. Hopefully, this one will survive longer.

@Lucy24 Yep, it was a bot trap idea.

@Blend27 Impressive setup. Database queries might be the way to go. At the moment, I am using a detect and deepsix the range approach. There are quite a few ASNs blocked and some of the scraper networks have also been blocked. This is partially due to the work on the IP/Website surveys that identify the range owners. There are some really troublesome scraper networks spread over a number of RIRs. A lot of the Singapore activity has now shifted to MSFT and those ranges. There is a group of MSFT ranges in RIPE that all use the netname "cloud".

Bing's sitemap downloading software should have a 304 option because it stupidly puts a lot of pressure on servers by downloading the same files in parallel. It was necessary to temporarily block it a few years ago by feeding a 304 result for each request.

Will reply in detail later today.

Regards...jmcc

Kendo

3:28 am on Apr 13, 2026 (gmt 0)

If anyone can block an ASN on a Windows Server, can you please share your method?

jmccormac

12:58 pm on Apr 13, 2026 (gmt 0)

Not sure about Windows Server. The standard way to grab the ranges is to use a whois query (in Linux) for the ASN. This alternative may be useful. It is a daily update of all ASNs in JSON and text format:

[github.com...]
It is a tarred and gzipped archive (.tar.gz)

Today's number of ASNs is 96,928 and the archive expands to 1.5GB. There are text format ranges for IPv4 and IPv6. The owner is hashed out at the top of each file. I am not sure about scripting in Windows Powershell but importing those lists into a MSFT db table should be easy and then an SQL query could convert the ranges to a minimum IP and a maximum IP. That could produce an ASN, range, minimum IP and maximum IP. It might be necessary to convert the IPs to a numerical format. That way it would be possible to query the db or ASN table to see if an IP (numerical format) was in a particular ASN. There is an IP range aggregator program in Linux that could be used to reduce sequential IP ranges to larger IP ranges though I am not sure if there is an equivalent for Windows Server. It is also possible to check individual ASNs according to the page.
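The conversion and aggregation steps described above are platform-independent and can be sketched with Python's standard `ipaddress` module, which runs on Windows as well as Linux. The function names are illustrative, and the sample ranges are the 57.141.x.0/24 blocks quoted earlier in the thread. The archive's own file layout is not assumed here.

```python
import ipaddress


def range_bounds(cidr):
    """Return (min_ip, max_ip) of a CIDR range as integers."""
    net = ipaddress.ip_network(cidr, strict=False)
    return int(net.network_address), int(net.broadcast_address)


def ip_in_ranges(ip, bounds):
    """True if the IP falls inside any (min_ip, max_ip) pair."""
    n = int(ipaddress.ip_address(ip))
    return any(lo <= n <= hi for lo, hi in bounds)


def aggregate(cidrs):
    """Merge adjacent or overlapping ranges into larger ones.
    All ranges must be the same IP version (IPv4 only here)."""
    nets = [ipaddress.ip_network(c, strict=False) for c in cidrs]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]


blocked = ["57.141.4.0/24", "57.141.5.0/24", "57.141.7.0/24"]
merged = aggregate(blocked)  # the two adjacent /24s merge into a /23
bounds = [range_bounds(c) for c in merged]
```

The integer pairs map directly onto minimum-IP and maximum-IP columns in a database table, so a membership test becomes a simple BETWEEN-style comparison.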

Regards...jmcc

blend27

9:15 pm on Apr 14, 2026 (gmt 0)

--RE: github link.

ON WindowsServer IIS/MSSQL Platforms Only.

Thanks for the link @jmccormac. I just downloaded that .tar file, unzipped it to local drive: Each ASN has its own folder(96,934 numerically funky named Folders) that comes with 3 files. I looked at aggregated.json that has the info for ASN/Ranges/Type/etc in each folder.....

Wrote me ColdFusion Script(Keeping Up with what I know) to save time later on.. to generate a file with a list of .json files in those folders.

Wrote me a script in TSQL for BulkInsert that reads that list of .JSON files and inserted contents of each into DB Table. Parsing contents of those JSON does not make sense, but writing a query that has this line: "hosting" does, which will be ASN as Hosting Range at the time..... just my 2 pence, I mean I can go into JSON Datatype in MSSQL, but why, right?

^^but importing those lists into a MSFT db table should be easy and then an SQL query
right, right...Right?

If you go with the non-JSON files from there, you get ASN + ranges, but you are missing the COUNTRY flag.

RIPE before, now ARIN - Each /24 can have A Funky /28 or a clunky /32 within that one that was 'prostituted = prosti menya babushka' to somewhere in 'say Pe.dodorovka Village' as in NOT-WHAT 'could be used to reduce sequential IP ranges to larger IP ranges', and just like that, Carrie was out of options, even in the City.... ...

It IS a good thing, seeing fragmented /24's makes you think what is the next move -1 (Tap Left != taP Right => Ya'Legs are crooked => 'prosti menya babushka')

It is a big mess, it is.

Fun Though!

jmccormac

9:48 pm on Apr 14, 2026 (gmt 0)

RIPE can be a bit of a nightmare with subnets. There are some in ARIN as well but not to the same extent from what I can see. Some of the ISPs have a combination of city ranges and also some much smaller customer ranges (below /24).

The country codes are also "variable" in terms of accuracy. With the data from the RIRs (main lists), the minimum for the lists is a /24. The lists also provide SC or MU as the countries for one of the largest IP leasing operations. Most of the ranges are used in HK rather than the African region. Then there are large ranges where the ownership is disputed. In terms of /24s, some ranges can have a handful of websites and others can have the characteristics of load balancers. Sites that are self-hosted on broadband are a lot more common than people realise and where there is a lack of infrastructure or expensive infrastructure, they can be quite common while the web developers and hosting companies host outside that country's IP space.

I think that I posted a link above to a site that aggregates the geofeed data. It is in CSV format:

[geolocatemuch.com...]

It is quite good on the country level ranges. It is generally a case of using multiple sources (RIRs, geofeeds and other sources) to geolocate IPs. Even Google gets it wrong at times.

Regards...jmcc

jmccormac

10:33 pm on Apr 14, 2026 (gmt 0)

That JSON metadata is nice. I am not sure about the reliability of the categories. I checked one of the ISPs here and it shows as "hosting". Checked another ISP and it was properly categorised. The country applies to the ASN rather than to the IPs. Just checked an Indian hoster's IP and the ASN is a US one. A lookup on the IP shows it as an IN IP. Might be an interesting exercise to take the ASN data and correlate it with the actual IP range country data.

Regards...jmcc

jmccormac

10:36 pm on Apr 15, 2026 (gmt 0)

I've just been dealing with a scraper botnet. It is using ISP/mobile proxies rather than datacentres/web hosters. This can make it somewhat more difficult to block. Amazon is still hammering away in the background but gets nothing.

The data that they are attempting to scrape is chronological (directories by year). There are links to other pages within the webpages. The scrapers do not individually follow these links. They seem to be distributed over the botnet. And there may be more than one botnet in operation. The Chinese/Singapore AI scrapers are also active but have been blocked at an IP level.

Some of these botnet members only try to scrape a single page. That might be why they are so difficult to detect. There may be a split between the detection of links to follow and the actual links that the botnet members attempt to scrape. Each may have a set of URLs and not necessarily on the same site. The curious attempts at adding variables to the end of an URL may be an attempt by the scraper to try to get links on sites that may not follow an obviously logical (alphabetical, topical, or chronological) link architecture. Some of them may be attempts at link injection or similar.

Regards...jmcc

lucy24

4:50 am on Apr 16, 2026 (gmt 0)

adding variables to the end of an URL

On the plus side, this can provide another criterion for blocking.

jmccormac

9:57 pm on Apr 16, 2026 (gmt 0)

It would help identify the probes which may then have the URLs fed to the botnet members.

Did a rough calculation of the number of IPs involved in that attack.
13/April/2026 48,242
14/April/2026 22,395
15/April/2026 11,389

Some problem countries and ranges were blocked over the days. There were a lot of Amazon IPs and they got nothing. The majority were residential and mobile ISP IPs. Each seemed to be only requesting a single page (direct URL).

Regards...jmcc

shawnb61

7:45 pm on Apr 27, 2026 (gmt 0)

If google is requesting your sitemap.xml file I'd like to hear more about that.

Google requests my sitemaps once a day. I have them entered in Google Search Console, that may be why. I don't know if they'd go looking for them otherwise.

Is it USING the sitemap? I believe so, when they are caught up... They seem to fall behind a lot in recent months, where if you browse around GSC, you might see errors along the lines of 'discovered not indexed', or, messages that things are not current 'Due to internal issues...'. Or if you look at the 'last update' date, e.g., in the upper right corner, it's a few days old.

When they're current, they are on top of the latest entries in the sitemap. I can find that day's topics in google search. And given the relatively low volume of requests, I believe they're using the sitemap.

When they fall behind, though, not so much... The pages they request are kinda random, sometimes they request pages that are years old, where you'd think they'd just use the latest entries in the sitemap. It would clearly help them be more efficient, but oddly, that's the very time they don't appear to use it...

lucy24

8:07 pm on Apr 27, 2026 (gmt 0)

I have them entered in Google Search Console, that may be why.

This sent me scurrying to GSC to check my Sitemap settings. Nope, just a Submit box with fill-in-the-blank after example.com. They also tell me the sitemap was last read in 2022, which would seem to be enough time to establish that it ain't there no more. (If it were a removed page I would return a 410, which does slow them down eventually, but for a sitemap I can't be bothered.)

The pages they request are kinda random, sometimes they request pages that are years old, where you'd think they'd just use the latest entries in the sitemap.

Obligatory reminder: A sitemap doesn't mean “request only these pages”, it means “be sure not to overlook these pages”. Once a search engine has learned of an URL, they will keep requesting it periodically for years to come, or until the heat-death of the universe, whichever comes first.

shawnb61

8:31 pm on Apr 27, 2026 (gmt 0)

Understood. There's a weird broken logic to it, though, that has me rubbing my chin... When things are going well, they clearly use it & get right to the latest. It's all quick & efficient.

One would think, when they KNOW they are falling behind, they'd go the most efficient route. Spend their resources wisely. But that doesn't appear to be the case. It gets LESS efficient, and no longer appears to use the set of recent updates easily found in the sitemap.

It doesn't bug me that it occasionally looks at old stuff, however randomly... It bugs me that it stops looking at the new stuff, readily available...

shawnb61

8:36 pm on Apr 27, 2026 (gmt 0)

They also tell me the sitemap was last read in 2022, which would seem to be enough time to establish that it ain't there no more.

There is a 'Remove Sitemap' function hidden up there behind the 3 dots in the upper right corner. Next to the 'Open Sitemap' link, only when you've drilled into the sitemap. Might be worth a try if you missed it.

blend27

1:56 pm on Apr 28, 2026 (gmt 0)

Another Nice Combo:

"Accept-Language": "zh,zh;q=0.8",
....
"user-agent": "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.6261.94 Mobile Safari/537.36",
....
"sec-ch-ua-platform": "\"macOS\"",....

Not the language, although,... but the Platform VS UA..

158.222.113.137 << AS19084 — ColoUp <<< Marked as Privacy

Oakey Then....

lucy24

4:26 pm on Apr 28, 2026 (gmt 0)

You had me at zh, though I make an exception for zh-tw.

:: cursory visit to logged headers ::

Oh, right, I must have dealt with this at some point, because I’ve got this pair

SetEnvIf Sec-Ch-Ua-Platform "Linux" lying_linux
BrowserMatch Linux !lying_linux

with a satisfying 50,000-odd lockouts in the past five-plus months (looks like I introduced it in December). I haven't yet got equivalents for Mac and Windows. In fact, further spot-checking suggests that nobody but robots--blocked on other grounds--claims to be "macOS", so it would be redundant. Trala.
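The same platform-versus-UA comparison can be extended beyond Linux. Below is a minimal sketch in Python rather than htaccess, assuming header names arrive as lowercase dictionary keys. The `PLATFORM_TOKENS` table and the function name are illustrative assumptions, and unknown platform values are deliberately left alone rather than flagged.

```python
# Assumed mapping: Sec-CH-UA-Platform value -> substrings expected in the UA.
PLATFORM_TOKENS = {
    "macOS": ("Macintosh", "Mac OS X"),
    "Windows": ("Windows NT",),
    "Linux": ("Linux",),
    "Android": ("Android",),
}


def platform_mismatch(headers):
    """True when the claimed Sec-CH-UA-Platform contradicts the User-Agent."""
    ua = headers.get("user-agent", "")
    hint = headers.get("sec-ch-ua-platform", "").strip('"')
    tokens = PLATFORM_TOKENS.get(hint)
    if tokens is None:
        return False  # header absent or platform unknown: no verdict
    return not any(token in ua for token in tokens)
```

Applied to the combination quoted earlier in the thread (an Android Nexus 5X user-agent sent with a "macOS" platform hint), the function returns True.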

And yup, somewhere along the line I permanently blocked 158.222.112.0/20

blend27

9:40 pm on Apr 29, 2026 (gmt 0)

somewhere along >> AS19084 >> ColoUp >> coloup.com

162.223.88.0/21
162.223.88.0 - 162.223.95.255

162.245.80.0/21
162.245.80.0 - 162.245.87.255

104.222.32.0/20
104.222.32.0 - 104.222.47.255

158.222.112.0/20
158.222.112.0 - 158.222.127.255

This 88 message thread spans 3 pages: 88
