It might be interesting to see if a public sitemap with a directory blocked in robots.txt manages to detect any such bots.
It has occasionally occurred to me that if a given directory is disallowed in robots.txt, a certain type of robot might head straight for that directory. But this doesn't seem to be the case: malign robots don't even ask for robots.txt (except, sometimes, after a blocked request), let alone read it.
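If anyone wants to check the "never asks for robots.txt" observation against their own logs, here is a rough Python sketch. It assumes Apache combined log format, and the function name `clients_skipping_robots` is made up for illustration; adjust the regex to whatever your server actually writes.

```python
import re

# Assumption: Apache "combined" log format. Only the client IP and the
# request path are extracted; everything else on the line is ignored.
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[A-Z]+ (?P<path>\S+)')


def clients_skipping_robots(lines):
    """Return a sorted list of client IPs that made requests
    but never requested /robots.txt (hypothetical helper)."""
    requested = {}  # ip -> set of requested paths, query strings stripped
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            path = match.group("path").split("?", 1)[0]
            requested.setdefault(match.group("ip"), set()).add(path)
    return sorted(ip for ip, paths in requested.items()
                  if "/robots.txt" not in paths)
```

This only says who skipped robots.txt within the lines you feed it; a polite crawler that fetched it the day before would show up too, so use a long enough window.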
GET /sitemap.xml HTTP/1.1" 403 6896 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
GET /sitemap.xml
GET /wp-sitemap.xml
GET /wp-sitemap.xml
GET /sitemap.xml.gz
GET /sitemap.xml.gz
GET /sitemaps/sitemap.xml
GET /sitemaps/sitemap.xml
and so on, for a total of 34 requests (17 pairs). I assume the pairs are for with/without www and they didn't bother waiting for a redirect. Who knew there were so many possible places to hide a sitemap. In the time it took them to make all those requests, they could have spidered the site.
<security>
<ipSecurity allowUnlisted="true" denyAction="AbortRequest">
<!-- Facebook AI Scraper -->
<add ipAddress="57.141.4.0" subnetMask="255.255.255.0" /><!-- 57.141.4.0/24 meta-externalagent -->
<add ipAddress="57.141.5.0" subnetMask="255.255.255.0" /><!-- 57.141.5.0/24 - 12/08/2024 -->
<add ipAddress="57.141.7.0" subnetMask="255.255.255.0" /><!-- 57.141.7.0/24 - 12/08/2024 -->
</ipSecurity>
</security>
There's also an aggressive little maggot out of Facebook on 57.141.20.0/24 and it didn't request robots.txt. It uses a meta-externalagent/1.1 UA. Looks like an AI scraper and behaves like one too.
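For testing an address against the ranges before touching the server config, a small Python sketch using the standard `ipaddress` module may help. The three networks are the ones in the ipSecurity block above; the name `is_blocked` is just an illustration.

```python
import ipaddress

# The three /24 ranges listed in the ipSecurity block above.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("57.141.4.0/24", "57.141.5.0/24", "57.141.7.0/24")
]


def is_blocked(ip):
    """True if the address falls inside any listed range (hypothetical helper)."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BLOCKED_NETWORKS)
```

Note that an address in 57.141.20.0/24 is not matched by these three entries; covering it would need a fourth range.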
Google has never requested my sitemap file. If it's asking for yours, I'd like to know why.
Possibly because I used to have one, and noted its existence in robots.txt the way they tell you to. (Hm. Wonder if it would make a difference if I returned an explicit 410, as I would with any other removed content.)
es-419,es;q=0.9
:: further run to logged headers ::
es-ES,es;q=0.9
If Google is requesting your sitemap.xml file, I'd like to hear more about that.
I have them entered in Google Search Console, that may be why.
This sent me scurrying to GSC to check my Sitemap settings. Nope, just a Submit box with fill-in-the-blank after example.com. They also tell me the sitemap was last read in 2022, which would seem to be enough time to establish that it ain't there no more. (If it were a removed page I would return a 410, which does slow them down eventually, but for a sitemap I can't be bothered.)
The pages they request are kinda random, sometimes they request pages that are years old, where you'd think they'd just use the latest entries in the sitemap.
Obligatory reminder: A sitemap doesn't mean “request only these pages”, it means “be sure not to overlook these pages”. Once a search engine has learned of an URL, they will keep requesting it periodically for years to come, or until the heat-death of the universe, whichever comes first.
They also tell me the sitemap was last read in 2022, which would seem to be enough time to establish that it ain't there no more.
"Accept-Language": "zh,zh;q=0.8",
....
"user-agent": "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.6261.94 Mobile Safari/537.36",
....
"sec-ch-ua-platform": "\"macOS\"",.... SetEnvIf Sec-Ch-Ua-Platform "Linux" lying_linux
BrowserMatch Linux !lying_linux
with a satisfying 50,000-odd lockouts in the past five-plus months (looks like I introduced it in December). I haven't yet got equivalents for Mac and Windows. In fact, further spot-checking suggests that nobody but robots--blocked on other grounds--claims to be "macOS", so it would be redundant. Trala.
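To see how such a rule would behave on logged headers before deploying it, here is a small Python sketch of the same idea. Only the Linux pairing comes from the rule above; the macOS and Windows pairings, and the name `platform_mismatch`, are my assumptions, based on the "Macintosh" and "Windows NT" tokens that desktop Chrome normally puts in its User-Agent. Unlike the Apache rule, which does a substring match on the header, this compares the claimed platform exactly after stripping its quotes.

```python
# Claimed platform (Sec-CH-UA-Platform) -> token expected in the User-Agent.
# Only the Linux pair mirrors the lying_linux rule; the other two are assumptions.
EXPECTED_UA_TOKEN = {
    "Linux": "Linux",
    "macOS": "Macintosh",
    "Windows": "Windows NT",
}


def platform_mismatch(headers):
    """True when the claimed platform is one we know and the
    User-Agent lacks the matching token (hypothetical helper)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    # The header value arrives wrapped in double quotes, e.g. "macOS".
    claimed = lowered.get("sec-ch-ua-platform", "").strip().strip('"')
    user_agent = lowered.get("user-agent", "")
    token = EXPECTED_UA_TOKEN.get(claimed)
    return token is not None and token not in user_agent
```

Fed the logged headers quoted earlier (an Android Nexus 5X User-Agent alongside a "macOS" platform claim), this returns True; a request with no Sec-CH-UA-Platform header at all is left alone.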