Forum Moderators: goodroi


sitemap.xml vs sitemaps.xml vs sitemap.txt vs sitemap_index.xml

why is bingbot asking for all of those?

         

SumGuy

1:20 am on Aug 20, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



To make my website interact with bingbot more effectively and efficiently, I recently started using the Bing Webmaster Tools console. As part of that, I created sitemap.xml, which I've noticed bingbot requesting in the past, and now it can have it.

In addition to sitemap.xml, I've just noticed that bingbot is (today) asking for

/sitemap.txt
/sitemap_index.xml
/sitemaps.xml
/sitemap.xml.gz

Looking back through the logs, I note that bingbot has asked for that cluster or sequence of files about 43 times over the past couple of years, the vast majority happening just in the past couple of months, mostly before I created sitemap.xml.

And unless I've botched something, my logs indicate that requests for sitemap.anything (which only ever come from a bingbot IP) are historically very rare and recent (only past 2 or 3 years) and have really ramped up starting May this year. I even see a single request for website-sitemap.xml (last year).
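A log check like the one described above could be sketched roughly as follows. This is only an illustration: it assumes the access log is in the common "combined" format, and the names `count_sitemap_requests`, `LOG_LINE` and `SITEMAP_PATH` are made up for this example. It matches on the user-agent string only, which anyone can fake, so it does not confirm that a request really came from a bingbot IP.

```python
import re
from collections import Counter

# Assumed "combined" log format:
# ip - - [time] "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Paths whose last segment contains "sitemap",
# e.g. /sitemap.txt, /sitemap_index.xml, /website-sitemap.xml
SITEMAP_PATH = re.compile(r'/[^/?]*sitemap[^/?]*(?:\?.*)?$', re.IGNORECASE)


def count_sitemap_requests(lines, agent_substring="bingbot"):
    """Count sitemap-like requests per (path, status) for one user-agent substring."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # line is not in the assumed format
        if agent_substring.lower() not in match["agent"].lower():
            continue  # some other visitor
        if SITEMAP_PATH.search(match["path"]):
            counts[(match["path"], match["status"])] += 1
    return counts


if __name__ == "__main__":
    # "access.log" is a placeholder path; point it at your own log file.
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for (path, status), n in sorted(count_sitemap_requests(log).items()):
            print(f"{n:5d}  {status}  {path}")
```

Grouping by status as well as path shows at a glance which variants are being answered with a 200 and which with a 404.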

I can understand that sitemap.xml.gz is just a compressed version of sitemap.xml.

But I don't know whether sitemap.txt, sitemaps.xml and sitemap.xml serve different purposes (and hence could contain different content), or whether bing is just covering all file-name possibilities and is essentially looking for a single file (i.e. sitemap.xml).

Bing is finding sitemap.xml on my site now, but is still asking for the others. Why?

phranque

2:39 am on Aug 20, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



i would assume bingbot is trying all the most commonly used sitemap filenames in hopes of discovering additional content.
if you serve a sitemap.xml it would be quite common for the crawler to request the compressed version just in case.
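For anyone wondering how these variants relate: under the sitemaps.org protocol, sitemap.xml is a `urlset` document, a text sitemap is simply one URL per line, and the .gz file is the same content gzip-compressed. A minimal sketch of producing all three from one URL list follows. The function names are invented for this example, and it says nothing about what bingbot actually expects at each filename.

```python
import gzip
from xml.sax.saxutils import escape

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_xml(urls):
    """Return a minimal XML sitemap listing the given URLs."""
    # escape() turns &, < and > into entities, as XML requires
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n{entries}</urlset>\n'
    )


def build_sitemap_txt(urls):
    """Return a plain-text sitemap: one URL per line."""
    return "".join(f"{u}\n" for u in urls)


def gzip_bytes(text):
    """Return the UTF-8 encoded text, gzip-compressed (for sitemap.xml.gz)."""
    return gzip.compress(text.encode("utf-8"))


if __name__ == "__main__":
    urls = ["https://www.example.com/", "https://www.example.com/page?a=1&b=2"]
    print(build_sitemap_xml(urls))
    print(build_sitemap_txt(urls))
```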

lucy24

4:18 am on Aug 20, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is the sitemap named in robots.txt? The syntax is something like
Sitemap: https://www.example.com/sitemap.xml
Not all search engines know to get the name there, but bingbot ought to.

:: quick detour to check logs of two sites, one using sitemap.xml and one using sitemap.txt, to confirm that bing only asks for the one named in robots.txt ::
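To see which sitemap URLs a robots.txt file actually advertises, Python's standard `urllib.robotparser` can read the `Sitemap:` lines (the `site_maps()` method needs Python 3.8 or later). This is a small sketch; the helper name `sitemaps_from_robots` is made up here, and it only shows what the file declares, not how any particular crawler will react to it.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt):
    """Return the list of Sitemap: URLs declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present
    return parser.site_maps() or []


if __name__ == "__main__":
    example = (
        "User-agent: *\n"
        "Disallow:\n"
        "\n"
        "Sitemap: https://www.example.com/sitemap.xml\n"
    )
    print(sitemaps_from_robots(example))
```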

SumGuy

11:40 am on Aug 20, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



Interesting. I will put that Sitemap line in my robots file. Maybe bingbot will stop asking for the other variants after that?

Do other search bots ask for the sitemap file if it is referenced in robots.txt? I should check this. I don't think googlebot ever asks for a sitemap, but maybe it will if it's in the robots file?

Edit: ok, it will:

[developers.google.com...]

SumGuy

12:04 pm on Aug 20, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



If you don't want your sitemap file showing up in search-engine results, is the following (from several years ago) still valid?

[webmasters.stackexchange.com...]

X-Robots-Tag: noindex

I assume you put that noindex line in the sitemap file?

not2easy

1:10 pm on Aug 20, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



It is an HTTP response header, so you set it in the server configuration rather than adding it to the document. Google explains how to set it for Apache (in .htaccess or httpd.conf) and for NGINX here: [developers.google.com...]

For a single specific file such as sitemap.xml you would use
# the htaccess file must be in the directory of the matched file.
<Files "sitemap.xml">
Header set X-Robots-Tag "noindex"
</Files>
in the .htaccess file for Apache.
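Once a rule like the one above is in place, one way to confirm the header is really being sent is to request the file and look at the response headers. A rough sketch using only the Python standard library follows. The helper names are invented for this example, and `has_noindex` only recognises a plain `noindex` directive, not bot-specific forms such as `googlebot: noindex`.

```python
import urllib.request


def has_noindex(header_values):
    """True if any X-Robots-Tag header value contains a plain 'noindex' directive."""
    for value in header_values:
        # One header may carry several comma-separated directives
        for directive in value.split(","):
            if directive.strip().lower() == "noindex":
                return True
    return False


def fetch_x_robots_tag(url):
    """Send a HEAD request and return all X-Robots-Tag header values (may be empty)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get_all("X-Robots-Tag") or []


if __name__ == "__main__":
    # Placeholder URL; replace with your own sitemap location.
    values = fetch_x_robots_tag("https://www.example.com/sitemap.xml")
    print(values, "-> noindex" if has_noindex(values) else "-> no noindex header")
```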

lucy24

5:05 pm on Aug 20, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Do other search bots ask for the sitemap file if it is referenced in robots.txt?"
I took a closer look at logs for one site that used* sitemap.txt instead of the more common sitemap.xml. Logs suggest that, yes, most mainstream robots will focus on what they find named in robots.txt. The most visible exceptions are BLEXbot and, amusingly, something calling itself "ultimate_sitemap_parser/0.5". (I'd never heard of the latter, probably because it got a steady diet of 403s.)


* “Used” in the past tense because earlier this year I decided sitemaps aren’t worth the trouble--if humans can find everything, so can search engines--and simply deleted them from all sites.

Sgt_Kickaxe

9:01 pm on Nov 23, 2022 (gmt 0)



I don't think search bots care much for known URLs in sitemaps.

They probably appreciate finding URLs they didn't know about and all of the "last updated" information if it's not on the page, though.

Because of caching mistakes, some sites publish articles that don't show up anywhere except the sitemap until the cache is refreshed.

Still, if you're on top of such issues, the sitemap isn't likely to be very helpful. Now that you've submitted it, let us know if Bing keeps looking for the other locations after it's been crawled.

soluiher3

8:56 am on Jan 9, 2023 (gmt 0)



Useful information

phranque

10:20 am on Jan 9, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



welcome to WebmasterWorld [webmasterworld.com], soluiher3!