Bing Publishes Sitemaps Best Practice, Including Large Sites
Interestingly, some sites these days are large… really large… with millions to billions of URLs. Sitemap index files or sitemap files can link up to 50,000 links, so with one sitemap index file, you can list 50,000 x 50,000 links = 2,500,000,000 links. If you have more than 2.5 billion links… think first if you really need so many links on your site. In general, search engines will not crawl and index all of that. It’s highly preferable that you link only to the most relevant web pages to make sure that at least these relevant web pages are discovered, crawled and indexed. Just in case, if you have more than 2.5 billion links, you can use 2 sitemap index files, or you can use a sitemap index file linking to sitemap index files, offering now up to 125 trillion links: so far that’s still definitely more than the number of fake profiles on some social sites, so you’ll be covered.
Bing Publishes Sitemaps Best Practice, Including Large Sites [blogs.bing.com]
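The arithmetic in the quoted post can be checked with a few lines of Python. This is only a sketch of the calculation as the post describes it; the names `MAX_ENTRIES` and `max_urls` are made up for illustration, and the 50,000 limit is the figure given in the quote.

```python
# Per-file limit taken from the quoted post: up to 50,000 entries
# in a sitemap file or in a sitemap index file.
MAX_ENTRIES = 50_000

def max_urls(index_levels: int) -> int:
    """URLs reachable with `index_levels` layers of index files above the sitemaps."""
    return MAX_ENTRIES ** (index_levels + 1)

print(f"{max_urls(1):,}")  # one sitemap index: 2,500,000,000
print(f"{max_urls(2):,}")  # an index of indexes: 125,000,000,000,000
```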
Best practices if you want to enable sitemaps
Sitemaps are a waste of time. They won't help you with indexing.
Smart sitemap downloading checks that and does not hammer sites by downloading the same damned sitemaps over and over again.
Everyone uses Google. No one uses Bing.
I don't believe even Amazon has a billion unique pages.
Some sites are quite deep and could have that many pages. My own has the hosting history of gTLD domains back to 2000 and the stats for over 5.5 million hosters. It would have about 400 million pages on the domain name pages alone. Amazon has pages for books in print, eBooks and books out of print. It also has product pages so it is possible that it is that deep. Facebook would also, theoretically, have large numbers of pages.
The main problem with extra-large sitemaps is that search engines are often not able to discover all links in them as it takes time to download all these sitemaps each day.
Operators of large websites typically have a well-defined sitemaps strategy that prioritises changed content and additions over unchanged sitemaps. The sitemap index files are used, in effect, to signal to search engines which sitemaps have changed. Thus, after the initial download of sitemaps, the search engine only needs to download and process the changed sitemaps.
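As a rough illustration of how an index file can signal changes, here is a minimal Python sketch that compares the lastmod values in a sitemap index with the ones recorded on a previous visit. It is not how any particular search engine works; the function names, the `previous` dictionary and the sitemap2 URL are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemap and sitemap index files.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def read_index(xml_text):
    """Map each sitemap <loc> to its <lastmod> string (None if absent)."""
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.findall("sm:sitemap", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        if loc is None:
            continue  # skip malformed entries with no location
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        result[loc.strip()] = lastmod.strip() if lastmod else None
    return result

def changed_sitemaps(previous, current):
    """Sitemaps that are new, or whose lastmod differs from the last visit."""
    return [loc for loc, lastmod in current.items()
            if loc not in previous or previous[loc] != lastmod]

# Hypothetical index with two entries; only sitemap2 has a new lastmod.
INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2004-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
    <lastmod>2005-01-01</lastmod>
  </sitemap>
</sitemapindex>"""

# lastmod values assumed to have been stored after the previous download.
previous = {
    "http://www.example.com/sitemap1.xml.gz": "2004-10-01T18:23:17+00:00",
    "http://www.example.com/sitemap2.xml.gz": "2004-12-01",
}

print(changed_sitemaps(previous, read_index(INDEX)))
```

Comparing the lastmod strings for equality, rather than parsing them as dates, keeps the sketch short and sidesteps the mix of date-only and full timestamp formats.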
Search engines cannot download thousands of sitemaps in a few seconds or minutes to avoid over crawling web sites; the total size of sitemap XML files can reach more than 100 Giga-Bytes.
A Social Science number. Sounds impressive but it is not based on reality. When you have a large site with large numbers of sitemaps, you count bytes.
Between the time we download the sitemaps index file to discover sitemaps files URLs, and the time we downloaded these sitemap files, these sitemaps may have expired or be over-written.
These are the important things in a sitemap file:
<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>
<sitemap>
<loc>http://www.example.com/sitemap1.xml.gz</loc>
<lastmod>2004-10-01T18:23:17+00:00</lastmod>
</sitemap>
Additionally, search engines don’t download sitemaps at a specific time of the day; they are often not in sync with the web site's sitemap generation process.
The whole concept of lastmod time is crucial for large websites.
Having fixed names for sitemaps files does not often solve the issue as files, and so URLs listed, can be overwritten during the download process.
Huh? One of the key aspects of building large sitemaps is to split the pages into sitemap files and maintain them in those files. That way there is continuity, and new content gets included in new files where necessary. The sitemaps grow in sync with the architecture of the website.
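One way to picture that kind of continuity is a fixed mapping from page to sitemap file, so a page never moves between files and new pages spill into new files. The sketch below assumes pages carry sequential integer IDs starting at 0, which is an assumption for the example and not something stated in the thread; the function names are hypothetical, and the file naming simply follows the sitemap1.xml.gz pattern shown earlier.

```python
# Limit per sitemap file, as quoted from the Bing post.
URLS_PER_FILE = 50_000

def sitemap_file_for(page_id: int) -> str:
    """Return the sitemap file a page always lives in, based on its ID."""
    return f"sitemap{page_id // URLS_PER_FILE + 1}.xml.gz"

def files_to_regenerate(changed_page_ids):
    """Only the files containing changed or new pages need rebuilding."""
    return sorted({sitemap_file_for(pid) for pid in changed_page_ids})

print(sitemap_file_for(0))        # sitemap1.xml.gz
print(sitemap_file_for(50_000))   # sitemap2.xml.gz
print(files_to_regenerate([10, 20, 120_000]))
```

With a mapping like this, only the files returned by `files_to_regenerate` would need a new lastmod in the sitemap index; the rest stay untouched.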