Does sitemap.xml format require spaces be replaced with %20 ?

Bing Webmaster Tools sitemap processing

SumGuy

12:43 am on Aug 11, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



I set up a Bing Webmaster Tools account and downloaded a third-party Windows program (G-Mapper Sitemap Generator). G-Mapper created sitemap.xml when I pointed it at my site. My site has many PDF files, and most of them have spaces in the file name.

The server logs indicate that G-Mapper replaced the spaces with %20 when requesting the files, but in the resulting XML file it did not write the file names with %20; it just kept the spaces. I placed the XML file in the site root folder.

I then had Bing Webmaster console access the XML file and process it. I note that in my server logs there were many 404 errors because Bing was taking the spaces out of the file names (and not replacing them with %20).

If my site is navigated with a standard browser, all these PDF files are linked and visible and clickable and downloadable with no 404 errors.

Before I started this effort with the Webmaster Tools and the sitemap file, googlebot and bingbot (during their regular crawls of the site) would replace spaces with %20 (as indicated in the logs), so I never saw 404s from them. But this Bing webmaster thing doesn't behave that way. It seems to need spaces in file names replaced with %20 in the sitemap file.

So I'm wondering just what exactly is the accepted practice for handling file names with spaces? I can find nothing authoritative on whether space characters should be replaced with %20 in HTML page code or in the sitemap.xml file.

lucy24

2:11 am on Aug 11, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



what exactly is the accepted practice for handling file-names with spaces?
Avoid, avoid, avoid. But I suppose it's too late for that now.

tangor

2:34 am on Aug 11, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Second lucy24 ... avoid spaces in filenames/urls at all costs. If you don't like _ use - or my personal favorite: makeitallone. As for sitemap standards, I can't really help you. Never used one, never will use one. See no need. The bots find everything just fine without a roadmap, spaces included.

While chatting, also standardize case in filenames. "This" is different than "this", for example.

phranque

3:37 am on Aug 11, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



this is called URL encoding.
it is completely normal and browsers and servers are designed to handle percent-encoded urls properly since it has been part of the standard since forever.

a better practice would be to avoid delimiters such as a blank/space character within your file name (i use hyphens instead), but given that you are already here...

https://www.ietf.org/rfc/rfc3986.html#page-12
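
For illustration only, here is a minimal Python sketch of how a sitemap entry could be percent-encoded. The helper names `encode_url` and `sitemap_entry` and the example.com URLs are made up for this example and are not taken from G-Mapper or Bing. It assumes only the path needs encoding, and that any %20 already present should be left alone rather than encoded a second time.

```python
from urllib.parse import quote, urlsplit, urlunsplit
from xml.sax.saxutils import escape


def encode_url(url: str) -> str:
    """Percent-encode the path part of a URL (hypothetical helper)."""
    parts = urlsplit(url)
    # safe="/%" keeps path separators intact and leaves an existing
    # %20 alone, so already-encoded URLs are not double-encoded.
    path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


def sitemap_entry(url: str) -> str:
    """Build one <url> element; escape() handles &, < and > for XML."""
    return f"<url><loc>{escape(encode_url(url))}</loc></url>"


print(encode_url("https://example.com/docs/My File.pdf"))
# → https://example.com/docs/My%20File.pdf
```

Whether a given crawler then requests the encoded URL correctly is a separate question; this only shows what the encoded form of the entry looks like.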

SumGuy

1:16 am on Aug 20, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



I created another sitemap.xml file where all the spaces in file names were replaced by %20 and I uploaded that to the bing webmaster tools thing. During the next few days, nothing changed. I was still seeing a ton of 404 errors from bingbot because it had removed all spaces from the file names and was not asking for any files where I had put in %20. So I then wrote a batch file that copied all the files in question to duplicate files but with spaces removed. Bingbot immediately found those files and so no more 404 errors. What a crock this whole exercise has been.
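
The batch file itself is not reproduced here. As a rough Python sketch of the same idea (not the original script), the function below copies every file whose name contains a space to a sibling file with the spaces removed. The name `copy_without_spaces` is hypothetical, and the sketch assumes all the files sit in a single folder and that an existing space-free file should never be overwritten.

```python
import shutil
from pathlib import Path
from typing import List


def copy_without_spaces(folder: str) -> List[str]:
    """Duplicate files with spaces in their names under space-free names.

    Returns the names of the copies that were created.
    """
    created = []
    # sorted() takes a snapshot of the listing first, so the copies made
    # inside the loop are not picked up again during iteration.
    for src in sorted(Path(folder).iterdir()):
        if src.is_file() and " " in src.name:
            dst = src.with_name(src.name.replace(" ", ""))
            if not dst.exists():  # do not clobber an existing file
                shutil.copy2(src, dst)  # copy2 also keeps timestamps
                created.append(dst.name)
    return created
```

The originals stay in place, so links that still use the spaced (or %20) names keep working alongside the duplicates.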

phranque

2:33 am on Aug 20, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



i would complain to whoever decided delimiter characters in file names was a good idea.

lucy24

4:13 am on Aug 20, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Alternative explanation: This may actually not be a sitemap issue at all. In recent years, bingbot has got into the habit of inventing variants of filenames and asking for them over and over, even though you have done nothing to make them think their version of the name exists. For example, if I have files called chapA.html and chapB.html, which have never gone by any other casing, bingbot will repeatedly ask for chapa.html and chapb.html.

phranque

5:04 am on Aug 20, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



For example, if I have files called chapA.html and chapB.html, which have never gone by any other casing, bingbot will repeatedly ask for chapa.html and chapb.html.

this is the pot calling the kettle black here.
(or some similar type of recognition metaphor)

99.9% of the url casing problems on the net are caused by the case-insensitivity of FAT file systems and their spawn.

lucy24

4:54 pm on Aug 20, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



When a particular behavior is only manifested by the bingbot, it is safe to assume it is the bingbot’s error. (They are also very fond of attaching some other site's paths to my hostname. Interestingly, they've linked me up with one or more Norwegian sites.)