Whilst opening the UTF-8 pages saved with Notepad, TextPad gives the following warning: "blahblah.html" contains characters that do not exist in code page 1252 ANSI - Latin I. They will be converted to the system default character if you click OK. Since I get no other option, I click OK. It seems to me that this automatically drops the UTF-8 encoding, as the Chinese characters appear as ? again. My guess is that in order to fix the damaged pages (thankfully only a few of them) I will have to remove the scrambled Chinese characters and re-save as UTF-8 from within TextPad. The save option has a file format dropdown with the options: 'no change', 'PC', 'MAC' and 'UNIX'. I am not quite clear what this actually means for the file, so could someone tell me which is the right choice? My development PC is running XP and the hosting is on a Linux server. Should I choose 'PC' or 'UNIX'? I also presume that saving under TextPad (at least using the default settings) doesn't add the BOM (which should have been discarded whilst opening the initial UTF-8 file saved from Notepad!?). Am I on the right track or did I miss something?
Regards.
If the BOM remains a problem, just break out your favorite hex editor and remove it (backup the file first!).
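If you'd rather not do it by hand in a hex editor, the same thing can be scripted. This is only a rough Python sketch, not something anyone in this thread actually used: the function names and the .bak suffix are made up, and it only looks for the UTF-8 BOM (EF BB BF) at the very start of the file.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_from_file(path: str) -> bool:
    """Rewrite the file without its BOM, keeping a .bak copy (hypothetical naming).

    Returns True if the file was changed, False if there was no BOM.
    """
    target = Path(path)
    original = target.read_bytes()
    cleaned = strip_utf8_bom(original)
    if cleaned == original:
        return False
    Path(path + ".bak").write_bytes(original)  # backup first, as advised above
    target.write_bytes(cleaned)
    return True
```

Calling strip_bom_from_file("blahblah.html") would leave the original bytes in blahblah.html.bak next to the cleaned file.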
The PC, MAC and UNIX settings are probably to do with line endings - in Windows, a line ends with \r\n and in UNIX with just \n. BTW I never really got into using UTF-8 until I switched to Ubuntu Linux, in which all text files are in UTF-8 by default. It seems that better tools and editors exist in Linux compared with Windows.
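To make the line-ending difference concrete, here is a small Python sketch (the function names are my own). It assumes the data is UTF-8 or plain ASCII, where the bytes 0D and 0A never occur inside a multi-byte character; it would not be safe on UTF-16 data. I'm also assuming the 'MAC' option means the classic Mac OS convention of a bare \r.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert PC (\\r\\n) and classic Mac (\\r) line endings to UNIX (\\n)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def to_pc_newlines(data: bytes) -> bytes:
    """Convert any of the three line-ending styles to PC (\\r\\n)."""
    # Normalise to \n first so existing \r\n pairs are not doubled up.
    return to_unix_newlines(data).replace(b"\n", b"\r\n")
```

Either style should display the same in a browser, so the choice mostly affects how the file looks in other editors.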
As I mentioned at the beginning I made the mistake of using Notepad. Never again for HTML/UTF. It's still brilliant for saving in ANSI I suppose.
Having broken all the pages on that site with Notepad, I tried TextPad. My issue with that was that I never managed to find a way to paste raw Traditional Chinese characters. I tried what was already mentioned in an earlier thread, changing to as many different fonts as my patience would allow.
Then I tried UltraEdit, which has always been my favourite text editor. I managed to remove the FF FE (ÿþ) character from the already broken HTML, but since I have no experience in HEX editing I felt uncomfortable saving the files as such. So I thought I should go through all pages (not many thankfully), load them in FF, view source, select all and paste that for a clean start. Immediately after pasting that into UltraEdit, as expected the BOM wasn't there - but after saving it, even with the no BOM option, the built-in HEX editor suggested that the BOM was back there! It was really frustrating.
I tried using Unired, again saving without BOM - back to UltraEdit's HEX editor and BOM was there. After that it was obvious. I tried the freeware XVI32 and for the same file reported by UltraEdit as carrying BOM, XVI32 showed otherwise! Final test. Pasted the parsed HTML source of a page into Notepad and saved as UTF-8. Did the same using UltraEdit/No BOM UTF-8. XVI32 reported EF BB BF at the beginning of the Notepad-saved file (as expected) and a healthy UltraEdit-saved file starting with 3C 21 (<!). Plus, with UltraEdit I can now paste Traditional Chinese characters in raw and save without it adding the BOM.
It is just so frustrating that the UltraEdit HEX editor, for some reason, wrongly sees the FF FE at the beginning of every UTF-8 file that I created even though I chose to save without the BOM. Does UltraEdit make this false assumption about the non-existent BOM because of the charset=utf-8 that is within the HTML?
Cheers.
with UltraEdit I can now paste Traditional Chinese characters in raw and save without it adding the BOM.
Is UltraEdit good for working with UTF-8 BOM-less files right off the bat, or is some additional or interim processing with another software required?
1) Start UltraEdit -> New document (creates a *.* blank document).
2) Go to the site (I use FF), view source, select all, copy and paste that into UltraEdit's blank document.
3) Save as - dialog options: 'example.html', 'all files', 'DOS', 'UTF-8 No BOM'.
4) After having saved the file according to the above, you can now paste raw Chinese characters from a site. If you tried that prior to saving as UTF-8, the characters would be pasted as ?. Not sure if it's got an option to default to UTF-8 when you hit 'new document'.
5) That will give you a UTF-8 BOM-less file from scratch and to verify that you can use XVI32 or any other HEX editor.
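For the verification in step 5, a script can do the same job as a HEX editor by printing the first few bytes. A minimal Python sketch (the helper name is hypothetical): a BOM-less HTML file should start with 3C 21 (<!), a Notepad-saved UTF-8 file with EF BB BF.

```python
def first_bytes_hex(data: bytes, count: int = 4) -> str:
    """Format the first `count` bytes the way a hex editor would show them."""
    return " ".join(f"{byte:02X}" for byte in data[:count])


# Bytes as they would sit on disk in the two cases discussed above.
bomless = "<!DOCTYPE html>".encode("utf-8")
with_bom = b"\xef\xbb\xbf" + bomless

print(first_bytes_hex(bomless))      # 3C 21 44 4F
print(first_bytes_hex(with_bom, 3))  # EF BB BF
```

For a real file you would pass in the result of open("example.html", "rb").read() instead of the sample bytes.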
Didn't change much of the defaults within the UltraEdit Preferences - I just made sure that the file-type defaults to 'DOS' (for convenience purposes). As I said that worked for me so it would be good to try the evaluation version of UltraEdit first just to make sure it does the trick for you as well. Although not practical, it proved quite handy for me that I had broken those few pages with Notepad, without backing them up first. Hope this helps and apologies if I have been ambiguous in my explanation.
Regards.
I am wondering whether there may be an issue with the FF source viewer that is adding the BOM to the files? Did you try FTPing the files and opening them directly with UltraEdit? I'm wondering whether you'd get the same results.
Regards.
Cheers.
Overall, it is easiest from a development point of view to use one consistent charset throughout the site. If you need to convert a large number of files, you can use the iconv utility present in most Linux distributions (i.e. on most *nix web servers).
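If iconv isn't to hand (on Windows, say), a bulk conversion can be roughly approximated in Python. This is a sketch under assumptions, not a replacement for iconv: the function names, the *.html pattern and the .bak suffix are all invented for illustration, and you have to know the source encoding yourself (e.g. "big5" or "iso-8859-1"). It also does not touch the charset declared in the HTML, which would still need updating.

```python
from pathlib import Path


def reencode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from the given legacy encoding and re-encode as BOM-less UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def convert_tree(root: str, source_encoding: str, pattern: str = "*.html") -> int:
    """Convert every matching file under root in place, keeping .bak copies.

    Returns the number of files converted. Raises UnicodeDecodeError if a
    file is not valid in source_encoding.
    """
    converted = 0
    for path in Path(root).rglob(pattern):
        original = path.read_bytes()
        new_bytes = reencode_to_utf8(original, source_encoding)  # may raise
        path.with_name(path.name + ".bak").write_bytes(original)
        path.write_bytes(new_bytes)
        converted += 1
    return converted
```

Something like convert_tree("site", "big5") would then walk a hypothetical site folder; try it on a copy first.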
A Byte Order Mark (BOM) in a UTF-8 file looks like this in a HEX editor: [b]EF BB BF[/b]
Above, in #:3096378 you mentioned:
I managed to remove the [b]FF FE (ÿþ)[/b] character from the already broken HTML, but since having no experience in HEX editing I felt uncomfortable saving the files as such.
Isn't [b]FF FE[/b] the UTF-16 Little Endian BOM? I'm seeing this in some UltraEdit text files via the hex editor. Were you suggesting removing that? If [b]FF FE[/b] shows up at the beginning of a UTF-8 file is that an issue? (I'm seeing this on some files that I had supposedly saved as UTF-8 without BOM.) I guess what I'm looking for is a more definitive answer as to the hex data that needs to be removed to remove BOM.
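For reference, the byte sequences in question can be told apart with a few lines of Python. This is just a sketch covering the three BOMs mentioned in this thread (the function name is made up); it ignores UTF-32, whose little-endian BOM also begins with FF FE.

```python
import codecs
from typing import Optional

# BOM byte sequences, taken from the constants in Python's codecs module.
KNOWN_BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),  # FE FF
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return the name of the BOM at the start of data, or None if there is none."""
    for bom, name in KNOWN_BOMS:
        if data.startswith(bom):
            return name
    return None
```

A file that really is BOM-less UTF-8 should give None here, whatever an editor's hex view claims.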
Is the UltraEdit hex editor all that bad? Would it be advisable to get another?
When I saved my HTML from within Notepad as UTF-8, UltraEdit reported it as EF BB BF. I parsed that using FF and copied the code from the View Source (I didn't remove the BOM using the HEX editor, as I wasn't too comfortable doing that). Then I saved it as BOM-less UTF-8 using UltraEdit. Still, UltraEdit always reported FF FE at the beginning of all HTML files I tried so far (on 2 different PCs, running different versions of UltraEdit), so I can only assume that this is a problem with UE's built-in HEX editor. Other HEX editors I tried on the same files, including XVI32 (ver 2.51) and HexEdit (ver 1.03), didn't report the FF FE on my properly-saved UTF-8 HTML, suggesting that UE's HEX editor is mistaken. UltraEdit, in my opinion, is probably one of the best editors out there, but I won't be trusting its HEX editor with regard to the BOM. Still, if you use FF's View Source, copy and paste the code into UltraEdit and save as UTF-8, no-BOM - you should be o.k. I think this is what is confusing here; UltraEdit is indeed capable of creating BOM-less files, but for some reason its HEX editor always mistakenly suggests the BOM is actually there!
So my suggestion, for a start, would be to use a different HEX editor - like XVI32 or HexEdit, which are both free. That way you can be assured that at least your HTML doesn't contain the BOM. From what I know in my limited HTML experience, the first thing that should be at the beginning of an HTML file is the DOCTYPE. Anything else could lead to numerous issues, potentially preventing a browser from reading the page properly, or maybe causing other problems with SEs. In order to be sure that doesn't happen and to be certain that the BOM isn't there, I would use a HEX editor (not UE's!) to check that the first thing you see in the file is 3C 21 44 etc etc, which translates to <!D (beginning of the DOCTYPE tag). That gives me peace of mind with the 4.01S that I use for my HTML - but I am not quite sure what the case would be for a properly served XHTML Strict - where an XML declaration often precedes the DOCTYPE.
Sorry if I couldn't be of more help - as I too am only starting to get to grips with it!
Regards.
Isn't [b]FF FE[/b] the UTF-16 Little Endian BOM?
Yes, this is a UTF-16 BOM - however that doesn't stop some editors adding it when supposedly saving as UTF-8 (or more often "Unicode" without specifying which version - like in Notepad).
The difference between UTF-16 and UTF-8 is evident when dealing with US-ASCII characters, which are encoded as single-byte ASCII-compatible in UTF-8 but not UTF-16.
If you encounter a UTF-16 LE BOM then you need to verify how US-ASCII characters are encoded. If they are single-byte ASCII-compatible characters, then your document is UTF-8 with an incorrect BOM - if you use hex editor to remove the BOM the document should function as UTF-8 when served as such.
If characters from the US-ASCII range are double-byte encoded (again a hex editor is your friend), then you need to use iconv, which is downloadable or available in most Linux distributions (sorry I don't know what you can use on Windows) to convert the file before editing the BOM. Sometimes you can get ISO-8859-1 (or other legacy encoded) documents preceded by a UTF-16 BOM, usually due to encoding mishaps. In this case, you can remove the BOM with a hex editor, then use iconv to re-encode as UTF-8 - assuming you can figure out what legacy encoding was used. You should never attempt to use UTF-16 on the web as the presence of a BOM is mandatory, and user agent support is limited.
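The three cases above could be sketched in Python roughly as follows. Treat it as an illustration only: the function name and the 256-byte sample size are arbitrary choices of mine, and the "double-byte" check is a simple heuristic (a zero byte near the start, which UTF-8 text would not normally contain). The legacy-encoding case is deliberately left as an error, since the script cannot guess which encoding was used.

```python
import codecs


def repair_ff_fe(data: bytes) -> bytes:
    """Return BOM-less UTF-8 for data that starts with an FF FE BOM.

    - ASCII stored as two bytes (zero bytes present): real UTF-16 LE, so convert.
    - Single-byte ASCII and valid UTF-8: just drop the wrong BOM.
    - Anything else (e.g. ISO-8859-1): raises UnicodeDecodeError.
    """
    if not data.startswith(codecs.BOM_UTF16_LE):
        return data  # nothing to repair
    body = data[len(codecs.BOM_UTF16_LE):]
    if b"\x00" in body[:256]:
        # Heuristic: zero bytes suggest ASCII characters are double-byte encoded.
        return body.decode("utf-16-le").encode("utf-8")
    body.decode("utf-8")  # validation only; raises if the body is a legacy encoding
    return body
```

If it raises, you are in the legacy-encoding case and still need to work out the original charset before re-encoding.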
In every case, it is vital (obviously) to keep backups of the original file. :)
If you encounter a UTF-16 LE BOM then you need to verify how US-ASCII characters are encoded.
[b]FF FE[/b]. So I grabbed a copy of XVI32, and none of the files shows either a UTF-8 or UTF-16 BOM. Just to be on the safe side I tried out HexEdit as well. Same results. Of the two I liked the XVI32 interface better, but they both return the same results for all files tested.
Conclusion: The UltraEdit hex editor is screwy? I'm going to have to check their forums to see if this issue has been raised. It's a shame as I like the rest of the editor functions.
If you encounter a UTF-16 LE BOM then you need to verify how US-ASCII characters are encoded. If they are single-byte ASCII-compatible characters, then your document is UTF-8 with an incorrect BOM - if you use hex editor to remove the BOM the document should function as UTF-8 when served as such.
As encyclo pointed out, Windows does not natively support UTF-8, unlike Linux. So using UltraEdit in Windows, for example, when a UTF-8 format file is loaded it is internally converted to Unicode format for editing and is converted back to UTF-8 format when written to disk. Because of this we are seeing the above issue with the BOM. Therefore the HEX display is accurately representing the state of the file at the time it is being edited.
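A quick way to picture this in Python, purely as an illustration: I'm assuming here that the "Unicode format" used internally is UTF-16 LE with a BOM, which would match the FF FE people are seeing, but that is a guess about UltraEdit's internals rather than anything confirmed in this thread.

```python
import codecs

text = "<!DOCTYPE html>"

# What is written to disk when saving as UTF-8 without BOM.
on_disk = text.encode("utf-8")

# Assumed in-memory form while editing: UTF-16 LE preceded by its BOM.
in_memory = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(on_disk[:4].hex(" ").upper())    # 3C 21 44 4F
print(in_memory[:4].hex(" ").upper())  # FF FE 3C 00
```

That would explain why an external hex editor reading the saved file shows 3C 21 while the built-in one shows FF FE for the same document.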
Up until recently, I used HTML-Kit for developing pages for my site but it is a bit of a problem now that I moved to UTF-8 encoding - which is not supported by HTML-Kit. Does anyone know of any plugin that offers such functionality?
Regards.