I'm pulling new stories from a database and putting the content into an XML document in order to download as a CSV or RTF file.
But the download breaks if I have an ampersand &. To fix this I have to encode it to &amp;.
But this then displays in my downloads, which is undesirable.
First of all, am I correct that ampersands break XML?
Secondly, are there any other characters that I have to watch out for and encode? It's just a shame that I have to display the HTML entities instead of the actual symbols in the downloads.
Thanks
You need to be careful using HTML entities in XML. Here's why.
Say you use PHP's htmlentities() function to turn all your "£" into "&pound;", "©" into "&copy;", and so on.
For example:
<element>My Business &copy;</element>
They will live comfortably in the XML, and if you're outputting the XML node values on a web page, the browser will display them as £ and ©. Browsers are generally pretty nice about rendering HTML entities.
But the fragment above will not validate, because XML doesn't know what a "&copy;" is. That's an HTML entity, not an XML entity.
So if you're parsing the XML document with PHP's xml_parse(), it'll choke. The reason is the DTD: XML itself predefines entities only for the basic furious five, &amp;, &gt;, &lt;, &quot;, and &apos;. You can include those five entities in any XML document without worrying about extending the DTD.
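As a minimal sketch of that point (in Python rather than PHP, purely for illustration; the sample string and the to_xml_text helper name are made up), escaping with only the five predefined entities gives XML that any parser accepts, and the parser hands back the original characters, so the escapes never show up in the extracted data:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def to_xml_text(raw):
    """Escape only the five characters XML predefines entities for."""
    # escape() handles &, < and > by default; the two quotes are passed explicitly.
    return escape(raw, {'"': "&quot;", "'": "&apos;"})

raw = 'Fish & Chips <"Bob\'s">'  # hypothetical story title
doc = "<element>" + to_xml_text(raw) + "</element>"

# The parser decodes the entities again, so the text round-trips unchanged.
parsed = ET.fromstring(doc).text
print(doc)
print(parsed == raw)
```

If the escaped form still appears in a download, the likely cause is that the raw XML string is being written out instead of the parsed node values.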
But a generic XML document almost certainly won't have a declaration for &uuml; or &thorn;. Those are HTML entities, not XML entities. Just as HTML defines a <table> tag, it also defines what &pound; is. That is, HTML knows what a &pound; is, but XML does not.
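A quick way to see the difference, sketched with Python's standard library rather than PHP (the two sample strings and the parses helper are made up for illustration): the same parser accepts a predefined entity and rejects an HTML-only one.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """Return True if the XML parser accepts the string."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for undeclared entities, among other well-formedness errors.
        return False

print(parses("<element>Marks &amp; Spencer</element>"))  # predefined XML entity
print(parses("<element>My Business &copy;</element>"))   # HTML entity, undeclared
```

The first call should succeed and the second should fail, which is the same kind of failure xml_parse() reports.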
XML is an open language and you can declare your own entities, like &google; or &myvariable; - they're all legitimate, but they do have to be in the DTD or the parser will judge that your document is invalid.
If you're going to use HTML entities in an XML document, you need to declare all the ones you use, right at the beginning, like so:
<?xml version="1.0"?>
<!DOCTYPE rootnode [
<!ENTITY pound "&#163;">
<!ENTITY copy "&#169;">
]>
<rootnode>
<child>
<child>...
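To check that declarations of this kind really do satisfy a parser, here is a small sketch in Python (not PHP). The document string is illustrative. It declares the entities in an internal DTD subset, which is the standard XML form, and uses numeric character references as the entity values:

```python
import xml.etree.ElementTree as ET

# Entities are declared in the internal DTD subset, before the root element.
doc = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE rootnode ['
    '<!ENTITY pound "&#163;">'
    '<!ENTITY copy "&#169;">'
    ']>'
    '<rootnode><child>My Business &copy; &pound;5</child></rootnode>'
)

root = ET.fromstring(doc)
text = root.find("child").text  # entities are expanded to the real characters
print(text)
```

Once the entities are declared, the parser expands them itself, so whatever reads the node values sees the actual symbols and not the entity names.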
If you're the only one using your XML output, and you get it doing something useful without the XML being valid (i.e., with undeclared entities in it), then you may ignore all of this. But someday, someone else may want to use your data, and if it's not valid they may have problems parsing it, or transforming it with XSLT, or whatever else. I'm sure the guilt will haunt you relentlessly.
Cheers