XML and UTF-8 BOM. [Was Re: Using Template Toolkit and UTF-8]
Aaron Crane
perl at aaroncrane.co.uk
Thu Jan 19 14:06:50 GMT 2006
Matt Sergeant writes:
> On Thu, 19 Jan 2006, Aaron Crane wrote:
> > There wasn't really meant to be any such thing as a "UTF-8 BOM", and
> > there are situations in which it's harmful. (It's not clear that XML
> > documents are well-formed if their first three bytes are 0xef 0xbb 0xbf
> > and they contain an XML declaration, for example.)
>
> Not so. You can even read the XML::SAX::PurePerl code for processing BOMs
> which looks for this before checking for XML content. It's even talked
> about in the XML spec, IIRC.

You're right; it's in the current XML 1.0 spec. That surprised me, so I
checked a little further.

http://www.w3.org/TR/2004/REC-xml-20040204/#charencoding (XML 1.0 Third
Edition) says:

  "Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY
  begin with the Byte Order Mark"

But http://www.w3.org/TR/2000/REC-xml-20001006#charencoding (XML 1.0
Second Edition) says only:

  "Entities encoded in UTF-16 must begin with the Byte Order Mark"

So it looks like I'm a little out of date on the UTF-8 BOM issue.
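
For what it's worth, the check a parser would need here is small. Below is a rough Python sketch of it, assuming the input is available as a byte string. The function names split_utf8_bom and has_xml_declaration are made up for illustration; this is not how XML::SAX::PurePerl or any other parser actually does it.

```python
import codecs

# The three bytes 0xef 0xbb 0xbf discussed above.
UTF8_BOM = codecs.BOM_UTF8


def split_utf8_bom(data: bytes):
    """Return (had_bom, rest), where rest is data without a leading UTF-8 BOM.

    Hypothetical helper for illustration only.
    """
    if data.startswith(UTF8_BOM):
        return True, data[len(UTF8_BOM):]
    return False, data


def has_xml_declaration(data: bytes) -> bool:
    """Report whether an XML declaration follows the optional UTF-8 BOM.

    Hypothetical helper; it only looks at the first few bytes and does
    not validate the declaration itself.
    """
    _had_bom, rest = split_utf8_bom(data)
    return rest.startswith(b"<?xml")
```

Under the Third Edition wording, a document that begins with the BOM and then an XML declaration would pass has_xml_declaration, which matches the "MAY begin with the Byte Order Mark" reading.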
--
Aaron Crane