XML and UTF-8 BOM. [Was Re: Using Template Toolkit and UTF-8]

Aaron Crane perl at aaroncrane.co.uk
Thu Jan 19 14:06:50 GMT 2006


Matt Sergeant writes:
> On Thu, 19 Jan 2006, Aaron Crane wrote:
> > There wasn't really meant to be any such thing as a "UTF-8 BOM", and
> > there are situations in which it's harmful.  (It's not clear that XML
> > documents are well-formed if their first three bytes are 0xef 0xbb 0xbf
> > and they contain an XML declaration, for example.)
> 
> Not so. You can even read the XML::SAX::PurePerl code for processing BOMs 
> which looks for this before checking for XML content. It's even talked 
> about in the XML spec, IIRC.

You're right; it's in the current XML 1.0 spec.  That surprised me, so I
checked a little further.

http://www.w3.org/TR/2004/REC-xml-20040204/#charencoding (XML 1.0 Third
Edition) says:

  "Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY
  begin with the Byte Order Mark"

But http://www.w3.org/TR/2000/REC-xml-20001006#charencoding (XML 1.0
Second Edition) says only:

  "Entities encoded in UTF-16 must begin with the Byte Order Mark"

So it looks like I'm a little out of date on the UTF-8 BOM issue.
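
For anyone who wants to check for those three bytes by hand, here is a
minimal sketch in Python.  The function name split_utf8_bom and the
sample document are made up for illustration; this is not how
XML::SAX::PurePerl does it, just the same idea of looking for 0xef 0xbb
0xbf before looking for the XML declaration.

```python
import codecs


def split_utf8_bom(data: bytes):
    """Return (had_bom, rest), where rest is data without a leading UTF-8 BOM.

    Hypothetical helper for illustration; codecs.BOM_UTF8 is the
    three-byte sequence 0xef 0xbb 0xbf from the standard library.
    """
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


# Made-up sample: a UTF-8 document with a BOM in front of the XML declaration.
doc = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?><r/>'
had_bom, rest = split_utf8_bom(doc)
```

If all you need is decoded text, Python's "utf-8-sig" codec should do
the same job: it drops a leading BOM when one is present and otherwise
behaves like plain "utf-8".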

-- 
Aaron Crane

