[OT] xml encoding

Dirk Koopman djk at tobit.co.uk
Fri Jan 6 15:40:11 GMT 2006


I am trying to coerce libxml2 into storing and printing "binary" data.
Could someone help my understanding a bit here? Take the following small
chunk of XML, which is part of a much bigger and otherwise well-formed XML
document.

   <PASSWORD>rs&#16;&#30;&#25;*  &#6;</PASSWORD>

This, very nearly, represents the data that I require. 

What I am doing is taking some fields that contain binary data (a very
small percentage of the whole gamut of fields that are to be output) and
building up a libxml2 doc tree in memory. That all works just fine. The
input data is guaranteed to be UTF-8, either because I convert the
characters above 127 into UTF-8, or because the data is converted to
character entities like &#16; (or &#x10;, tried both). On output (as
UTF-8) for this field I get:

  <PASSWORD>rs^P^^^Y*  ^F</PASSWORD>

But feeding that to any XML parser fails at the '^P' after
'<PASSWORD>rs'.
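
The failure can be reproduced outside libxml2. What follows is a minimal
sketch, assuming Python's standard xml.etree.ElementTree (expat-based) as a
stand-in parser; it is not the libxml2 code in question, and the helper
name parses() is made up for illustration. It tries the raw control
characters, the character-reference form, and a version with the control
characters removed:

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the stdlib (expat-based) parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# ^P, ^^, ^Y and ^F are the control characters 0x10, 0x1E, 0x19 and 0x06.
raw_control = "<PASSWORD>rs\x10\x1e\x19*  \x06</PASSWORD>"
char_refs = "<PASSWORD>rs&#16;&#30;&#25;*  &#6;</PASSWORD>"
plain = "<PASSWORD>rs*  </PASSWORD>"

for label, doc in [("raw", raw_control), ("refs", char_refs), ("plain", plain)]:
    print(label, parses(doc))
```

With an expat-based parser, both the raw form and the &#16; form should be
rejected, and only the control-free version accepted. That points at the
characters themselves rather than at the UTF-8 encoding: XML 1.0 permits
only tab, LF and CR below 32, whether written raw or as a character
reference.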

What understanding am I missing? Why is the above not well formed? It is
UTF-8. If necessary, how do I force characters less than 32 to be output
as &#99; (or &#x99;)? 
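
For reference, the XML 1.0 "Char" production lists which code points a
document may contain at all. Here is a small sketch of that check in
Python (again a stand-in, not libxml2; the function names are hypothetical)
that reports which characters of the field fall outside it:

```python
def is_xml10_char(ch):
    """True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def illegal_chars(text):
    """Return (index, hex code point) for each character XML 1.0 forbids."""
    return [(i, hex(ord(ch))) for i, ch in enumerate(text) if not is_xml10_char(ch)]


# The PASSWORD field from the example, with the control characters spelled out.
password = "rs\x10\x1e\x19*  \x06"
print(illegal_chars(password))
# prints [(2, '0x10'), (3, '0x1e'), (4, '0x19'), (8, '0x6')]
```

Any character this flags cannot appear in a well-formed XML 1.0 document,
and XML 1.0 also requires character references such as &#16; to refer only
to characters that match Char, so escaping them does not help.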

Poking around in the interstices of libxml tells me that xmlNewChild()
carefully converts entities like &#99; back into the binary equivalent.
Preventing that (by doing things more "manually" or using
xmlNewTextChild()) produces output like the first example.

My (already sparse) hair is getting rapidly thinner! 

Dirk

-- 
Dirk Koopman <djk at tobit.co.uk>


