Perl generated XML - character sets

Matt Lawrence matt.lawrence at virgin.net
Tue Sep 18 12:02:20 BST 2007


alex at owal.co.uk wrote:
> I have to admit that I don't know much about character sets used in XML -
> and even less about when perl generates that XML.
>
> So I have some perl code which generates XML. On Linux it works fine, but
> on Solaris some funny characters come through - probably a euro symbol -
> and the generated file is no longer valid XML.
>   

The funny characters are probably either single-byte data (ISO-8859-15,
say, where the euro sign is the single byte 0xA4) where UTF-8 was
expected, or vice versa. Do you know which?
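
One way to tell is to compare a hex dump of the bad output with what the
euro sign looks like in each candidate encoding. A minimal sketch, assuming
the character really is a euro sign; hex_bytes() is just a helper made up
for this example:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the bytes of a string as space-separated hex, e.g. "E2 82 AC".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $euro = "\x{20AC}";    # U+20AC EURO SIGN as a Perl character string

# Show how the same character is laid out in three common encodings.
for my $encoding ('UTF-8', 'iso-8859-15', 'cp1252') {
    printf "%-12s %s\n", $encoding, hex_bytes(encode($encoding, $euro));
}
```

Three bytes (E2 82 AC) mean UTF-8; a lone A4 or 80 means a single-byte
encoding. Plain ISO-8859-1 has no euro sign at all.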

> Can anyone point me at stuff I need to read and learn to try and sort this
> out?
>   

perldoc Encode

Also see perldoc perlrun for the -C switch and the PERL_UNICODE
environment variable.

> Can anyone explain to me why it behaves differently on the different OS'es
> when AFAIK the perl versions are the same.
>   

Default character set behaviour is determined by the environment, not
just by the perl version. Compare the locale settings (LANG, LC_ALL,
LC_CTYPE) and the PERL_UNICODE environment variable on both machines.
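
As a rough sketch, running something like the following on both the Linux
and the Solaris box should show where they differ. The list of variables is
my guess at the usual suspects, not an exhaustive one:

```perl
use strict;
use warnings;

# Locale-related variables plus the one perl itself consults.
for my $var (qw(LANG LC_ALL LC_CTYPE PERL_UNICODE)) {
    my $value = defined $ENV{$var} ? $ENV{$var} : '(unset)';
    printf "%-14s = %s\n", $var, $value;
}

# ${^UNICODE} holds the numeric value of the -C / PERL_UNICODE settings.
printf "%-14s = %d\n", '${^UNICODE}', ${^UNICODE};

# The I/O layers currently applied to STDOUT.
my @layers = PerlIO::get_layers(\*STDOUT);
printf "%-14s = %s\n", 'STDOUT layers', join(' ', @layers);
```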


Specifying an output layer on your filehandles is a good way of ensuring
that your program is immune to these factors:

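# Treat outgoing data as bytes. This only turns off the handle's :utf8
# flag; it does not convert anything, so encode the data yourself first.
binmode(STDOUT, ':bytes');

# Encode outgoing data as UTF-8 (':encoding(UTF-8)' is the stricter form)
binmode(STDOUT, ':utf8');

# Of course, layers can be specified in open too:
open(my $fh, '>:bytes', $file) or die "Cannot open $file: $!";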
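
For the XML case specifically, the layer on the filehandle should agree
with the encoding named in the XML declaration. A minimal sketch, assuming
UTF-8 output; the file name and the element are invented for the example:

```perl
use strict;
use warnings;

my $file = 'example.xml';    # hypothetical output file

# The encoding layer converts character strings to UTF-8 bytes on output.
open(my $fh, '>:encoding(UTF-8)', $file)
    or die "Cannot open $file: $!";

# The declaration must name the same encoding as the layer above.
print {$fh} qq{<?xml version="1.0" encoding="UTF-8"?>\n};
print {$fh} qq{<price currency="EUR">\x{20AC}10</price>\n};

close($fh) or die "Cannot close $file: $!";
```

With this in place the output should be the same on both machines,
whatever the locale says.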

It's also important to know how your incoming data is encoded. Encode
can help you with that: decode() turns bytes in a known encoding into
Perl character strings.
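
A minimal sketch of that, assuming the incoming data is ISO-8859-15 (your
real source encoding may well be something else):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Incoming bytes, assumed here to be ISO-8859-15: 0xA4 is the euro sign.
my $bytes = "\xA4" . "10";

# Bytes in a known encoding -> Perl character string.
my $text = decode('iso-8859-15', $bytes);

# Character string -> UTF-8 bytes, ready to go into the XML.
my $utf8 = encode('UTF-8', $text);

printf "characters: %d, UTF-8 bytes: %d\n", length($text), length($utf8);
```

Decode as early as possible on input and encode as late as possible on
output, and the middle of the program only ever sees characters.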


Matt



More information about the london.pm mailing list