Perl generated XML - character sets
Matt Lawrence
matt.lawrence at virgin.net
Tue Sep 18 12:02:20 BST 2007
alex at owal.co.uk wrote:
> I have to admit that I don't know much about character sets used in XML -
> and even less about what happens when Perl generates that XML.
>
> So I have some Perl code which generates XML. On Linux it works fine, but
> on Solaris some funny characters come through - probably a euro symbol -
> and the generated file is no longer valid XML.
>
The funny characters are probably either bytes in a single-byte encoding
(such as ISO-8859-15) where UTF-8 was expected, or vice versa. Do you
know which?
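
One way to tell which case you have is to look at the raw bytes in the broken file. As a rough sketch (the hex_bytes helper is my own name for illustration, and the three encodings are just plausible candidates, not something I know about your setup), this prints the bytes a euro sign becomes in each encoding:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of $string after encoding it as $encoding
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $euro = "\x{20AC}";    # U+20AC EURO SIGN as a character string

# Candidate encodings are assumptions; add whichever your systems might use
printf "%-12s %s\n", $_, hex_bytes($_, $euro) for qw(UTF-8 iso-8859-15 cp1252);
```

If the file declares itself as UTF-8 but contains a lone A4 or 80 byte where the euro should be, the data was written in a single-byte encoding.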
> Can anyone point me at stuff I need to read and learn to try and sort this
> out?
>
perldoc Encode
Also see perldoc perlrun for the -C switch and the PERL_UNICODE environment variable.
> Can anyone explain to me why it behaves differently on the different OSes
> when AFAIK the Perl versions are the same.
>
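Default character set behaviour is determined by environmental factors,
so two machines with the same Perl version can still behave differently.
Check the locale settings (LANG, LC_ALL, LC_CTYPE) and the PERL_UNICODE
environment variable on both machines.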
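
To compare the two machines, something like the following could be run on each. It is only a sketch: env_report is a made-up helper name, and the list of variables is my guess at the ones most likely to matter here.

```perl
use strict;
use warnings;

# Hypothetical helper: collect settings that can influence default I/O encoding
sub env_report {
    my @lines;
    for my $var (qw(LC_ALL LC_CTYPE LANG PERL_UNICODE)) {
        my $value = defined $ENV{$var} ? $ENV{$var} : '(unset)';
        push @lines, sprintf '%-14s %s', $var, $value;
    }
    # ${^UNICODE} holds the flags set by -C or PERL_UNICODE (0 means none)
    push @lines, sprintf '%-14s %d', '${^UNICODE}', ${^UNICODE};
    # I/O layers currently applied to STDOUT
    push @lines, sprintf '%-14s %s', 'STDOUT layers',
        join ' ', PerlIO::get_layers(\*STDOUT);
    return @lines;
}

print "$_\n" for env_report();
```

Any line that differs between Linux and Solaris is a candidate explanation for the different output.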
Specifying an output layer on your filehandles is a good way of ensuring
that your program is immune to these factors:
# Treat outgoing data as raw bytes (no UTF-8 layer)
binmode(STDOUT, ':bytes');
# Make sure the outgoing data is UTF-8
binmode(STDOUT, ':utf8');
# Of course, layers can be specified in open too:
open(my $fh, '>:bytes', $file) or die "Cannot open $file: $!";
It's also important to know how your incoming data is encoded. Encode
can help you with that.
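
As a minimal sketch of the whole round trip, assuming purely for illustration that the input arrives as ISO-8859-15 bytes and that the XML should be UTF-8 (the file name price.xml and the sample string are made up):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the incoming bytes are ISO-8859-15, where 0xA4 is the euro sign
my $input_bytes = "Price: \xA4 10";

# Decode the bytes into a character string as early as possible
my $text = decode('iso-8859-15', $input_bytes);

# Write through an explicit layer so the bytes match the XML declaration
my $file = 'price.xml';    # hypothetical output path
open(my $out, '>:encoding(UTF-8)', $file) or die "Cannot open $file: $!";
print {$out} qq{<?xml version="1.0" encoding="UTF-8"?>\n<price>$text</price>\n};
close($out) or die "Cannot close $file: $!";
```

With both ends stated explicitly, the output should not depend on the locale or PERL_UNICODE of whichever machine runs it.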
Matt