Test::XML not working with UTF-8
John Ramsden-Developer
John.Ramsden2 at bbc.co.uk
Thu Feb 1 15:52:46 GMT 2007
[Many thanks for your prompt reply, Robin - my reply is interleaved]
-----Original Message-----
From: Robin Barker [mailto:Robin.Barker at npl.co.uk]
Sent: 01 February 2007 13:30
To: London.pm Perl M[ou]ngers; John Ramsden-Developer
Subject: RE: Test::XML not working with UTF-8
> > Anyway, getting to the point, I wonder if anyone has any ideas why
> > Test::XML fails to recognize UTF-8 characters, or can think of an
> > alternative I might use if Test::XML is no good for UTF-8.
>
> Test::XML uses XML::SemanticDiff which uses Digest::MD5.
>
> Perl 5.8 supports Unicode characters in strings. Since the
> MD5 algorithm is only defined for strings of bytes, it can
> not be used on strings that contain chars with ordinal
> number above 255. The MD5 functions and methods will croak
> if you try to feed them such input data.
>
> There is a workaround in Digest::MD5, which I have implemented
> in XML::SemanticDiff, patch below.
>
> I don't think Test::XML or XML::SemanticDiff know about
> encoding="UTF-8".
I also concluded I needed to call Encode::encode_utf8() on both
strings just before calling Test::XML::is_xml(). That returns each
string as UTF-8 encoded octets with the UTF-8 flag off, so it is
treated as a plain byte sequence (for a string Perl already holds
internally as UTF-8, the bytes themselves are unchanged).
This shouldn't affect the XML structure of the strings, and as
we are only comparing them for equality it doesn't matter what
the bytes represent. Sure enough, when I tried this on my test
XML strings the test passed.
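For anyone wanting to see the effect in isolation, here is a minimal sketch using only core modules. The sample string and variable names are made up for illustration, and the is_xml() call is left as a comment because it assumes Test::XML is installed from CPAN.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use Digest::MD5 qw(md5_hex);

# Hypothetical sample: a character string containing U+263A (ordinal > 255).
my $xml = "<face>\x{263A}</face>";

# Digest::MD5 croaks on wide characters, so trap the error with eval.
my $digest  = eval { md5_hex($xml) };
my $croaked = defined $digest ? 0 : 1;

# encode_utf8() returns the UTF-8 octets, which MD5 accepts.
my $bytes       = encode_utf8($xml);
my $byte_digest = md5_hex($bytes);

printf "croaked on character string: %s\n", $croaked ? "yes" : "no";
printf "UTF-8 flag before/after: %d/%d\n",
    (utf8::is_utf8($xml) ? 1 : 0), (utf8::is_utf8($bytes) ? 1 : 0);
printf "length before/after: %d/%d\n", length($xml), length($bytes);

# With Test::XML installed, the comparison would then look like:
#   is_xml(encode_utf8($got), encode_utf8($expected), 'XML matches');
```

The length changes because length() counts characters on the first string and octets on the second; the smiley is one character but three octets in UTF-8.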
> > P.S. Is there a difference between 'use utf8' and 'use
> > encoding utf8'? One of my colleagues reckons they are
> > equivalent.
>
> Unless your perl script file is utf8 encoded, you don't need
> either. Yours is all ASCII: \x{263A} is just 8 ASCII characters.
I think one or both of these have more impact than just telling
Perl to read the script file as UTF-8.
For example, according to the documentation for encoding::warnings,
'use encoding utf8' tells the PerlIO layer to assume an implicit
':utf8' in file open mode specs, and likewise for STDIN and STDOUT,
unless a character encoding is explicitly specified in the
open() or binmode() call respectively.
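Whichever pragma is in force, stating the encoding explicitly in open() and binmode() avoids relying on any implicit default. A rough sketch follows; it uses an in-memory file purely so it runs without touching disk, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# The source here is pure ASCII, yet the escape yields one character at run time.
my $smiley = "\x{263A}";

# Write through an explicit encoding layer (in-memory file for the demo).
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} $smiley;
close($out) or die "close: $!";

# Read it back through the same layer to recover the character.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $round_trip = <$in>;
close($in);

# The same idea applied to a standard handle.
binmode(STDOUT, ':encoding(UTF-8)');
printf "chars: %d, octets written: %d, round trip ok: %s\n",
    length($smiley), length($buffer), ($round_trip eq $smiley ? "yes" : "no");
```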
>
> Robin
>
Cheers
John R Ramsden
http://www.bbc.co.uk/