utf8 oddness
Aaron Crane
perl at aaroncrane.co.uk
Wed Jun 10 17:47:50 BST 2009
Paul Makepeace writes:
> rpix:~$ perl -le 'print ord("À")'
> 195
>
> What does 195 refer to? 195 is \xC3 which is another character,
> according to http://jeppesn.dk/utf-8.html (A~ versus A`)
For backward compatibility, Perl assumes that source code is in
Latin-1. I assume that your terminal uses UTF-8. So when you type À,
you get the UTF-8 representation of that Unicode character (U+00C0),
which consists of the two bytes 0xC3 and 0x80. With Latin-1-encoded
source, Perl treats a string literal containing those bytes as the
two-character string "\xC3\x80"; then ord returns the codepoint for
the first character in that string, which is 0xC3, or 195 in decimal.
You can change this by telling Perl that your source code is in UTF-8:
$ perl -le 'use utf8; print ord("À")'
192
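The same effect can be reproduced without depending on how the script
file itself is encoded. This is a minimal sketch, assuming only the
core Encode module; the variable names are illustrative, and decoding
the bytes by hand only approximates what "use utf8" does for a literal.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes a UTF-8 terminal sends when you type À.
my $typed = "\xC3\x80";

# Without "use utf8", Perl sees two Latin-1 characters.
my $chars_as_latin1 = length $typed;    # 2
my $first_codepoint = ord $typed;       # 195 (0xC3)

# Decoding the bytes as UTF-8 yields the one character that was meant.
my $decoded           = decode('UTF-8', $typed);
my $chars_as_utf8     = length $decoded;    # 1
my $decoded_codepoint = ord $decoded;       # 192 (0xC0)

print "$chars_as_latin1 $first_codepoint $chars_as_utf8 $decoded_codepoint\n";
```

This prints "2 195 1 192".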
> rpix:~$ perl -le 'print chr(195)'
> ##
>
> What's happening here?
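Perl also assumes that stdout is Latin-1. chr returns a one-character
string, which gets printed to the Latin-1-encoded stdout as a single
0xC3 byte. On its own that byte isn't valid UTF-8 (0xC3 starts a
two-byte sequence, and no continuation byte follows), so a UTF-8
terminal can't display it.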
There are various ways to tell Perl that stdout is UTF-8, including these:
$ perl -CS -le 'print chr(195)'
Ã
$ perl -le 'binmode STDOUT, ":utf8"; print chr(195)'
Ã
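To see the bytes rather than what the terminal makes of them, you can
print to an in-memory filehandle instead of stdout. This is a sketch
under the assumption that an in-memory handle with no layer behaves
like the default stdout here, and one opened with ":utf8" behaves like
stdout under -CS or binmode; bytes_written is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string to an in-memory filehandle opened
# with the given mode, and return the bytes written as a hex string.
sub bytes_written {
    my ($mode, $string) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "Cannot close in-memory handle: $!";
    return unpack 'H*', $buffer;
}

my $default_layer = bytes_written('>',      chr(195));  # no encoding layer
my $utf8_layer    = bytes_written('>:utf8', chr(195));  # UTF-8 layer

print "default: $default_layer\n";    # c3
print "utf8:    $utf8_layer\n";       # c383
```

The second result, 0xC3 0x83, is the UTF-8 encoding of U+00C3, which
is why the terminal shows Ã once the layer is in place.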
> rpix:~$ perl -le 'print "\xc3\x80"'
> À
>
> (So printing utf8 octets produces something reasonable.)
But only by coincidence -- the stream you printed to actually expects
to receive UTF-8-encoded data, but Perl thinks the stream uses Latin-1
encoding. The only reason it seemed to work is that you just happened
to print a string whose Latin-1-encoded bytes could be reinterpreted
by your terminal as valid UTF-8.
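A sketch of why this is a coincidence, using encode() to stand in for
what each kind of stream would receive (an assumption made for
illustration; the real work is done by the filehandle's layers):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "\xC3\x80";    # two characters: U+00C3 and U+0080

# What a Latin-1 stream receives: one byte per character.
my $sent_as_latin1 = unpack 'H*', encode('iso-8859-1', $string);

# What a stream declared as UTF-8 would receive: every character
# encoded separately.
my $sent_as_utf8 = unpack 'H*', encode('UTF-8', $string);

print "latin1: $sent_as_latin1\n";    # c380
print "utf8:   $sent_as_utf8\n";      # c383c280
```

Only the first byte sequence happens to be the UTF-8 encoding of À.
Had Perl known that stdout was UTF-8, the same string would have come
out as Ã followed by a control character.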
> rpix:~$ perl -MEncode -le 'print decode("iso-8859-1", chr(195))'
> ##
>
> What's this doing? Presumably chr(195) isn't \xC3 in Latin-1 so what is it?
On the contrary, chr(195) eq chr(0xC3) eq "\xC3" always.
Your code here manufactures a string containing the single character
with codepoint U+00C3; that string is byte-encoded internally, so it
consists of the single byte 0xC3. Then decode() takes that single-byte
string and decodes it from Latin-1 into a Perl character string, and
your code prints the result. Whichever of Perl's two internal
encodings the result uses, it still holds the single character U+00C3,
so printing it to a stdout that Perl believes is Latin-1 emits the
same lone 0xC3 byte. The decode() step made no visible difference at
all.
This code is questionable, by the way. chr returns a string in either
of Perl's two internal encodings, but decode expects a byte-encoded
string. In this case it won't matter, because chr in all current
Perls produces a byte-encoded string for codepoints <= 255. But if
the 195 varied at run time, and the actual value could be greater than
255, decode would die, because a string containing a character above
255 can't be treated as a sequence of bytes.
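A minimal sketch of both cases, assuming a reasonably recent Encode;
300 is just an arbitrary codepoint above 255, and the exact error
message is not checked because it varies between versions.

```perl
use strict;
use warnings;
use Encode qw(decode);

# For a codepoint <= 255, decoding from Latin-1 leaves the character
# unchanged.
my $narrow    = chr(195);
my $unchanged = decode('iso-8859-1', $narrow) eq chr(195) ? 1 : 0;

# For a codepoint > 255 the input is not a byte string, so decode()
# throws and the eval returns undef.
my $wide    = chr(300);
my $decoded = eval { decode('iso-8859-1', $wide) };
my $died    = defined $decoded ? 0 : 1;

print "unchanged: $unchanged, died on wide input: $died\n";
```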
> rpix:~$ perl -MEncode -le '$a = chr(195); print decode("iso-8859-1",
> $a, Encode::FB_CROAK)'
> ##
>
> Why no croaking?
Because it was possible to decode the input without error as Latin-1.
More generally, *any* Perl string which is byte-encoded internally
can be decoded without error as Latin-1, because every byte value
from 0x00 to 0xFF maps to a character in Latin-1.
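This can be checked exhaustively. The sketch below assumes Encode's
"iso-8859-1" and strict "UTF-8" encodings, and passes LEAVE_SRC so
that the input variables aren't modified by the check.

```perl
use strict;
use warnings;
use Encode ();

my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Every possible byte decodes cleanly as Latin-1, to the codepoint
# with the same number, even with FB_CROAK requested.
my $latin1_failures = 0;
for my $n (0 .. 255) {
    my $byte    = chr $n;
    my $decoded = eval { Encode::decode('iso-8859-1', $byte, $check) };
    $latin1_failures++
        unless defined $decoded && ord($decoded) == $n;
}

# By contrast, a lone 0xC3 byte is not valid UTF-8, so here FB_CROAK
# does make decode() die.
my $lone_byte = "\xC3";
my $utf8_ok   = eval { Encode::decode('UTF-8', $lone_byte, $check); 1 }
    ? 1 : 0;

print "Latin-1 failures: $latin1_failures\n";       # 0
print "lone 0xC3 is valid UTF-8: $utf8_ok\n";       # 0
```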
> rpix:~$ perl -MEncode=from_to -le '$a = chr(195); from_to($a,
> "iso-8859-1", "utf8", Encode::FB_CROAK); print $a'
> Ã
> rpix:~$
>
> Ah, from_to works where decode didn't. But why? My understanding is
> that from_to is the same except leaves the utf8 flag off. Reassuringly
> at least, the character printed there IS Latin-1's \xC3 (not the
> slightly different accent).
Your use of from_to() here modifies $a in place, and is roughly
equivalent to
    $a = encode("utf8", decode("iso-8859-1", $a))
The important part is the encode() step: it encodes the output string
to the bytes that represent it in UTF-8. Since your terminal uses
UTF-8, this produces output you can see. (Telling Perl that stdout
is UTF-8-encoded has the same effect, but the transcoding to UTF-8
happens where you can't see it and don't have to worry about it.)
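A short sketch comparing the two spellings; the variable names are
illustrative, and the results are shown as hex so that the output
doesn't depend on your terminal.

```perl
use strict;
use warnings;
use Encode qw(encode decode from_to);

# from_to() transcodes its first argument in place.
my $in_place = chr(195);
from_to($in_place, 'iso-8859-1', 'utf8');

# The same conversion spelled out as decode-then-encode.
my $two_step = encode('utf8', decode('iso-8859-1', chr(195)));

my $in_place_hex = unpack 'H*', $in_place;
my $two_step_hex = unpack 'H*', $two_step;

print "from_to:       $in_place_hex\n";    # c383
print "encode/decode: $two_step_hex\n";    # c383
```

Both produce the two bytes 0xC3 0x83, the UTF-8 encoding of U+00C3.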
> rpix:~$ perl -MEncode -le 'print Encode::is_utf8("À")'
>
> How can this not be true?
Because it's a two-character byte-encoded string; there's no UTF-8
here, since you haven't told Perl to expect any, and you haven't used
any characters whose codepoint is high enough to require Perl to use
UTF-8. And Encode::is_utf8() is documented to just examine the
internal flag that indicates which internal encoding is in use.
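A sketch of when the flag is and isn't set. The byte string is written
with escapes so the example doesn't depend on the source encoding, and
U+263A is just an arbitrary codepoint above 255.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

my $bytes = "\xC3\x80";                # what "À" is without "use utf8"
my $wide  = chr(0x263A);               # too high to fit in one byte
my $text  = decode('UTF-8', $bytes);   # the single character U+00C0

my $bytes_flag = is_utf8($bytes) ? 1 : 0;
my $wide_flag  = is_utf8($wide)  ? 1 : 0;
my $text_flag  = is_utf8($text)  ? 1 : 0;

print "bytes: $bytes_flag, wide: $wide_flag, decoded: $text_flag\n";
```

This prints "bytes: 0, wide: 1, decoded: 1".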
> rpix:~$ perl -MEncode -le 'print Encode::is_utf8("À", Encode::FB_CROAK)'
>
> It's not utf8 but it's not croaking either, ...?
The second argument to Encode::is_utf8() isn't for specifying fallback
behaviour; it's for saying that, if the string is internally marked as
using the multi-byte UTF-8-like encoding, its data should also be
examined to see whether it's valid in that encoding. But since the
internal flag says "no UTF-8 on this string", that extra check never
comes into play, and there is nothing to croak about.
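A brief sketch of the second argument in use. Any true value requests
the validity check, so passing Encode::FB_CROAK simply acts as "true";
the strings are the same illustrative ones a reader could type at the
command line.

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "\xC3\x80";      # byte-encoded, flag off
my $wide  = chr(0x263A);     # flag on, well-formed internally

# Flag off: the result is false, and nothing dies.
my $bytes_checked = eval {
    Encode::is_utf8($bytes, Encode::FB_CROAK()) ? 1 : 0;
};

# Flag on and data well-formed: the result is true.
my $wide_checked = Encode::is_utf8($wide, 1) ? 1 : 0;

print "bytes: $bytes_checked, wide: $wide_checked\n";    # bytes: 0, wide: 1
```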
For more information on all this, I recommend Juerd's perlunitut
documentation, as found in Perl 5.8.9 and 5.10.
--
Aaron Crane ** http://aaroncrane.co.uk/