Should UTF-8 be a swear word?

Nicholas Clark nick at ccl4.org
Tue Aug 8 14:50:46 BST 2006


On Tue, Aug 08, 2006 at 03:27:54PM +0200, Thomas Busch wrote:

> maybe someone can help on the following UTF-8 issue
> which left a few perl engineers angry and frustrated.
> As a matter of fact in my office UTF-8 is currently a
> swear word.
> 
> I'm using perl 5.8.6 and for some strange reason the
> following program:
> 
> #!/usr/bin/perl
> 
> use strict ;
> 
> my $string = "cl\xe9ment";
> 
> utf8::upgrade($string);
> 
> if (utf8::is_utf8($string)) {
>   print "is utf8\n";
> }
> 
> if (utf8::valid($string)) {
>   print "is valid utf8\n";
> }
> 
> if ($string =~ m/\xe9/) {
>   print "match \\xE9\n";
> }
> 
> if ($string =~ m/\x{c3a9}/) {
>   print "match \\xC3A9\n";
> }
> 
> yields
> 
> is utf8
> is valid utf8
> match \xE9
> 
> instead of
> 
> is utf8
> is valid utf8
> match \xC3A9
> 
> Is this a bug? Why is the Latin e with acute not

No.

> getting upgraded to UTF-8?

It is.

You're misunderstanding how things work.
\x{c3a9} is a single character, (apparently) HANGUL SYLLABLE SSOB, not the
two bytes 0xC3 0xA9.

[modulo real bugs, which are now minimal in the core, but not necessarily
minimal in CPAN modules]:

String comparison and matching are done on a character-by-character basis.(*)
In Perl space you're dealing with characters. The encoding used internally
should be treated as a black box.
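To make the characters-versus-bytes point concrete, here is a small sketch (not from the original mail) showing that two strings holding the same characters match and compare identically whether or not one of them has been upgraded:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two copies of the same seven characters; after the upgrade only the
# internal (black box) representation differs.
my $bytes = "cl\xe9ment";
my $chars = "cl\xe9ment";
utf8::upgrade($chars);

# The regexp engine compares characters, not internal bytes, so both match.
print "bytes match\n" if $bytes =~ /\xe9/;
print "chars match\n" if $chars =~ /\xe9/;

# Likewise eq compares character by character.
print "equal\n" if $bytes eq $chars;
```

All three lines print, which is exactly the behaviour the original poster saw and mistook for a bug.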

utf8::upgrade() doesn't do any character-to-byte conversion in Perl space.
It only changes something inside the black box, and that change is observable
at all only because of the bug marked '*'.

If what you want is to convert a Perl string (which is a sequence of Unicode
characters) into a sequence of UTF-8 bytes for passing to something external,
you want utf8::encode().
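A sketch of that difference: utf8::encode() really does turn characters into bytes, so after it the C3 A9 pair the original poster was looking for is actually present in the string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "cl\xe9ment";      # 7 characters, one of them U+00E9
print length($string), "\n";    # 7

utf8::encode($string);          # in-place: characters -> UTF-8 bytes
print length($string), "\n";    # 8: the e-acute is now two bytes

# Now this is a byte string, and the byte pair is really there.
print "match C3 A9\n" if $string =~ /\xc3\xa9/;
```

Note that after utf8::encode() the string is bytes, not characters, so it is only suitable for handing to something external (a file, a socket, a C library), not for further text processing.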

In an ideal world the bug marked '*' would not have existed, nor would
utf8::upgrade() and utf8::downgrade().

Possibly the two should be deprecated for 5.10, with alternatives provided
in the Internals:: namespace.

Nicholas Clark

* Apart from the big implementation bug that is hard to fix. If a string
  internally is encoded in UTF-8, then it is also treated with Unicode
  semantics. For example code point 0xE9 is recognised as a lower case
  E acute, and will uppercase to É, and it will match classes like \w in
  the regexp engine. Whereas if the same string is stored as raw bytes, that
  byte is not treated as a letter - only the ASCII letters are letters, etc.

  There is a single flag bit in the internals used both to mean "characters
  to be treated as Unicode" and "characters stored as UTF-8". If two bits
  had been used, there would be less confusion. Also, if 5.6 hadn't been as
  buggy, there would be less confusion.
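  The asymmetry this footnote describes can be demonstrated directly. This
  sketch assumes a perl running with the historical default semantics (no
  unicode_strings feature in effect), where the same code point behaves
  differently depending on how it happens to be stored:

```perl
#!/usr/bin/perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'": we want the historical
# behaviour where semantics follow the internal representation.

my $byte = "\xe9";        # e-acute stored as a single raw byte
my $char = "\xe9";
utf8::upgrade($char);     # same code point, stored internally as UTF-8

# Stored as a raw byte: not a \w character, and uc() leaves it alone.
printf "byte: \\w %s, uc unchanged: %s\n",
    ($byte =~ /\w/ ? "matches" : "fails"),
    (uc($byte) eq "\xe9" ? "yes" : "no");

# Stored as UTF-8: Unicode semantics apply, so it is a letter and
# uppercases to E acute (U+00C9).
printf "char: \\w %s, uc is U+00C9: %s\n",
    ($char =~ /\w/ ? "matches" : "fails"),
    (uc($char) eq "\xc9" ? "yes" : "no");
```

  Later perls added use feature 'unicode_strings' precisely to decouple the
  semantics from the internal storage, which is the missing second flag bit
  described above.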


More information about the london.pm mailing list