Should UTF-8 be a swear word ?

Thomas Busch tbusch at cpan.org
Tue Aug 8 17:18:19 BST 2006


Hi Nicholas,

I get it now. Can you confirm the following:

1) $string =~ m/\w/ will match any European accented character,
   including the German sz (also called scharfes S), if $string
   has the UTF8 flag on.

2) \xE9 actually means U+00E9. What I mean by this is that
   \x{...} is Unicode code point notation and not UTF-8 bytes.

3) no matter how $string is encoded, binmode STDOUT, ":utf8"
   will force print "..." to always output in UTF-8. There will
   be no double encoding.

Also this triggers two new questions. 

a) Is there an efficient way to say to perl, "please downgrade
   this string to latin1 if possible otherwise leave it in UTF-8" ?

b) What happens in the case of $s1 =~ m/$s2/ if $s2 has the
   UTF8 flag on and $s1 hasn't ? Does this work as expected ?

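Both questions can be probed with a small script. This is only a sketch:
it assumes the optional second ("fail ok") argument of utf8::downgrade()
described in the utf8 module documentation, and the variable names are
made up for illustration.

```perl
use strict;
use warnings;

# (a) Try to downgrade; with a true second argument utf8::downgrade()
#     returns false instead of dying when a character does not fit
#     into a single byte, and leaves the string as it was.
my $latin = "cl\xe9ment";
utf8::upgrade($latin);                   # stored as UTF-8 internally
my $wide = "cl\xe9ment \x{263A}";        # U+263A cannot fit in one byte

my $latin_ok = utf8::downgrade($latin, 1) ? 1 : 0;
my $wide_ok  = utf8::downgrade($wide, 1)  ? 1 : 0;

printf "latin: downgraded=%d flag=%d\n", $latin_ok, utf8::is_utf8($latin) ? 1 : 0;
printf "wide:  downgraded=%d flag=%d\n", $wide_ok,  utf8::is_utf8($wide)  ? 1 : 0;

# (b) Match a flagged pattern against an unflagged string.
my $s1 = "cl\xe9ment";                   # UTF8 flag off
my $s2 = "\xe9";
utf8::upgrade($s2);                      # UTF8 flag on
my $mixed_match = $s1 =~ /\Q$s2\E/ ? 1 : 0;
print "mixed-flag match: $mixed_match\n";
```

If the match in (b) succeeds, that fits the idea that matching works on
characters rather than on the internal bytes.
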
Thomas.

> On Tue, Aug 08, 2006 at 03:27:54PM +0200, Thomas Busch wrote:
> 
> > maybe someone can help on the following UTF-8 issue
> > which left a few perl engineers angry and frustrated.
> > As a matter of fact in my office UTF-8 is currently a
> > swear word.
> > 
> > I'm using perl 5.8.6 and for some strange reason the
> > following program:
> > 
> > #!/usr/bin/perl
> > 
> > use strict ;
> > 
> > my $string = "cl\xe9ment";
> > 
> > utf8::upgrade($string);
> > 
> > if (utf8::is_utf8($string)) {
> >   print "is utf8\n";
> > }
> > 
> > if (utf8::valid($string)) {
> >   print "is valid utf8\n";
> > }
> > 
> > if ($string =~ m/\xe9/) {
> >   print "match \\xE9\n";
> > }
> > 
> > if ($string =~ m/\x{c3a9}/) {
> >   print "match \\xC3A9\n";
> > }
> > 
> > yields
> > 
> > is utf8
> > is valid utf8
> > match \xE9
> > 
> > instead of
> > 
> > is utf8
> > is valid utf8
> > match \xC3A9
> > 
> > Is this a bug ? Why is the latin e letter with acute not
> 
> No.
> 
> > getting upgraded to UTF-8 ?
> 
> It is.
> 
> You're misunderstanding how things work.
> \x{c3a9} is (apparently) the character HANGUL SYLLABLE SSYEOLG
> 
> [modulo real bugs, which are now minimal in the core, but not necessarily
> minimal in CPAN modules]:
> 
> String comparison and matching are done on a character-by-character basis.(*)
> In Perl space you're dealing with characters. The encoding used internally
> should be treated as a black box.
> 
> utf8::upgrade() doesn't do any character-to-byte conversion in Perl space.
> It does something to the black box, related to the bug marked '*'.
> 
> If what you want is to convert a Perl string (which is Unicode characters)
> into a sequence of UTF-8 bytes for passing to something external, you
> want utf8::encode().
> 
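The difference between the two calls can be seen in a short sketch. It
assumes only the core utf8:: functions already mentioned in this thread;
the variable names are illustrative.

```perl
use strict;
use warnings;

my $chars = "cl\xe9ment";        # 7 characters; e-acute is code point 0xE9

my $upgraded = $chars;
utf8::upgrade($upgraded);        # changes the internal storage only

my $octets = $chars;
utf8::encode($octets);           # replaces each character by its UTF-8 bytes

printf "upgraded: length %d, same characters: %s\n",
    length($upgraded), ($upgraded eq $chars ? "yes" : "no");
printf "encoded:  length %d, contains bytes C3 A9: %s\n",
    length($octets), ($octets =~ /\xC3\xA9/ ? "yes" : "no");
```

Note that the pattern here is /\xC3\xA9/ (two bytes), not /\x{c3a9}/,
which names the single character U+C3A9.
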
> In an ideal world the bug marked '*' would not have existed, nor would
> utf8::upgrade() and utf8::downgrade().
> 
> Possibly the two should be deprecated for 5.10, with alternatives provided
> in the Internals:: namespace.
> 
> Nicholas Clark
> 
> * Apart from the big implementation bug that is hard to fix. If a string
>   internally is encoded in UTF-8, then it is also treated with Unicode
>   semantics. For example code point 0xE9 is recognised as a lower case
>   E acute, and will uppercase to É, and it will match classes like \w in
>   the regexp engine. Whereas if the same string is stored as raw bytes, that
>   byte is not treated as a letter - only the ASCII letters are letters, etc.
> 
>   There is a single flag bit in the internals used both to mean "characters
>   to be treated as Unicode" and "characters stored as UTF-8". If two bits
>   had been used, there would be less confusion. Also, if 5.6 hadn't been as
>   buggy, there would be less confusion.
> 
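The footnote can be reproduced with a minimal sketch. It assumes a perl
running with default semantics: no "use locale", no "use feature
'unicode_strings'", and no /u or /a regex modifiers, all of which change
the result. Variable names are illustrative.

```perl
use strict;
use warnings;

my $native   = "\xe9";           # stored as a single raw byte
my $upgraded = "\xe9";
utf8::upgrade($upgraded);        # same character, stored as UTF-8

# Same character as far as comparison is concerned...
my $equal = $native eq $upgraded ? 1 : 0;

# ...but different answers for character-class and case questions.
my $native_w    = $native   =~ /\w/ ? 1 : 0;
my $upgraded_w  = $upgraded =~ /\w/ ? 1 : 0;
my $native_uc   = sprintf "%02X", ord uc $native;
my $upgraded_uc = sprintf "%02X", ord uc $upgraded;

print "equal: $equal\n";
print "\\w matches: native=$native_w upgraded=$upgraded_w\n";
print "uc gives:   native=$native_uc upgraded=$upgraded_uc\n";
```

The two strings compare equal, yet only the upgraded one is treated as a
letter; that is the single-flag confusion described above.
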