Should UTF-8 be a swear word ?

Tue Aug 8 17:23:06 BST 2006

sorry about the spelling. apparently i cannot spell
and understand utf8 issues at the same time.

thomas.

> I get it now. Can you confirm the following:
> 
> 1) $string =~ m/\w/ will match any European accented character
>    including the German sz (also called scharfes S) if $string
>    has the UTF8 flag on.
> 
> 2) \xE9 actually means U+00E9. What I mean by this is that
>    \x{...} refers to Unicode code point notation and not to UTF-8.
> 
> 3) no matter how $string is encoded, binmode STDOUT, ":utf8"
>    will force print "..." to always output in UTF-8. There will
>    be no double encoding.
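
A minimal sketch for checking points 1 to 3. It assumes a perl 5.8 or later, run without "use locale", "use feature 'unicode_strings'" or a "use v5.12"-style declaration, since any of those changes how flag-off strings are treated. The variable names are only illustration, and the in-memory filehandles stand in for STDOUT so that the bytes written can be inspected. One caveat on point 3: the layer encodes whatever characters the string holds, so a string that already contains UTF-8 bytes (for example after utf8::encode) would still come out double encoded.

```perl
use strict;
use warnings;

# e acute and sharp s, stored as native 8-bit bytes (UTF8 flag off).
my $native = "\xe9\xdf";

# The same two characters, stored internally as UTF-8 (UTF8 flag on).
my $upgraded = $native;
utf8::upgrade($upgraded);

# Point 1: count the \w matches in each copy.
my $w_native   = () = $native   =~ /\w/g;
my $w_upgraded = () = $upgraded =~ /\w/g;

# Point 2: \x{...} names a code point, so \x{c3a9} is a single character.
my $one_char = "\x{c3a9}";
my $len      = length $one_char;
my $same     = ("\xe9" eq "\x{e9}" && "\xe9" eq chr(0xE9)) ? 1 : 0;

# Point 3: write both copies through a :utf8 layer and compare the bytes.
my ($out_native, $out_upgraded) = ('', '');
open my $fh1, '>:utf8', \$out_native   or die "open failed: $!";
print {$fh1} $native;
close $fh1;
open my $fh2, '>:utf8', \$out_upgraded or die "open failed: $!";
print {$fh2} $upgraded;
close $fh2;
my $same_bytes = $out_native eq $out_upgraded ? 1 : 0;

printf "\\w matches: flag off %d, flag on %d\n", $w_native, $w_upgraded;
printf "length of \\x{c3a9}: %d, \\xe9 eq \\x{e9}: %d\n", $len, $same;
printf "layer output identical: %d, bytes written: %d\n",
    $same_bytes, length $out_native;
```

Under those assumptions the flag-off copy should get no \w matches at all, while both copies should produce the same bytes on output.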
> 
> Also this triggers two new questions. 
> 
> a) Is there an efficient way to say to perl, "please downgrade
>    this string to latin1 if possible otherwise leave it in UTF-8" ?
> 
> b) What happens in the case of $s1 =~ m/$s2/ if $s2 has the
>    UTF8 flag on and $s1 hasn't ? Does this work like expected ?
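
A short sketch for experimenting with (a) and (b), not a definitive answer. For (a) it relies on the optional second argument of utf8::downgrade(), which makes a failed downgrade return false instead of dying; whether that is efficient enough for a given workload is not something this shows. For (b) it simply matches a flag-off string against a flag-on pattern. All variable names are made up for the example.

```perl
use strict;
use warnings;

# (a) A string that fits in latin1: the downgrade succeeds, flag goes off.
my $latin = "cl\xe9ment";
utf8::upgrade($latin);
my $ok_latin = utf8::downgrade($latin, 1) ? 1 : 0;

# A string with EURO SIGN (U+20AC) does not fit in latin1: the call
# returns false and the string is left as it was.
my $wide    = "cl\x{e9}ment \x{20ac}";
my $ok_wide = utf8::downgrade($wide, 1) ? 1 : 0;

# (b) $s1 has the flag off, $s2 has it on; matching compares characters.
my $s1 = "cl\xe9ment";
my $s2 = "\xe9";
utf8::upgrade($s2);
my $matched = $s1 =~ m/$s2/ ? 1 : 0;

printf "latin1 string: downgraded=%d flag=%d\n",
    $ok_latin, utf8::is_utf8($latin) ? 1 : 0;
printf "wide string:   downgraded=%d flag=%d length=%d\n",
    $ok_wide, utf8::is_utf8($wide) ? 1 : 0, length $wide;
printf "mixed-flag match: %d\n", $matched;
```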
> 
> Thomas.
> 
> > On Tue, Aug 08, 2006 at 03:27:54PM +0200, Thomas Busch wrote:
> > 
> > > maybe someone can help on the following UTF-8 issue
> > > which left a few perl engineers angry and frustrated.
> > > As a matter of fact in my office UTF-8 is currently a
> > > swear word.
> > > 
> > > I'm using perl 5.8.6 and for some strange reason the
> > > following program:
> > > 
> > > #!/usr/bin/perl
> > > 
> > > use strict ;
> > > 
> > > my $string = "cl\xe9ment";
> > > 
> > > utf8::upgrade($string);
> > > 
> > > if (utf8::is_utf8($string)) {
> > >   print "is utf8\n";
> > > }
> > > 
> > > if (utf8::valid($string)) {
> > >   print "is valid utf8\n";
> > > }
> > > 
> > > if ($string =~ m/\xe9/) {
> > >   print "match \\xE9\n";
> > > }
> > > 
> > > if ($string =~ m/\x{c3a9}/) {
> > >   print "match \\xC3A9\n";
> > > }
> > > 
> > > yields
> > > 
> > > is utf8
> > > is valid utf8
> > > match \xE9
> > > 
> > > instead of
> > > 
> > > is utf8
> > > is valid utf8
> > > match \xC3A9
> > > 
> > > Is this a bug ? Why is the latin e letter with acute not
> > 
> > No.
> > 
> > > getting upgraded to UTF-8 ?
> > 
> > It is.
> > 
> > You're misunderstanding how things work.
> > \x{c3a9} is (apparently) the character HANGUL SYLLABLE SSOB
> > 
> > [modulo real bugs, which are now minimal in the core, but not necessarily
> > minimal in CPAN modules]:
> > 
> > String comparison and matching is done on a character by character
> > basis.(*)
> > In Perl space you're dealing with characters. The encoding used internally
> > should be treated as a black box.
> > 
> > utf8::upgrade() doesn't do any character to byte conversion in perl space.
> > It does something to the black box, related to the bug marked '*'
> > 
> > If what you want is to convert Perl string (which is Unicode characters)
> > into a sequence of UTF-8 bytes for passing to something external, you
> > want utf8::encode().
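
To illustrate the difference between utf8::upgrade() and utf8::encode() described above, here is a small sketch built on the string from the original program. The variable names are invented for the example. It shows that upgrading leaves the character count and the matches unchanged, while encoding turns the single character U+00E9 into the two bytes 0xC3 0xA9.

```perl
use strict;
use warnings;

# Seven characters; utf8::upgrade() only changes the internal storage.
my $chars = "cl\xe9ment";
utf8::upgrade($chars);
my $chars_len    = length $chars;
my $chars_e9     = $chars =~ /\xe9/     ? 1 : 0;   # U+00E9 is in the string
my $chars_hangul = $chars =~ /\x{c3a9}/ ? 1 : 0;   # U+C3A9 is not

# utf8::encode() replaces the characters with their UTF-8 bytes, in place.
my $bytes = $chars;
utf8::encode($bytes);
my $bytes_len   = length $bytes;
my $bytes_c3_a9 = $bytes =~ /\xc3\xa9/ ? 1 : 0;    # two separate bytes

printf "characters: length=%d match \\xe9=%d match \\x{c3a9}=%d\n",
    $chars_len, $chars_e9, $chars_hangul;
printf "bytes:      length=%d match \\xc3\\xa9=%d flag=%d\n",
    $bytes_len, $bytes_c3_a9, utf8::is_utf8($bytes) ? 1 : 0;
```

The pattern that finds the encoded form is \xc3\xa9 (two bytes), not \x{c3a9} (one code point).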
> > 
> > In an ideal world the bug marked '*' would not have existed, nor would
> > utf8::upgrade() and utf8::downgrade().
> > 
> > Possibly the two should be deprecated for 5.10, with alternatives provided
> > in the Internals:: namespace.
> > 
> > Nicholas Clark
> > 
> > * Apart from the big implementation bug that is hard to fix. If a string
> >   internally is encoded in UTF-8, then it is also treated with Unicode
> >   semantics. For example code point 0xE9 is recognised as a lower case
> >   E acute, and will uppercase to É, and it will match classes like \w in
> >   the regexp engine. Whereas if the same string is stored as raw bytes,
> >   that byte is not treated as a letter - only the ASCII letters are
> >   letters, etc.
> > 
> >   There is a single flag bit in the internals used both to mean
> >   "characters to be treated as Unicode" and "characters stored as UTF-8".
> >   If two bits had been used, there would be less confusion. Also, if 5.6
> >   hadn't been as buggy, there would be less confusion.
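
The footnote can be reproduced with a few lines. This is only a sketch, and it assumes the script runs without "use locale", "use feature 'unicode_strings'" or a "use v5.12"-style declaration, which would change the result. Both variables hold the same single character, code point 0xE9, and differ only in internal storage, yet uc() treats them differently.

```perl
use strict;
use warnings;

# The same character twice: once as a raw byte, once stored as UTF-8.
my $native   = "\xe9";
my $upgraded = "\xe9";
utf8::upgrade($upgraded);

# Comparison is character by character, so the two are equal ...
my $equal = $native eq $upgraded ? 1 : 0;

# ... but only the flag-on copy gets Unicode case mapping.
my $uc_native   = ord uc $native;     # stays 0xE9 under the assumptions above
my $uc_upgraded = ord uc $upgraded;   # becomes 0xC9, capital E acute

printf "equal: %d\n", $equal;
printf "uc, flag off: U+%04X\n", $uc_native;
printf "uc, flag on:  U+%04X\n", $uc_upgraded;
```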
> > 
> 
> 
> 
>