Should UTF-8 be a swear word ?
Thomas Busch
tbusch at cpan.org
Tue Aug 8 17:23:06 BST 2006
Sorry about the spelling. Apparently I cannot spell
and understand UTF-8 issues at the same time.
thomas.
> I get it now. Can you confirm the following:
>
> 1) $string =~ m/\w/ will match any European accented character,
> including the German sz (also called scharfes S), if $string
> has the UTF8 flag on.
>
> 2) \xE9 actually means U+00E9. What I mean by this is that
> \x{...} refers to a Unicode code point and not to UTF-8 bytes.
>
> 3) no matter how $string is encoded, binmode STDOUT, ":utf8"
> will force print "..." to always output in UTF-8. There will
> be no double encoding.
>
> Also this triggers two new questions.
>
> a) Is there an efficient way to say to perl, "please downgrade
> this string to latin1 if possible otherwise leave it in UTF-8" ?
>
> b) What happens in the case of $s1 =~ m/$s2/ if $s2 has the
> UTF8 flag on and $s1 hasn't ? Does this work as expected ?
>
> Thomas.
>
> > On Tue, Aug 08, 2006 at 03:27:54PM +0200, Thomas Busch wrote:
> >
> > > maybe someone can help on the following UTF-8 issue
> > > which left a few perl engineers angry and frustrated.
> > > As a matter of fact in my office UTF-8 is currently a
> > > swear word.
> > >
> > > I'm using perl 5.8.6 and for some strange reason the
> > > following program:
> > >
> > > #!/usr/bin/perl
> > >
> > > use strict ;
> > >
> > > my $string = "cl\xe9ment";
> > >
> > > utf8::upgrade($string);
> > >
> > > if (utf8::is_utf8($string)) {
> > > print "is utf8\n";
> > > }
> > >
> > > if (utf8::valid($string)) {
> > > print "is valid utf8\n";
> > > }
> > >
> > > if ($string =~ m/\xe9/) {
> > > print "match \\xE9\n";
> > > }
> > >
> > > if ($string =~ m/\x{c3a9}/) {
> > > print "match \\xC3A9\n";
> > > }
> > >
> > > yields
> > >
> > > is utf8
> > > is valid utf8
> > > match \xE9
> > >
> > > instead of
> > >
> > > is utf8
> > > is valid utf8
> > > match \xC3A9
> > >
> > > Is this a bug ? Why is the latin e letter with acute not
> >
> > No.
> >
> > > getting upgraded to UTF-8 ?
> >
> > It is.
> >
> > You're misunderstanding how things work.
> > \x{c3a9} is (apparently) the character HANGUL SYLLABLE SSOB
> >
> > [modulo real bugs, which are now minimal in the core, but not necessarily
> > minimal in CPAN modules]:
> >
> > String comparison and matching is done on a character by character
> > basis.(*)
> > In Perl space you're dealing with characters. The encoding used internally
> > should be treated as a black box.
> >
> > utf8::upgrade() doesn't do any character to byte conversion in perl space.
> > It does something to the black box, related to the bug marked '*'
> >
> > If what you want is to convert a Perl string (which is Unicode characters)
> > into a sequence of UTF-8 bytes for passing to something external, you
> > want utf8::encode().
> >
> > In an ideal world the bug marked '*' would not have existed, nor would
> > utf8::upgrade() and utf8::downgrade().
> >
> > Possibly the two should be deprecated for 5.10, with alternatives provided
> > in the Internals:: namespace.
> >
> > Nicholas Clark
> >
> > * Apart from the big implementation bug that is hard to fix. If a string
> > internally is encoded in UTF-8, then it is also treated with Unicode
> > semantics. For example code point 0xE9 is recognised as a lower case
> > E acute, and will uppercase to É, and it will match classes like \w in
> > the regexp engine. Whereas if the same string is stored as raw bytes,
> > that byte is not treated as a letter - only the ASCII letters are
> > letters, etc.
> >
> > There is a single flag bit in the internals used both to mean
> > "characters to be treated as Unicode" and "characters stored as UTF-8".
> > If two bits had been used, there would be less confusion. Also, if 5.6
> > hadn't been as buggy, there would be less confusion.
> >
>
>
>
>
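As a rough illustration of the distinction drawn in the quoted reply, the sketch below contrasts utf8::upgrade(), which only changes the internal storage, with utf8::encode(), which turns the characters into UTF-8 bytes. The variable names are made up for the example. The last part probes the flag-dependent \w behaviour marked '*' above; its result is deliberately not stated here, since it depends on the perl version and on whether something like the unicode_strings feature or the /u regexp flag is in effect.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $chars = "cl\xe9ment";        # 7 characters; \xe9 is code point U+00E9

# utf8::upgrade() changes the internal representation only.
# In Perl space it is still the same 7 characters.
my $upgraded = $chars;
utf8::upgrade($upgraded);
print "still equal after upgrade\n" if $upgraded eq $chars;
print "upgraded length: ", length($upgraded), "\n";     # 7
print "upgraded matches \\xE9\n"    if $upgraded =~ /\xe9/;

# utf8::encode() replaces the characters with their UTF-8 bytes,
# which is what you want before handing data to something external.
my $bytes = $chars;
utf8::encode($bytes);
print "encoded length: ", length($bytes), "\n";         # 8
print "encoded matches \\xC3\\xA9\n" if $bytes =~ /\xc3\xa9/;

# The bug marked '*': the same character may or may not count as a
# word character, depending on how the string happens to be stored.
my $e = "\xe9";
my $before = ($e =~ /\w/) ? "yes" : "no";
utf8::upgrade($e);
my $after  = ($e =~ /\w/) ? "yes" : "no";
print "\\w before upgrade: $before, after upgrade: $after\n";
```

Note that the match against the encoded string uses two separate byte escapes, \xc3\xa9, and not \x{c3a9}, which names a single code point.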
More information about the london.pm
mailing list