Should UTF-8 be a swear word ?

Tatsuhiko Miyagawa miyagawa at gmail.com
Wed Aug 9 03:10:55 BST 2006


On 8/8/06, Thomas Busch <tbusch at cpan.org> wrote:
> my $string = "cl\xe9ment";
>
> utf8::upgrade($string);

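1) utf8::upgrade only changes how Perl stores the string internally
(it switches the internal representation to utf-8). It *doesn't*
change the characters the string contains, so $string still holds the
single character U+00E9, not the bytes \xc3\xa9. Without the upgrade,
a string whose characters all fit in 0..255 is normally stored as
latin-1, and one with characters above 255 as utf-8. Anyway, you
shouldn't rely on the internal encoding.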
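
To make point 1 concrete, here is a small sketch (the variable names
are mine, not from your code) that inspects the string before and
after utf8::upgrade. It assumes a plain script without "use utf8", and
it only peeks at utf8::is_utf8 for illustration, which is not
something real code should depend on:

```perl
use strict;
use warnings;

my $string = "cl\xe9ment";                 # 7 characters, e-acute is U+00E9

my $before_flag = utf8::is_utf8($string) ? 1 : 0;
my $before_len  = length($string);

utf8::upgrade($string);                    # changes internal storage only

my $after_flag = utf8::is_utf8($string) ? 1 : 0;
my $after_len  = length($string);

# The logical content is unchanged: still U+00E9, never \xc3\xa9.
my $matches_e9   = ($string =~ /\xe9/)     ? 1 : 0;
my $matches_c3a9 = ($string =~ /\xc3\xa9/) ? 1 : 0;

print "flag:   $before_flag -> $after_flag\n";   # flag:   0 -> 1
print "length: $before_len -> $after_len\n";     # length: 7 -> 7
print "matches \\xe9: $matches_e9, matches \\xc3\\xa9: $matches_c3a9\n";
```

The length and the regex results are the same before and after, which
is why the upgrade alone can't make /\xc3\xa9/ match.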

2) \x{c3a9} actually refers to the Unicode character U+C3A9, not the
utf-8 bytes \xc3\xa9.
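
A quick way to see the difference in point 2 (again just a sketch,
with made-up variable names) is to compare the length and code points
of the two escapes:

```perl
use strict;
use warnings;

my $one_char  = "\x{c3a9}";    # a single character, code point U+C3A9
my $two_bytes = "\xc3\xa9";    # two characters, U+00C3 followed by U+00A9

my $one_len = length($one_char);
my $two_len = length($two_bytes);

printf "\\x{c3a9}:  length %d, code point U+%04X\n",
    $one_len, ord($one_char);
printf "\\xc3\\xa9: length %d, code points U+%04X U+%04X\n",
    $two_len, map { ord } split //, $two_bytes;
```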

That said, try this instead:

  my $string = "cl\x{e9}ment";
  utf8::encode($string);

  if ($string =~ /\xc3\xa9/) {
      print "match \\xc3\\xa9\n";
  }
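
If it helps, this is roughly what utf8::encode does to the string, and
how utf8::decode gets you back to characters. It's only an
illustrative sketch and the variable names are mine:

```perl
use strict;
use warnings;

my $string = "cl\x{e9}ment";
my $char_len = length($string);        # 7 characters

utf8::encode($string);                 # now a byte string of utf-8 octets
my $byte_len = length($string);        # 8 bytes: e-acute became \xc3\xa9
my $matches_bytes = ($string =~ /\xc3\xa9/) ? 1 : 0;

my $decoded_ok = utf8::decode($string) ? 1 : 0;   # back to characters
my $round_len  = length($string);
my $matches_char = ($string =~ /\x{e9}/) ? 1 : 0;

print "characters: $char_len, bytes after encode: $byte_len\n";
print "after decode: $round_len characters\n";
```

So match against \xc3\xa9 only while you're holding the encoded bytes,
and against \x{e9} while you're holding characters.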


-- 
Tatsuhiko Miyagawa

