Wide character in syswrite

Wed Dec 12 04:57:33 GMT 2007

On Tue, 2007-12-11 at 23:46 +0000, Robert Rothenberg wrote:
> On 11/12/07 17:40 Jonathan Rockway wrote:
> > On Tue, 2007-12-11 at 17:35 +0000, Matt Lawrence wrote:
> >> so I would guess the right thing to
> >> do is to Encode::decode('UTF-8', ...) before passing into the module.
> > 
> > No.  You should fix the module and send the author a patch.  It might
> > take 5 extra minutes, but you'll fix the problem *for every person in
> > the world* :)
> 
> The data is properly encoded in UTF-8, and Encode::is_utf8($strong, -1) says
> it's well-formed UTF-8.
> 
> So which module is broken? Net::Dict or Net::Cmd?  The error is associated
> with file handles not being configured for utf8, so I suspect the latter.

This is where the confusion comes in.  Encode::is_utf8 checks whether the
utf8 flag (in the SV structure) is on.  It says nothing about whether the
string holds UTF-8 octets.  If the flag is on, $strong is a string of wide
characters, not a string of octets.  So what you need to do is take this
string of characters and convert it to a string of octets.  You can do
this with utf8::encode or Encode::encode.  The result will be a string
with the utf8 flag *turned off*, full of *octets* that represent the
Unicode characters.
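
A minimal sketch of that conversion, assuming a one-character test string
of my own choosing (U+263A) rather than the original poster's data.  It
only reports the utf8 flag and the length before and after encoding:

```perl
use strict;
use warnings;
use Encode ();

# A character string: the code point is above 255, so Perl has to store
# it with the utf8 flag on.  (Illustrative value, not from the thread.)
my $chars  = "\x{263A}";
my $octets = Encode::encode('UTF-8', $chars);

my $chars_flag  = utf8::is_utf8($chars)  ? 1 : 0;   # flag on
my $octets_flag = utf8::is_utf8($octets) ? 1 : 0;   # flag off
my $chars_len   = length $chars;                    # counted in characters
my $octets_len  = length $octets;                   # counted in octets

printf "chars:  flag=%d length=%d\n", $chars_flag,  $chars_len;
printf "octets: flag=%d length=%d\n", $octets_flag, $octets_len;
```

The encoded copy is what you would hand to anything that writes to a
socket or file handle.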

Did I mention that this is confusing? :)

Quick example:

  my $var = <STDIN>;

By default, $var holds UTF-8 octets representing Unicode characters (if
the user types in UTF-8).  If you run a regex on $var, /./ will match
*one byte*, which is not the same as one character.  Perl doesn't know
about characters yet, though; it's just bytes.

  utf8::decode($var);

Now $var is a string of Perl characters, one per Unicode character.  If
you run a regex on this string, /./ will match one whole character, wide
or not.  This is the form you want $var to be in while it's in your
program.  Working with bytes doesn't make sense; you want characters.
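
To make the byte/character difference concrete, here is a small sketch.
The input is hard-coded (the two UTF-8 octets for U+00E9) instead of read
from STDIN, which is my assumption so that it runs anywhere:

```perl
use strict;
use warnings;

# Stand-in for a line read from STDIN on a UTF-8 terminal:
# the two octets that encode U+00E9 (e with acute accent).
my $var = "\xC3\xA9";

my @before      = $var =~ /(.)/gs;   # /./ matches each octet
my $octet_count = @before;

utf8::decode($var) or die "input was not valid UTF-8";

my @after      = $var =~ /(.)/gs;    # /./ now matches each character
my $char_count = @after;

print "matches before decode: $octet_count, after decode: $char_count\n";
```

Same data, same regex; only Perl's idea of what the string contains has
changed.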

  syswrite STDOUT, $var; # FAIL

This fails because you are trying to write perl characters to a
filehandle.  The outside world has no idea what perl characters are, so
you can't write perl characters to the outside world (and this code
dies).
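
If you want to see the failure without killing your program, a sketch
like this traps it with eval.  The test string is my own, and the error
goes to STDERR just for illustration:

```perl
use strict;
use warnings;

# A character string containing a wide character (code point above 255).
my $var = "\x{263A}\n";

# syswrite refuses wide characters and throws, so trap the exception.
my $ok    = eval { syswrite STDOUT, $var; 1 };
my $error = $ok ? '' : $@;

print STDERR "syswrite died: $error" unless $ok;
```

The message in $error is the same "Wide character in syswrite" from the
subject line.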

  utf8::encode($var);

Assuming we restarted the program one paragraph above... we're just
reversing decode here.  We're converting perl's internal characters into
UTF-8 octets that the real world (xterms, web browsers) can understand.

  syswrite STDOUT, $var;

Now this works, because we are writing octets to STDOUT instead of
characters.  Whatever's on the other side of STDOUT now has to take
those octets and convert them to characters, but we don't really care
because we're done.
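
If you do this in more than one place, it may be worth wrapping up.  A
rough sketch, where write_text is a name I just made up (it is not part
of Net::Cmd or any other module):

```perl
use strict;
use warnings;

# Hypothetical helper: encode a character string to UTF-8 octets and
# write them with syswrite.  Returns the number of octets written.
sub write_text {
    my ($fh, $text) = @_;
    my $octets = $text;        # copy, so the caller's string stays characters
    utf8::encode($octets);     # in place: characters -> UTF-8 octets
    my $written = syswrite $fh, $octets;
    die "syswrite failed: $!" unless defined $written;
    return $written;
}

# "cafe" with an accented e, a space, a smiley and a newline.
my $n = write_text(\*STDOUT, "caf\x{E9} \x{263A}\n");
```

Encoding a copy means the rest of the program keeps working with
characters and only the bytes on the wire are octets.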

One more way of looking at it:

   my $var = <STDIN>;  # binary (octets)
   utf8::decode($var); # text (characters)
   print $var;         # (*) not allowed, STDOUT only speaks binary
   utf8::encode($var); # $var is binary again
   print $var;         # and now we can print it

 (*) print can print wide chars, but with a warning.  Don't do it.
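
For the curious, a sketch that counts the warnings from the two prints.
The string and the __WARN__ handler are mine, purely for illustration:

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $var = "\x{263A}\n";    # characters, including a wide one
print STDOUT $var;         # goes through, but warns

utf8::encode($var);        # now octets
print STDOUT $var;         # no warning this time

my $warning_count = @warnings;
print STDERR "warnings seen: $warning_count\n";
```

Only the first print should complain.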

Hope this helps.  Perl's Unicode handling can be tricky.

Regards,
Jonathan Rockway