Should UTF-8 be a swear word?

Wed Aug 9 14:28:05 BST 2006

On 2006-08-08 at 18:18 +0200, Thomas Busch wrote:
> 3) no matter how $string is encoded, binmode STDOUT, ":utf8"
>    will force print "..." to always output in UTF-8. There will
>    be no double encoding.

Not if { use encoding 'foo' } has been used.  As well as declaring the
encoding that the script source is written in, that pragma also pushes
an encoding layer onto the stdio handles (STDIN and STDOUT).

The test below was run in a [ `locale charmap` = UTF-8 ] environment.

-----------------------------< cut here >-------------------------------
#!/usr/bin/perl
use warnings;
use strict;
use Encode;

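use encoding 'iso-8859-1'; # <---- also pushes an encoding layer on STDOUT

# Latin-1 bytes for "clement" with an e-acute
my $string = encode("iso-8859-1", "cl\xe9ment");

# Meant to force UTF-8 output, but lands on top of the encoding layer
binmode(STDOUT, ":utf8");

# Show the layer stack on STDOUT, then print the string through it
print join(':', PerlIO::get_layers(STDOUT)) . "\n";
print "$string\n";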
-----------------------------< cut here >-------------------------------

Result is:
	stdio:encoding(iso-8859-1):utf8
	clXent
where 'X' stands in for the substitute character that the terminal
displayed.  Comment out the "use encoding" line and the output is
correct.

The same thing happens with "-C" in place of the explicit binmode(): you
can't declare the encoding that the script itself is written in without
messing up IO.

You do still need either the -C or the binmode to get correct output in
a UTF-8 environment.

If you want to write your script in a non-ASCII encoding, declared with
{ use encoding 'whichever'; }, but use UTF-8 for stdio, then make sure
to also put in { binmode STDOUT, ':raw'; } to clear the IO layers before
you push the utf8 layer on.  Do the same for STDIN, etc.  And remember
that the :raw will also undo any layers put in place by the -C
interpreter switch.
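
A minimal sketch of that reset, under some assumptions of mine: an
in-memory filehandle stands in for STDOUT so the bytes can be inspected,
and an explicit :encoding(iso-8859-1) layer stands in for whatever
{ use encoding 'iso-8859-1'; } would push.  write_with_layers() is a
made-up helper name, not anything from Perl itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: write $text through a handle that already carries
# a Latin-1 encoding layer, optionally clearing it with :raw before the
# :utf8 layer goes on.  Returns the raw bytes written, then the layers.
sub write_with_layers {
    my ($text, $reset) = @_;
    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "open: $!";

    # Stand-in (assumption) for the layer that "use encoding" pushes
    binmode($fh, ':encoding(iso-8859-1)');

    # The suggested fix: drop the existing layers first
    binmode($fh, ':raw') if $reset;

    binmode($fh, ':utf8');
    my @layers = PerlIO::get_layers($fh);

    print {$fh} $text;
    close($fh) or die "close: $!";
    return ($buffer, @layers);
}

# Compare the layer stack and the bytes with and without the reset
for my $reset (0, 1) {
    my ($bytes, @layers) = write_with_layers("cl\x{e9}ment", $reset);
    printf "reset=%d  %s  %s\n",
        $reset, join(':', @layers), unpack('H*', $bytes);
}
```

With the reset, the hex dump should show the e-acute as the two UTF-8
bytes c3 a9, and no encoding(...) entry should remain in the layer list.
Without it, the Latin-1 layer is still in the stack underneath :utf8.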

All to the best of my understanding, which will probably now be ripped
to shreds.
-- 
VISTA: Viruses, Infections, Spyware, Trojans & Adware