Should UTF-8 be a swear word ?
Marvin Humphrey
marvin at rectangular.com
Tue Aug 8 16:11:21 BST 2006
On Aug 8, 2006, at 6:50 AM, Nicholas Clark wrote:
>
> * Apart from the big implementation bug that is hard to fix. If a string
> internally is encoded in UTF-8, then it is also treated with Unicode
> semantics. For example code point 0xE9 is recognised as a lower case
> E acute, and will uppercase to É, and it will match classes like \w in
> the regexp engine. Whereas if the same string is stored as raw bytes, that
> byte is not treated as a letter - only the ASCII letters are letters, etc.
>
> There is a single flag bit in the internals used both to mean "characters
> to be treated as Unicode" and "characters stored as UTF-8". If two bits
> had been used, there would be less confusion. Also, if 5.6 hadn't been as
> buggy, there would be less confusion.
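
A minimal sketch of the behaviour described in the quoted text, assuming a perl where the unicode_strings feature is not enabled and no locale is in effect. The variable names are illustrative and not from the thread. The same code point, 0xE9, is held once as a raw byte and once upgraded to the internal UTF-8 form:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same code point, two internal representations.
my $bytes    = "\xE9";        # one raw byte, UTF8 flag off
my $upgraded = "\xE9";
utf8::upgrade($upgraded);     # stored as UTF-8 internally, UTF8 flag on

# The two scalars still compare equal as strings.
print "equal:             ", ($bytes eq $upgraded ? "yes" : "no"), "\n";

# Only the flagged copy gets Unicode semantics for \w and uc().
print "bytes =~ \\w:       ", ($bytes    =~ /\w/ ? "yes" : "no"), "\n";
print "upgraded =~ \\w:    ", ($upgraded =~ /\w/ ? "yes" : "no"), "\n";
print "uc(bytes) changed: ", (uc($bytes)    ne $bytes    ? "yes" : "no"), "\n";
print "uc(upgr.) changed: ", (uc($upgraded) ne $upgraded ? "yes" : "no"), "\n";
```

Under those assumptions the two scalars should compare equal while only the upgraded one matches \w and changes under uc().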
What drives me nuts is what happens when two scalars, one with the
UTF8 flag and one without, are concatenated. Silently upgrading the
non-UTF8 scalar is excessively helpful, IMO -- I'd have preferred a
warning. It took me a long time to figure out why a serialization
routine I'd written was producing garbage. :(
This is impossible to change now, right?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
#---------------------------------------------------
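#!/usr/bin/perl
use strict;
use warnings;

# utf8_concat.plx -- concat utf8 and non-utf8 strings

use Encode qw( _utf8_on );

printf("%-20s%s\n", "STRING", "OCTETS");

my $foo = "foo";
print_octets('"foo"', $foo);

# Four raw bytes: 0, 0, 0, 128.
my $packed_num = pack('N', 128);
print_octets("packed num:", $packed_num);

# Neither scalar has the UTF8 flag, so the bytes pass through untouched.
print_octets("non-utf8 concat:", $foo . $packed_num);

# Flag $foo as UTF-8; the concat result is then UTF-8 as well, with the
# bytes contributed by $packed_num silently upgraded.
_utf8_on($foo);
print_octets("utf8 concat:", $foo . $packed_num);

# Print a label followed by the octets of $string in decimal.
sub print_octets {
    my ($label, $string) = @_;
    printf("%-20s", $label);
    my @octets = unpack('C*', $string);
    print "@octets\n";
}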
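
One way to sidestep the silent upgrade in a serialization routine is to turn any text into bytes explicitly before it meets packed binary data. The sketch below assumes the text field is meant to be written out as UTF-8. serialize_record is a hypothetical helper, not the routine mentioned above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

# Hypothetical helper: join a text field and a packed 32-bit integer
# into a single byte string.
sub serialize_record {
    my ($text, $num) = @_;
    # encode_utf8 always returns a byte string with the UTF8 flag off,
    # so the concatenation below has nothing to upgrade.
    my $text_bytes = encode_utf8($text);
    return $text_bytes . pack('N', $num);
}

my $text = "caf\xE9";
utf8::upgrade($text);    # simulate a scalar that arrived with the flag on

my $record = serialize_record($text, 128);
print "flagged: ", (utf8::is_utf8($record) ? "yes" : "no"), "\n";
print join(' ', map { ord } split //, $record), "\n";
```

Because both operands of the concatenation are plain byte strings, the packed integer should come out as the same four octets that pack produced.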