Should UTF-8 be a swear word?

Marvin Humphrey marvin at
Tue Aug 8 16:11:21 BST 2006

On Aug 8, 2006, at 6:50 AM, Nicholas Clark wrote:
> * Apart from the big implementation bug that is hard to fix. If a string
>   internally is encoded in UTF-8, then it is also treated with Unicode
>   semantics. For example code point 0xE9 is recognised as a lower case
>   E acute, and will uppercase to É, and it will match classes like \w in
>   the regexp engine. Whereas if the same string is stored as raw bytes, that
>   byte is not treated as a letter - only the ASCII letters are letters, etc.
>   There is a single flag bit in the internals used both to mean "characters
>   to be treated as Unicode" and "characters stored as UTF-8". If two bits
>   had been used, there would be less confusion. Also, if 5.6 hadn't been as
>   buggy, there would be less confusion.

What drives me nuts is what happens when two scalars, one with the  
UTF8 flag and one without, are concatenated.  Silently upgrading the  
non-UTF8 scalar is excessively helpful, IMO -- I'd have preferred a  
warning.  It took me a long time to figure out why a serialization  
routine I'd written was producing garbage.  :(
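
One way to sidestep the silent upgrade in a serialization routine is to encode the text to octets before joining it with packed data, so that both operands are plain byte strings. The following is only a sketch under that assumption: serialize() is a hypothetical helper, not code from the routine mentioned above, and UTF-8 is assumed as the on-disk encoding.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: join a text string and a packed binary string
# without triggering an implicit upgrade of the binary part.
sub serialize {
    my ($text, $binary) = @_;
    # Turn characters into UTF-8 octets; the result is a plain byte string.
    my $text_octets = encode('UTF-8', $text);
    return $text_octets . $binary;
}

# Contains a code point above 0xFF, so Perl stores it with the UTF8 flag on.
my $text       = "caf\x{e9} \x{263A}";
my $packed_num = pack('N', 128);    # four raw bytes: 0 0 0 128

my $naive = $text . $packed_num;    # $packed_num is silently upgraded here
my $safe  = serialize($text, $packed_num);

printf "naive is flagged UTF8: %s\n", utf8::is_utf8($naive) ? "yes" : "no";
printf "safe is flagged UTF8:  %s\n", utf8::is_utf8($safe)  ? "yes" : "no";
printf "safe ends with: %s\n", join ' ', unpack('C4', substr($safe, -4));
```

With the explicit encode step the result carries no UTF8 flag and its last four octets are still 0 0 0 128. The naive concatenation is flagged, which is the state in which a byte of 128 or above can end up written out as two octets.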

This is impossible to change now, right?

Marvin Humphrey
Rectangular Research


use strict;
use warnings;

# utf8_concat.plx -- concat utf8 and non-utf8 strings

use Encode qw( _utf8_on );

printf("%-20s%s\n", "STRING", "OCTETS");

my $foo = "foo";
print_octets('"foo"', $foo);

my $packed_num = pack('N', 128);
print_octets("packed num:", $packed_num);

print_octets("non-utf8 concat:", $foo . $packed_num);

# Turn on the UTF8 flag for $foo so that concatenating upgrades $packed_num.
_utf8_on($foo);
print_octets("utf8 concat:", $foo . $packed_num);

sub print_octets {
     my ($label, $string) = @_;
     printf("%-20s", $label);
     my @octets = unpack('C*', $string);
     print "@octets\n";
}
