Character encodings and databases

William Blunn bill+london.pm at blunn.org
Fri Jun 20 20:07:05 BST 2014


On 20/06/2014 09:37, Andrew Hill wrote:
> On Thu, 19 Jun 2014, William Blunn wrote:
>
>> When you read it back, you will get a string "Z\xFCrich", which will 
>> compare different to "Z\xC3\xBCrich", and from what you describe your 
>> program will update the database again.
>
> Indeed.
>
> I have done what I should have done in the first place and replicated 
> the problem in a simple form rather than try and describe what the 
> original code was doing. So:
>
> $ENV{NLS_LANG} = "AMERICAN_AMERICA.AL32UTF8";
> $ENV{ORACLE_SID} = "blah";
> $ENV{ORAPIPES} = "V2";
> $ENV{ORACLE_HOME} = "/home/oracle/app/oracle/product/11.2.0/dbhome_1";
> my $dbh = DBI->connect("dbi:Oracle:host=127.0.0.1;sid=blah", "blah", 
> "blah", {ora_charset => 'AL32UTF8'});
> $dbh->{AutoCommit} = 0; $dbh->{RaiseError} = 1; $dbh->{PrintError} = 1;
> $dbh->{FetchHashKeyName} = 'NAME_lc';
> my $zurich = "Zürich";

OK.

Stop, stop, stop, stop, stop.

Whilst Perl has perfectly sane and well-behaved semantics for UTF-8 in 
source code, we have no idea whether you have specified "use utf8;" at 
the top of your source file, or how your source code got from the file 
to your e-mail program. So we cannot know for sure what data will end up 
in the variable $zurich.

If you /really/ know what you are doing, you /can/ put Unicode 
characters in source files, but it is generally not recommended.

The quick answer which will lead to the least confusion is to make your 
Perl source files be entirely ASCII. Do this as the default, and you 
will save yourself and others a lot of headaches.

Right here, right now, this is one massive source of confusion that we 
could really really do without.

Change this line for these two lines:

my $zurich = "Z\xFCrich";
utf8::upgrade($zurich);

This will, quickly and reliably, set up $zurich with data which is 
representative of what you might get out of a properly configured I/O layer.

Don't skip the second line. You shouldn't need to explicitly upgrade 
normally, but right here, right now, it will ensure that things are 
correct and make things work correctly.

> $dbh->do("delete from foo");
> my $sth = $dbh->prepare("INSERT INTO foo (bar) values (?)");
> $sth->execute($zurich);
> $dbh->commit;
> $sth = $dbh->prepare("select bar from foo");
> $sth->execute;
> my$foo = ($sth->fetchrow_array)[0];
> $sth->finish;
> $dbh->disconnect;
> print HexDump $foo;
> print HexDump $zurich;

You haven't shown the code which allows the reader to determine what 
HexDump is. I am guessing that it emits the individual codepoints in the 
input string in hexadecimal.

> This outputs:
>
>           00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F 
> 0123456789ABCDEF
>
> 00000000  5A FC 72 69 63 68 Z.rich
>           00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F 
> 0123456789ABCDEF
>
> 00000000  5A C3 BC 72 69 63 68 Z..rich

Whilst I can't be sure, it looks --- again --- like the database is 
doing the right thing, but you are setting up your input data incorrectly.

I /suspect/ that your text editor is working in UTF-8, but you have not 
specified "use utf8;" at the top of your source file.

Thus, the variable $zurich will be initialised to "Z\xC3\BCrich" (which 
is not what you want), and it will be a downgrade string (which is not 
what you want).

When you write this to Oracle, you will get the odd behaviour which 
happens when it is passed a downgrade string, viz. it will treat the 
string as a UTF-8 byte sequence.

This will accidentally result in the desired string "Zürich" being 
written to the database.

When you read it back, you get "Z\FCrich", which is correct.

> And:
>
> SQL> select dump(bar) from foo;
>
> DUMP(BAR)
> -------------------------------------------------------------------------------- 
>
> Typ=1 Len=7: 90,195,188,114,105,99,104
>
>
> So I know that the string is stored correctly in the table, 
> accidentally or otherwise.

Yes, that looks like the right underlying byte sequence is in the database.

Here is a test I did against an Oracle instance which is properly 
configured with AL32UTF8:

CREATE TABLE "foo" ( "bar" VARCHAR2(100) );
INSERT INTO "foo" ("bar") VALUES ("Zürich");
SELECT "bar", LENGTH("bar"), DUMP("bar") FROM "foo";

bar : Zürich
LENGTH("bar") : 6
DUMP("bar") : Typ=1 Len=7: 90,195,188,114,105,99,104

Try my SELECT against your database to confirm the (character) length as 6.

So we can see that there are 6 characters in 7 bytes.

> I know that the string is being stored correctly in $zurich,

I don't think you do. In fact your Hexdump even shows that it's being 
represented as a UTF-8 sequence rather than the proper sequence of 
codepoints.

So it seems that the opposite is true and the string is being stored 
incorrectly in $zurich.

> as I have eliminated IO.

No.

When Perl reads your source file that is also I/O. Without "use utf8;", 
any UTF-8 text will be read as UTF-8 byte sequences rather than Unicode 
strings.

> It appears it is specifically the act of fetching the row that causes 
> the "wrong" data to be returned.

No. Fetching the the data appears to be working correctly.

> What have I missed?

I think perhaps you think that you expect to see UTF-8 sequences inside 
Perl. This is wrong. Inside Perl, strings should be a sequence of 
(Unicode) codepoints.

I/O is typically byte-oriented. When you read data from outside Perl, 
you get a UTF-8 byte sequence. This should be converted by an I/O layer 
into a Perl (Unicode) string.

When you write a string out of Perl, the Perl (Unicode) string will be 
converted by an I/O layer into a UTF-8 byte sequence.

Fundamentally, it is straightforward.

Try the change I suggested above and see if your test program makes more 
sense.

One other thing you might like to put at the top of your test programs 
is this:

binmode STDOUT, ':encoding(UTF-8)';

Then provided you terminal/console program is configured for UTF-8, then 
any strings you print should be displayed correctly.

Regards,

Bill


More information about the london.pm mailing list