Character encodings and databases
William Blunn
bill+london.pm at blunn.org
Fri Jun 20 20:07:05 BST 2014
On 20/06/2014 09:37, Andrew Hill wrote:
> On Thu, 19 Jun 2014, William Blunn wrote:
>
>> When you read it back, you will get a string "Z\xFCrich", which will
>> compare different to "Z\xC3\xBCrich", and from what you describe your
>> program will update the database again.
>
> Indeed.
>
> I have done what I should have done in the first place and replicated
> the problem in a simple form rather than try and describe what the
> original code was doing. So:
>
> $ENV{NLS_LANG} = "AMERICAN_AMERICA.AL32UTF8";
> $ENV{ORACLE_SID} = "blah";
> $ENV{ORAPIPES} = "V2";
> $ENV{ORACLE_HOME} = "/home/oracle/app/oracle/product/11.2.0/dbhome_1";
> my $dbh = DBI->connect("dbi:Oracle:host=127.0.0.1;sid=blah", "blah",
> "blah", {ora_charset => 'AL32UTF8'});
> $dbh->{AutoCommit} = 0; $dbh->{RaiseError} = 1; $dbh->{PrintError} = 1;
> $dbh->{FetchHashKeyName} = 'NAME_lc';
> my $zurich = "Zürich";
OK.
Stop, stop, stop, stop, stop.
Whilst Perl has perfectly sane and well-behaved semantics for UTF-8 in
source code, we have no idea whether you have specified "use utf8;" at
the top of your source file, or how your source code got from the file
to your e-mail program. So we cannot know for sure what data will end up
in the variable $zurich.
If you /really/ know what you are doing, you /can/ put Unicode
characters in source files, but it is generally not recommended.
The quick answer which will lead to the least confusion is to make your
Perl source files be entirely ASCII. Do this as the default, and you
will save yourself and others a lot of headaches.
Right here, right now, this is one massive source of confusion that we
could really really do without.
Change this line for these two lines:
my $zurich = "Z\xFCrich";
utf8::upgrade($zurich);
This will, quickly and reliably, set up $zurich with data which is
representative of what you might get out of a properly configured I/O layer.
Don't skip the second line. You shouldn't need to explicitly upgrade
normally, but right here, right now, it will ensure that things are
correct and make things work correctly.
> $dbh->do("delete from foo");
> my $sth = $dbh->prepare("INSERT INTO foo (bar) values (?)");
> $sth->execute($zurich);
> $dbh->commit;
> $sth = $dbh->prepare("select bar from foo");
> $sth->execute;
> my$foo = ($sth->fetchrow_array)[0];
> $sth->finish;
> $dbh->disconnect;
> print HexDump $foo;
> print HexDump $zurich;
You haven't shown the code which allows the reader to determine what
HexDump is. I am guessing that it emits the individual codepoints in the
input string in hexadecimal.
> This outputs:
>
> 00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F
> 0123456789ABCDEF
>
> 00000000 5A FC 72 69 63 68 Z.rich
> 00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F
> 0123456789ABCDEF
>
> 00000000 5A C3 BC 72 69 63 68 Z..rich
Whilst I can't be sure, it looks --- again --- like the database is
doing the right thing, but you are setting up your input data incorrectly.
I /suspect/ that your text editor is working in UTF-8, but you have not
specified "use utf8;" at the top of your source file.
Thus, the variable $zurich will be initialised to "Z\xC3\BCrich" (which
is not what you want), and it will be a downgrade string (which is not
what you want).
When you write this to Oracle, you will get the odd behaviour which
happens when it is passed a downgrade string, viz. it will treat the
string as a UTF-8 byte sequence.
This will accidentally result in the desired string "Zürich" being
written to the database.
When you read it back, you get "Z\FCrich", which is correct.
> And:
>
> SQL> select dump(bar) from foo;
>
> DUMP(BAR)
> --------------------------------------------------------------------------------
>
> Typ=1 Len=7: 90,195,188,114,105,99,104
>
>
> So I know that the string is stored correctly in the table,
> accidentally or otherwise.
Yes, that looks like the right underlying byte sequence is in the database.
Here is a test I did against an Oracle instance which is properly
configured with AL32UTF8:
CREATE TABLE "foo" ( "bar" VARCHAR2(100) );
INSERT INTO "foo" ("bar") VALUES ("Zürich");
SELECT "bar", LENGTH("bar"), DUMP("bar") FROM "foo";
bar : Zürich
LENGTH("bar") : 6
DUMP("bar") : Typ=1 Len=7: 90,195,188,114,105,99,104
Try my SELECT against your database to confirm the (character) length as 6.
So we can see that there are 6 characters in 7 bytes.
> I know that the string is being stored correctly in $zurich,
I don't think you do. In fact your Hexdump even shows that it's being
represented as a UTF-8 sequence rather than the proper sequence of
codepoints.
So it seems that the opposite is true and the string is being stored
incorrectly in $zurich.
> as I have eliminated IO.
No.
When Perl reads your source file that is also I/O. Without "use utf8;",
any UTF-8 text will be read as UTF-8 byte sequences rather than Unicode
strings.
> It appears it is specifically the act of fetching the row that causes
> the "wrong" data to be returned.
No. Fetching the the data appears to be working correctly.
> What have I missed?
I think perhaps you think that you expect to see UTF-8 sequences inside
Perl. This is wrong. Inside Perl, strings should be a sequence of
(Unicode) codepoints.
I/O is typically byte-oriented. When you read data from outside Perl,
you get a UTF-8 byte sequence. This should be converted by an I/O layer
into a Perl (Unicode) string.
When you write a string out of Perl, the Perl (Unicode) string will be
converted by an I/O layer into a UTF-8 byte sequence.
Fundamentally, it is straightforward.
Try the change I suggested above and see if your test program makes more
sense.
One other thing you might like to put at the top of your test programs
is this:
binmode STDOUT, ':encoding(UTF-8)';
Then provided you terminal/console program is configured for UTF-8, then
any strings you print should be displayed correctly.
Regards,
Bill
More information about the london.pm
mailing list