Character encodings and databases
William Blunn
bill+london.pm at blunn.org
Thu Jun 19 19:48:20 BST 2014
On 19/06/2014 15:58, Andrew Hill wrote:
> My code has an extremely annoying bug that I can't quite solve.
>
> The concept is simple - read some text from a text file; update a
> database table based on that text.
>
> The text file is UTF8 and the database is Oracle 11g.
>
> I am reading the file with a normal
> open FILE, "<blah";
> while(<FILE>) {
> chomp;
> $foo = $_;
> }
This code is likely to lead to problems.
If you open a UTF-8 file without specifying a layer, then when you read
the file you will get downgrade strings containing the UTF-8 bytes,
whereas what you want is upgraded strings containing the characters.
So in the case of a file containing a UTF-8 representation of "Zürich",
$foo will be left with "Z\xC3\xBCrich" (and it will be a downgrade
string, i.e. with Perl's internal UTF-8 flag off).
Consider instead doing something like this:
my $filename = 'blah;
open my $file, '<:encoding(UTF-8)', $filename or die "Can't open
$filename: $!";
In the case of a file containing a UTF-8 representation of "Zürich", the
resulting will be a string "Z\xFCrich" (and it will be an upgraded
string, i.e. with Perl's internal UTF-8 flag on), which is what you want.
> Then I select the VARCHAR2 field from the table into $bar, do a
> straight string comparison between $foo and $bar, and if they are
> different, I update the table with the value of $foo and output a
> debugging line to say that, for example, Z<splodge>rich has been
> updated to Zürich.
It looks like you are using Oracle. (Probably best to state that clearly
in questions like this.)
DBD::Oracle does "interesting" things when you try to send downgrade
strings to the database. Empirically, it seems to treat the downgrade
string as a UTF-8 byte sequence. So in your case this will mean you
accidentally end up writing the "right" thing to the database.
> However, the next time I read Zürich from the file, I get exactly the
> same behaviour, ie $bar is again Z<splodge>rich, therefore $foo ne
> $bar and it updates the table again. I don't understand why $foo ne
> $bar, given I've just set the field to $foo.
When you read it back, you will get a string "Z\xFCrich", which will
compare different to "Z\xC3\xBCrich", and from what you describe your
program will update the database again.
You may also be getting strange behaviour with your debug output, and
the pathway between your data and your monitor may not treating UTF-8 in
a way which is consistently useful. Consider passing your debug output
through Data::Dump::pp so that you can properly see what's going on.
> So, as I see it, these are the possible causes:
> 1. Data is not being stored in the database as UTF8 - not sure how to
> check when Perl is the only tool available to query it
It is and it isn't. If you want to see what's really in your strings
then you can use:
use feature 'say';
use Data::Dump 'pp';
say '$x contains ', pp($x);
say '$x is ', utf8::is_utf8($x) ? 'upgraded' : 'downgrade';
> 2. Conversion is occuring in the DBD driver
> 3. Something else because I've been staring at it for so long
>
> FWIW, NLS_CHARACTERSET is AL32UTF8 and $ENV{NLS_LANG} is
> AMERICAN_AMERICA.AL32UTF8
Brilliant! You appear to have set the correct options for getting
DBD::Oracle to do Unicode (reasonably) properly. That is actually the
hard part :-)
Though you need to ensure that $ENV{NLS_LANG} is set to a suitable
AL32UTF8 option fairly early on.
I say "(reasonably) properly" because everything is fine with
DBD::Oracle provided all your strings are upgraded. If you accidentally
pass a downgrade string to DBD::Oracle then strange things happen.
Most of the time you can just get away with it, because most properly
patrolled borders end up generating upgraded strings anyway, even when
the text is entirely in the Latin-1 range.
To be sure, use utf8::upgrade on all strings which you want to pass into
DBD::Oracle. Alternatively, get DBD::Oracle fixed so that it does this
for you.
Regards,
Bill
More information about the london.pm
mailing list