Character encodings and databases

Thu Jun 19 19:48:20 BST 2014

On 19/06/2014 15:58, Andrew Hill wrote:
> My code has an extremely annoying bug that I can't quite solve.
>
> The concept is simple - read some text from a text file; update a 
> database table based on that text.
>
> The text file is UTF8 and the database is Oracle 11g.
>
> I am reading the file with a normal
> open FILE, "<blah";
> while(<FILE>) {
>     chomp;
>     $foo = $_;
> }

This code is likely to lead to problems.

If you open a UTF-8 file without specifying a layer, then when you read 
the file you will get downgrade strings containing the UTF-8 bytes, 
whereas what you want is upgraded strings containing the characters.

So in the case of a file containing a UTF-8 representation of "Zürich", 
$foo will be left with "Z\xC3\xBCrich" (and it will be a downgrade 
string, i.e. with Perl's internal UTF-8 flag off).

Consider instead doing something like this:

my $filename = 'blah;
open my $file, '<:encoding(UTF-8)', $filename or die "Can't open 
$filename: $!";

In the case of a file containing a UTF-8 representation of "Zürich", the 
resulting will be a string "Z\xFCrich" (and it will be an upgraded 
string, i.e. with Perl's internal UTF-8 flag on), which is what you want.

> Then I select the VARCHAR2 field from the table into $bar, do a 
> straight string comparison between $foo and $bar, and if they are 
> different, I update the table with the value of $foo and output a 
> debugging line to say that, for example, Z<splodge>rich has been 
> updated to Zürich.

It looks like you are using Oracle. (Probably best to state that clearly 
in questions like this.)

DBD::Oracle does "interesting" things when you try to send downgrade 
strings to the database. Empirically, it seems to treat the downgrade 
string as a UTF-8 byte sequence. So in your case this will mean you 
accidentally end up writing the "right" thing to the database.

> However, the next time I read Zürich from the file, I get exactly the 
> same behaviour, ie $bar is again Z<splodge>rich, therefore $foo ne 
> $bar and it updates the table again. I don't understand why $foo ne 
> $bar, given I've just set the field to $foo.

When you read it back, you will get a string "Z\xFCrich", which will 
compare different to "Z\xC3\xBCrich", and from what you describe your 
program will update the database again.

You may also be getting strange behaviour with your debug output, and 
the pathway between your data and your monitor may not treating UTF-8 in 
a way which is consistently useful. Consider passing your debug output 
through Data::Dump::pp so that you can properly see what's going on.

> So, as I see it, these are the possible causes:
> 1. Data is not being stored in the database as UTF8 - not sure how to 
> check when Perl is the only tool available to query it

It is and it isn't. If you want to see what's really in your strings 
then you can use:

use feature 'say';
use Data::Dump 'pp';

say '$x contains ', pp($x);
say '$x is ', utf8::is_utf8($x) ? 'upgraded' : 'downgrade';

> 2. Conversion is occuring in the DBD driver
> 3. Something else because I've been staring at it for so long
>
> FWIW, NLS_CHARACTERSET is AL32UTF8 and $ENV{NLS_LANG} is 
> AMERICAN_AMERICA.AL32UTF8

Brilliant! You appear to have set the correct options for getting 
DBD::Oracle to do Unicode (reasonably) properly. That is actually the 
hard part :-)

Though you need to ensure that $ENV{NLS_LANG} is set to a suitable 
AL32UTF8 option fairly early on.

I say "(reasonably) properly" because everything is fine with 
DBD::Oracle provided all your strings are upgraded. If you accidentally 
pass a downgrade string to DBD::Oracle then strange things happen.

Most of the time you can just get away with it, because most properly 
patrolled borders end up generating upgraded strings anyway, even when 
the text is entirely in the Latin-1 range.

To be sure, use utf8::upgrade on all strings which you want to pass into 
DBD::Oracle. Alternatively, get DBD::Oracle fixed so that it does this 
for you.

Regards,

Bill