Character encodings and databases

Thu Jun 19 15:58:03 BST 2014

My code has an extremely annoying bug that I can't quite solve.

The concept is simple - read some text from a text file; update a database 
table based on that text.

The text file is UTF-8 and the database is Oracle 11g.

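I am reading the file with a normal open:
open(my $fh, '<', 'blah') or die "Cannot open blah: $!";
while (my $line = <$fh>) {
    chomp $line;
    $foo = $line;    # kept for comparison with the database value
}
close($fh);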

Then I select the VARCHAR2 field from the table into $bar, do a straight 
string comparison between $foo and $bar, and if they are different, I 
update the table with the value of $foo and output a debugging line to say 
that, for example, Z<splodge>rich has been updated to Zürich.

However, the next time I read Zürich from the file, I get exactly the same
behaviour, i.e. $bar is again Z<splodge>rich, therefore $foo ne $bar and it
updates the table again. I don't understand why $foo ne $bar, given that
I've just set the field to $foo.
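
One way this symptom could arise, sketched here purely as an illustration:
suppose $foo holds the raw UTF-8 bytes from the file while $bar comes back
from the driver as a decoded character string. Both of those are
assumptions, not something confirmed above. Under them, the two values
hold the same text yet compare as different:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Zürich" as raw UTF-8 bytes, as a plain open() with no layer would read it.
# (Assumption: this is what $foo looks like.)
my $foo = "Z\xC3\xBCrich";

# The same word as a decoded character string.
# (Assumption: this is what $bar looks like when fetched.)
my $bar = decode('UTF-8', $foo);

# Byte string versus character string: different lengths, and ne is true.
printf "length foo=%d, length bar=%d\n", length($foo), length($bar);
printf "foo ne bar: %s\n", ($foo ne $bar ? 'yes' : 'no');

# Decoding the file side first makes the comparison succeed.
printf "decoded foo eq bar: %s\n",
    (decode('UTF-8', $foo) eq $bar ? 'yes' : 'no');
```

If this turns out to be the cause, decoding the file input before the
comparison (or opening the file with an encoding layer) would stop the
repeated updates.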

So, as I see it, these are the possible causes:
1. Data is not being stored in the database as UTF-8 - not sure how to
check when Perl is the only tool available to query it
2. Conversion is occurring in the DBD driver
3. Something else because I've been staring at it for so long
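
A rough sketch of how causes 1 and 2 might be checked from Perl alone.
The helper names describe and dump_sql, and the table and column names in
the usage comment, are made up for illustration. describe shows whether a
string carries Perl's internal UTF-8 flag and lists its code points in
hex, so $foo and $bar can be compared without relying on the terminal.
dump_sql builds a query around Oracle's DUMP function, whose 1016 format
reports the stored bytes in hex along with the character set name.

```perl
use strict;
use warnings;

# Hypothetical helper: report the UTF-8 flag and each code point in hex.
sub describe {
    my ($s) = @_;
    my $flag = utf8::is_utf8($s) ? 'on' : 'off';
    my $hex  = join ' ', map { sprintf '%02X', ord } split //, $s;
    return "flag=$flag [$hex]";
}

# Hypothetical helper: SQL that asks Oracle for the stored bytes of a column.
# The table and column names are interpolated directly, so pass only
# trusted identifiers.
sub dump_sql {
    my ($table, $column) = @_;
    return "SELECT DUMP($column, 1016) FROM $table";
}

# Raw UTF-8 bytes, as read from a file with no encoding layer.
my $bytes = "Z\xC3\xBCrich";

# The same text decoded in place into a character string.
my $chars = $bytes;
utf8::decode($chars);

print describe($bytes), "\n";
print describe($chars), "\n";

# With a live handle the stored value could be inspected like this
# (placeholder names, not run here):
#   my ($dump) = $dbh->selectrow_array(dump_sql('my_table', 'my_column'));
print dump_sql('my_table', 'my_column'), "\n";
```

If the DUMP output shows c3,bc for the ü, the data is stored as UTF-8 and
the mismatch is more likely on the Perl side of the comparison.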

FWIW, NLS_CHARACTERSET is AL32UTF8 and $ENV{NLS_LANG} is 
AMERICAN_AMERICA.AL32UTF8

Cheers,
Andrew