[OT] Encode woes
djk at tobit.co.uk
Wed Sep 23 18:26:13 BST 2009
It appears that, with the increasing prevalence of 5.10, the usage of
utf8 or not is getting more picky.
I have a well established, networked, app that has upwards of 250 nodes
and about 4000 users at one time (on certain weekends double that) all
over the world. These users are running mainly windows based clients
(which may include quite a lot of windows telnet). The nominal character
set is ascii, as interpreted by the client's host operating system.
To date, I have managed to avoid the tribulations of Encode and utf8 et
al. But I am now getting occasional errors, on 5.10 perl, of the ilk:-
Wide character in null operation at /spider/perl/DXDupe.pm line 47.
at /spider/perl/DXDupe.pm line 47
DXDupe::find('X14163|UA0KEF|RZ6HV|������� �������') called at
/spider/perl/Spot.pm line 420
And also something similar on print or syswrite.
Studying the data, what I am receiving is a mixture of utf8 and
iso-8859-*, the reason for this being that older perls happily take what
they are given and just pass it along. Some clients are emitting utf8
and others iso-8859 and yet others (running Win95/8) some kind of
codepage. In addition, there are older, usually windows based, packages
acting as nodes, together with yet more clients that are also adding
data to this network in who knows what character set.
Up until recently, this has not been a problem because the important
stuff is in 7 bit ascii and the remarks section (the usual source of
problems), if it is unreadable, doesn't matter 'cos you can't translate
Now, is there a reasonably reliable way of determining what we have, on
a string by string basis, to at least tell whether we are dealing with
utf8 or iso-8859 (not caring which variant) so that I can drive Encode
appropriately to avoid crashes of the above type? Or how do I
completely switch off utf8 encoding/decoding - everywhere - in an 80,000
line perl app?
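
For illustration only, one common heuristic for the "utf8 or iso-8859" question is to attempt a strict UTF-8 decode and fall back to ISO-8859-1 when that fails. The sketch below assumes the input is still raw octets as read from the socket; guess_decode and to_wire are hypothetical names, not anything in the app described above. It cannot tell the iso-8859 variants or Windows codepages apart, and an 8-bit string that happens to be valid UTF-8 will be treated as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: turn raw octets into a Perl character string.
# Assumes $octets has not already been decoded elsewhere.
sub guess_decode {
    my ($octets) = @_;

    # Pure 7-bit data is the same in ASCII, UTF-8 and iso-8859-*.
    return $octets unless $octets =~ /[\x80-\xff]/;

    # Strict decode: FB_CROAK makes malformed UTF-8 die instead of
    # being silently substituted; LEAVE_SRC keeps $octets unmodified.
    my $text = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;

    # Not valid UTF-8: assume an 8-bit set. iso-8859-1 maps every byte
    # to a code point, so this cannot fail (it may pick the wrong glyphs).
    return decode('iso-8859-1', $octets);
}

# Hypothetical helper: encode characters back to octets before
# print/syswrite, which is what avoids "Wide character" warnings there.
sub to_wire {
    my ($text) = @_;
    return encode('UTF-8', $text);
}
```

Used at the edges only (decode on read, encode on write), something like this keeps the rest of the code working on character strings. Whether forcing UTF-8 on output suits clients that expect iso-8859 or a codepage is a separate decision that this sketch does not address.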