[OT] Encode woes

Wed Sep 23 18:26:13 BST 2009

It appears that, with the increasing prevalence of 5.10, perl is getting 
pickier about whether the data it handles is utf8 or not.

I have a well established, networked, app that has upwards of 250 nodes 
and about 4000 users at one time (on certain weekends double that) all 
over the world. These users are running mainly windows based clients 
(which may include quite a lot of windows telnet). The nominal character 
set is ascii, as interpreted by the client's host operating system.

To date, I have managed to avoid the tribulations of Encode and utf8 et 
al. But I am now getting occasional errors, on 5.10 perl, of the ilk:-

  Wide character in null operation at /spider/perl/DXDupe.pm line 47.
  at /spider/perl/DXDupe.pm line 47
	DXDupe::find('X14163|UA0KEF|RZ6HV|�������  �������') called at 
/spider/perl/Spot.pm line 420

And also something similar on print or syswrite.

Studying the data, what I am receiving is a mixture of utf8 and 
iso-8859-*, the reason for this being that older perls happily take what 
they are given and just pass it along. Some clients are emitting utf8, 
others iso-8859 and yet others (running Win95/8) some kind of 
codepage. In addition, there are older, usually windows based, packages 
acting as nodes, together with yet more clients that are also adding 
data to this network in who knows what character set.

Up until recently, this has not been a problem because the important 
stuff is in 7 bit ascii and the remarks section (the usual source of 
problems), if it is unreadable, doesn't matter 'cos you can't translate 
it anyway.

Now, is there a reasonably reliable way of determining what we have, on 
a string by string basis, to at least tell whether we are dealing with 
utf8 or iso-8859 (not caring which variant), so that I can drive Encode 
appropriately and avoid crashes of the above type? Failing that, how do I 
completely switch off utf8 encoding/decoding - everywhere - in an 80,000 
line perl app?
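
To make the first question concrete, the sort of thing I am imagining is 
a rough sketch along these lines. It uses only the core Encode module; 
the name decode_guess is made up for illustration, and it assumes the 
input is still raw bytes straight off the socket (not already decoded). 
The idea is to attempt a strict UTF-8 decode and, if that fails, treat 
the bytes as ISO-8859-1, which accepts any byte value.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of the app): guess whether a byte
# string is UTF-8 or a single-byte charset and decode it accordingly.
# Assumes $bytes is raw octets, not an already-decoded string.
sub decode_guess {
    my ($bytes) = @_;

    # Pure 7 bit ascii reads the same under either encoding.
    return ('ascii', $bytes) unless $bytes =~ /[^\x00-\x7f]/;

    # Strict decode: FB_CROAK dies on any malformed sequence,
    # LEAVE_SRC stops decode() from modifying its input.
    my $chars = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ('utf8', $chars) if defined $chars;

    # Fallback: every byte is a valid ISO-8859-1 character, so this
    # cannot fail (though it may be the wrong 8859 variant or codepage).
    return ('iso-8859', decode('ISO-8859-1', $bytes));
}

my ($kind1, $text1) = decode_guess("caf\xc3\xa9");  # well-formed UTF-8
my ($kind2, $text2) = decode_guess("caf\xe9");      # lone 0xE9, not UTF-8
```

This is only a heuristic: a short iso-8859 string can, by coincidence, 
also be well-formed utf8, in which case it would be misread. Anything 
decoded this way would then need an explicit encode() on the way out 
again.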

Ta

Dirk