character set detection?

Dominic Mitchell dom at happygiraffe.net
Sun Jan 7 13:13:12 GMT 2007


Dirk Koopman wrote:
> On Sat, 2007-01-06 at 23:28 +0000, Ash Berlin wrote:
>> Dirk Koopman wrote:
>>> Is there a way of, reasonably reliably, determining what the character
>>> set of a lump of text is?
>>>
>>>   
>> In a(n unhelpful) word: No. Not in a 100% reliable way anyway.
>>
>> Might want to look at http://icu.sourceforge.net/ - it has heuristics to 
>> do it (I think.)
> 
> What I have is a legacy app that was mainly put together in perl 5.004
> days. It is effectively a specialised chat daemon. People connect to one
> of a couple/four hundred nodes spread around the world using telnet or
> something similar. The node software runs on *n[iu]x boxes and windows. 
> 
> I am rewriting the core internode protocol and I want to be able to
> convert people's input to UTF8. All input is currently in some variety
> of ASCII (may include windows 3.11 codepages) or some kind of windows /
> iso-8859 code set. Just for fun, some of it appears to be in UTF8
> already! There are no asian / non-western character sets in use (at the
> moment). 
> 
> Does this reduce the locus of search at all?

That helps a bit.  In the past, I've done something similar to:

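   use Encode qw( decode FB_CROAK LEAVE_SRC );
   # Try strict UTF-8 first; LEAVE_SRC keeps $input intact for the fallback.
   my $chars = eval { decode('UTF-8', $input, FB_CROAK | LEAVE_SRC) };
   $chars = decode('cp1252', $input) unless defined $chars;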

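That is, treat the input as UTF-8 by default (which *is* reliably 
recognisable, and also catches plain ASCII), and failing that, treat it 
as Windows-1252, which is (more-or-less) a superset of ISO-8859-1: the 
two differ only in the 0x80-0x9F range, where ISO-8859-1 has control 
codes and Windows-1252 has printable characters.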
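
A slightly fuller sketch of the same fallback, wrapped in a sub that also 
reports which encoding it settled on. The name decode_guess is made up 
for illustration, and it assumes the argument holds raw bytes exactly as 
they were read from the connection:

```perl
use strict;
use warnings;
use Encode qw( decode FB_CROAK LEAVE_SRC );

# Hypothetical helper: returns (decoded string, name of encoding used).
sub decode_guess {
    my ($bytes) = @_;
    # Strict UTF-8 first. Pure ASCII is valid UTF-8, so it is caught here too.
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return ($chars, 'UTF-8') if defined $chars;
    # Anything that is not valid UTF-8 is assumed to be Windows-1252.
    return (decode('cp1252', $bytes), 'cp1252');
}

my ($text,  $enc)  = decode_guess("caf\xC3\xA9");   # "cafe" with e-acute, as UTF-8 bytes
my ($text2, $enc2) = decode_guess("caf\xE9");       # same word, single byte 0xE9
```

Both calls should give back the same four-character string; only the 
reported encoding differs.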

You will have difficulty distinguishing between multiple Windows code 
pages, however, since nearly any byte sequence decodes without error in 
each of them.
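
To illustrate the problem, here is a small sketch that decodes the same 
bytes under two single-byte code pages. The choice of cp850 as the second 
code page is only an assumption for the example; any other single-byte 
code page would show the same thing:

```perl
use strict;
use warnings;
use Encode qw( decode );

my $bytes = "caf\xE9";   # one byte above 0x7F, so not valid UTF-8

# Both decodes succeed, so a decoding failure cannot rule either one out.
my $as_1252 = decode('cp1252', $bytes);   # 0xE9 becomes U+00E9 (e-acute)
my $as_850  = decode('cp850',  $bytes);   # 0xE9 becomes a different character

# Show the code point each code page assigned to the last byte.
printf "cp1252: U+%04X, cp850: U+%04X\n",
    ord(substr($as_1252, -1)), ord(substr($as_850, -1));
```

Telling the two apart would need something beyond decoding, such as 
knowing what the user's client sends or checking the result against 
expected words.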

-Dom

