character set detection?
Dominic Mitchell
dom at happygiraffe.net
Sun Jan 7 13:13:12 GMT 2007
Dirk Koopman wrote:
> On Sat, 2007-01-06 at 23:28 +0000, Ash Berlin wrote:
>> Dirk Koopman wrote:
>>> Is there a way of, reasonably reliably, determining what the character
>>> set of a lump of text is?
>>>
>>>
>> In a(n unhelpful) word: No. Not in a 100% reliable way anyway.
>>
>> Might want to look at http://icu.sourceforge.net/ - it has heuristics to
>> do it (I think.)
>
> What I have is a legacy app that was mainly put together in perl 5.004
> days. It is effectively a specialised chat daemon. People connect to one
> of a couple/four hundred nodes spread around the world using telnet or
> something similar. The node software runs on *n[iu]x boxes and windows.
>
> I am rewriting the core internode protocol and I want to be able to
> convert people's input to UTF8. All input is currently in some variety
> of ASCII (may include windows 3.11 codepages) or some kind of windows /
> iso-8859 code set. Just for fun, some of it appears to be in UTF8
> already! There are no asian / non-western character sets in use (at the
> moment).
>
> Does this reduce the locus of search at all?
That helps a bit. In the past, I've done something similar to:
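    use Encode qw( decode FB_CROAK LEAVE_SRC );

    # LEAVE_SRC stops decode() from emptying $input when a CHECK value is given.
    my $chars = eval { decode('UTF-8', $input, FB_CROAK | LEAVE_SRC) };
    $chars = decode('cp1252', $input) unless defined $chars;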
That is, treat the input as UTF-8 by default (which *is* reliably
recognisable, and also catches plain ASCII), and failing that, treat it
as Windows-1252, which is (more-or-less) a superset of ISO-8859-1.
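To make the fallback concrete, here is a minimal sketch that wraps the same idea in a helper and runs it over a few sample byte strings. The name guess_decode and the sample inputs are made up for illustration; only decode, FB_CROAK and LEAVE_SRC come from Encode. The last two lines show one place where Windows-1252 and ISO-8859-1 disagree: byte 0x80.

```perl
use strict;
use warnings;
use Encode qw( decode FB_CROAK LEAVE_SRC );

# Hypothetical helper (not part of Encode): try strict UTF-8 first,
# and fall back to Windows-1252 if the bytes are not valid UTF-8.
sub guess_decode {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # it from modifying $bytes in place.
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $chars ? $chars : decode('cp1252', $bytes);
}

# Valid UTF-8: 0xC3 0xA9 is the two-byte encoding of e-acute.
my $from_utf8 = guess_decode("caf\xC3\xA9 au lait");

# A lone 0xE9 followed by a space is malformed UTF-8, so cp1252 is used.
my $from_cp1252 = guess_decode("caf\xE9 au lait");

# 0x80 is also malformed UTF-8; cp1252 maps it to the euro sign.
my $euro = guess_decode("\x80 5");

# ISO-8859-1 maps the same byte to a C1 control character instead.
my $latin1 = decode('iso-8859-1', "\x80 5");

printf "same text from both encodings: %s\n",
    ($from_utf8 eq $from_cp1252 ? 'yes' : 'no');
printf "0x80 as cp1252:     U+%04X\n", ord $euro;
printf "0x80 as iso-8859-1: U+%04X\n", ord $latin1;
```

Plain ASCII input takes the UTF-8 branch, since every ASCII string is also valid UTF-8.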
You will have difficulty distinguishing between multiple windows code
pages, however.
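The reason is that single-byte code pages assign a character to nearly every byte value, so decoding rarely fails and there is no error to fall back on. A small sketch, assuming cp1252, cp437 and cp850 as example candidates (the choice of code pages and of byte 0xE9 is illustrative, not taken from the application described above):

```perl
use strict;
use warnings;
use Encode qw( decode );

# One input byte, decoded under several single-byte code pages.
my $byte = "\xE9";
my %codepoint_for;

for my $codepage (qw( cp1252 cp437 cp850 )) {
    # No CHECK argument is needed: each of these code pages defines 0xE9,
    # so every decode succeeds and simply yields a different character.
    my $char = decode($codepage, $byte);
    $codepoint_for{$codepage} = ord $char;
    printf "%-6s 0x%02X -> U+%04X\n", $codepage, ord $byte, ord $char;
}
```

Every candidate produces a result without complaint, so telling them apart needs outside knowledge (who sent the text, which language it is in) or statistical guessing.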
-Dom