character set detection?

Dirk Koopman djk at tobit.co.uk
Sun Jan 7 12:05:43 GMT 2007


On Sat, 2007-01-06 at 23:28 +0000, Ash Berlin wrote:
> Dirk Koopman wrote:
> > Is there a way of, reasonably reliably, determining what the character
> > set of a lump of text is?
> >
> >   
> In a(n unhelpful) word: No. Not in a 100% reliable way anyway.
> 
> Might want to look at http://icu.sourceforge.net/ - it has heuristics to 
> do it (I think.)

What I have is a legacy app that was mainly put together in the Perl 5.004
days. It is effectively a specialised chat daemon. People connect to one
of two to four hundred nodes spread around the world using telnet or
something similar. The node software runs on *n[iu]x boxes and Windows.

I am rewriting the core internode protocol and I want to be able to
convert people's input to UTF-8. All input is currently in some variety
of ASCII (may include Windows 3.11 codepages) or some kind of Windows /
ISO-8859 code set. Just for fun, some of it appears to be in UTF-8
already! There are no Asian / non-Western character sets in use (at the
moment).

Does this reduce the locus of search at all?
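
To make the question concrete, here is a minimal sketch of the kind of
fallback chain those constraints seem to allow. It is in Python purely for
illustration and is not taken from the app. The function name to_utf8 and
the candidate order (strict UTF-8, then cp1252, then ISO-8859-1) are
assumptions. It relies on the fact that non-ASCII text in a single-byte
code set rarely happens to be valid UTF-8, so a strict UTF-8 decode that
succeeds is a fairly strong signal. It cannot tell the single-byte code
sets apart from one another, and old DOS-style codepages would be
mis-decoded without complaint.

```python
def to_utf8(raw: bytes) -> bytes:
    """Best-effort conversion of a line of input to UTF-8 (sketch only)."""
    try:
        # Pure ASCII and genuine UTF-8 both pass a strict decode.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        try:
            # Assumed most likely legacy encoding for western Windows users.
            text = raw.decode("cp1252")
        except UnicodeDecodeError:
            # cp1252 leaves a few bytes undefined; ISO-8859-1 maps every
            # byte to a code point, so this last step cannot fail.
            text = raw.decode("latin-1")
    return text.encode("utf-8")
```

Input that is already ASCII or UTF-8 comes back unchanged; everything else
is treated as a guess and may need a per-user or per-node override.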

Dirk
 


