Language recognition

Yuval Kogman nothingmuch at
Mon Oct 8 17:50:27 BST 2007

On Mon, Oct 08, 2007 at 17:14:18 +0100, Peter Hickman wrote:
> Looking at the public twitter feeds I note that although they are in 
> UTF8 they do not indicate the language that they are in. I realise that 
> this would be somewhat difficult. But just how difficult?
> Given the utf8 entities (is that the correct term) is there an easy way 
> to tell which language it might be from, or at least which script?
> I'm sure something could be hacked up but rather than some ad hoc rules 
> it would appear that this could be derived from the Unicode.

After you've decoded the UTF-8 and have a real Unicode string
(utf8::is_utf8 returns true on the string, etc.), there are various
detection modules on the CPAN.
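
For the "which script" half of the question, here is a minimal sketch
in Python (chosen only for illustration, it is not one of the CPAN
modules). It assumes the first word of a character's Unicode name is a
good enough stand-in for its script, which is a heuristic and not the
official Script property; guess_script is a hypothetical helper name.

```python
import unicodedata
from collections import Counter

def guess_script(raw_bytes):
    """Return the most common script-like name prefix among the letters."""
    # Decode the UTF-8 bytes first, so we work on a real Unicode string.
    text = raw_bytes.decode("utf-8")
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Heuristic (an assumption of this sketch): the first word of
            # the character name is usually the script, for example
            # "CYRILLIC CAPITAL LETTER PE" gives "CYRILLIC".
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    # No letters at all (digits, punctuation only) means no guess.
    return counts.most_common(1)[0][0] if counts else None
```

Calling guess_script on the raw bytes of a feed entry gives one label
per entry; mixed-script text simply reports whichever script has the
most letters.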

However, it's fairly difficult to guess the language based on the
script alone. If you have a set of candidate languages you could build
corpora of common words and run those against the text, scoring based
on the occurrence of words. I suppose that could lead to a fairly
accurate guess.

This is in fact implemented by Text::Language::Guess.
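
The word-scoring idea can be sketched roughly as follows. This is an
illustrative Python sketch, not the Text::Language::Guess API or its
implementation; the tiny word lists and the names COMMON_WORDS,
score_languages and guess_language are assumptions made up for the
example.

```python
import re
from collections import Counter

# Tiny hand-picked word lists, for illustration only (an assumption of
# this sketch); a real corpus would hold far more words per language.
COMMON_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "und", "ist", "das", "nicht", "ein", "zu"},
    "fr": {"le", "la", "et", "est", "les", "des", "un", "une"},
}

def score_languages(text):
    """Count, per language, how many words of the text are common words."""
    words = re.findall(r"\w+", text.lower())
    scores = Counter()
    for lang, common in COMMON_WORDS.items():
        scores[lang] = sum(1 for w in words if w in common)
    return scores

def guess_language(text):
    """Return the best-scoring language code, or None if nothing matched."""
    lang, hits = score_languages(text).most_common(1)[0]
    return lang if hits > 0 else None
```

Short texts such as single feed entries contain few common words, so
it is probably worth looking at the scores themselves and treating a
narrow win as "unknown".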

An alternate method in which you build the corpora from input texts


Good luck

  Yuval Kogman <nothingmuch at>  0xEBD27418
