Language recognition

Mon Oct 8 18:51:59 BST 2007

On Mon, Oct 08, 2007 at 17:14:18 +0100, Peter Hickman wrote:
> Looking at the public twitter feeds I note that although they are in 
> UTF8 they do not indicate the language that they are in. I realise that 
> this would be somewhat difficult. But just how difficult?
> 
> Given the utf8 entities (is that the correct term) is there an easy way 
> to tell which language it might be from, or at least which script?
> 
> I'm sure something could be hacked up but rather than some ad hoc rules 
> it would appear that this could be derived from the Unicode.

Once you've decoded the UTF-8 bytes and have a real Unicode string
(utf8::is_utf8 returns true on the string, etc.), there are various
detection modules on CPAN.
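
For the "which script" half of the question, here is a minimal sketch
in Python rather than Perl. It is a heuristic, not a real script
lookup: it assumes the first word of each character's Unicode name
(LATIN, CYRILLIC, HIRAGANA, CJK, ...) is a good enough stand-in for
the script. The function names are made up for illustration.

```python
import unicodedata
from collections import Counter

def script_counts(raw: bytes) -> Counter:
    """Decode UTF-8 bytes and tally letters by a rough script label."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation and whitespace
        name = unicodedata.name(ch, "")
        if name:
            # Assumption: the first word of the character name
            # approximates the script, e.g. "CYRILLIC CAPITAL LETTER PE".
            counts[name.split()[0]] += 1
    return counts

def dominant_script(raw: bytes):
    """Return the most frequent script label, or None if no letters."""
    counts = script_counts(raw)
    return counts.most_common(1)[0][0] if counts else None
```

This only narrows things down to a script; Latin text, for instance,
could still be any of dozens of languages.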

However, it's fairly difficult to guess the language based on the
script alone. If you have a known set of candidate languages, you could
build corpora of common words and run those against the text, scoring
each language by how often its words occur. I suppose that could lead
to a fairly accurate guess.

This is in fact implemented by Text::Language::Guess.
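
To make the scoring idea concrete, here is a rough sketch in Python.
It is not the module's code or API, just the general approach. The
word lists are tiny hand-picked samples and the function names are
hypothetical; a usable version would need proper stop-word lists for
each language.

```python
import re

# Assumption: tiny illustrative word lists, far too small for real use.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "les", "et", "est", "un", "une", "que"},
}

def score_languages(text: str) -> dict:
    """Count, per language, how many words of the text are common words."""
    words = re.findall(r"\w+", text.lower())
    return {
        lang: sum(1 for w in words if w in common)
        for lang, common in STOPWORDS.items()
    }

def guess_language(text: str):
    """Return the best-scoring language, or None if nothing matched."""
    scores = score_languages(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Very short texts (and a tweet is short) may contain no common words
at all, which is why the sketch returns None instead of guessing.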

An alternative method, in which you build the corpora from input
texts, is:

Text::Ngram::LanguageDetermine
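
The general n-gram idea, sketched in Python (I am not claiming this is
what that module does internally): build a character trigram profile
from a sample text per language, then pick the profile closest to the
input. Cosine similarity is my choice of distance here, and the sample
sentences and names are invented for illustration.

```python
import math
from collections import Counter

def trigrams(text: str) -> Counter:
    """Character trigram counts, with whitespace normalised and padded."""
    s = " " + " ".join(text.lower().split()) + " "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(n * b[g] for g, n in a.items())  # missing keys count as 0
    norm = math.sqrt(sum(n * n for n in a.values())) * \
           math.sqrt(sum(n * n for n in b.values()))
    return dot / norm if norm else 0.0

def train(samples: dict) -> dict:
    """Build one trigram profile per language from sample texts."""
    return {lang: trigrams(text) for lang, text in samples.items()}

def classify(text: str, profiles: dict) -> str:
    """Return the language whose profile is most similar to the text."""
    probe = trigrams(text)
    return max(profiles, key=lambda lang: cosine(probe, profiles[lang]))

# Assumption: toy training data; real profiles need far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog "
          "and then the dog sleeps in the sun",
    "de": "der schnelle braune fuchs springt über den faulen hund "
          "und dann schläft der hund in der sonne",
}
PROFILES = train(SAMPLES)
```

With that in place, classify("the dog and the fox", PROFILES) picks
"en". The advantage over word lists is that you only need sample text
for each language, not a curated vocabulary.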

Good luck

-- 
  Yuval Kogman <nothingmuch at woobling.org>
http://nothingmuch.woobling.org  0xEBD27418