Language recognition

Philip Newton philip.newton at gmail.com
Mon Oct 8 17:48:36 BST 2007


On 10/8/07, Peter Hickman <peter.hickman at semantico.com> wrote:
> Looking at the public twitter feeds I note that although they are in
> UTF8 they do not indicate the language that they are in. I realise that
> this would be somewhat difficult. But just how difficult?
>
> Given the utf8 entities (is that the correct term) is there an easy way
> to tell which language it might be from, or at least which script?

Which script is easy if you go by Unicode: every character has a known
script assignment, so you can look it up per code point.
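
A rough sketch of the script lookup, in Python. Note that Python's
standard unicodedata module does not expose the Unicode Script property
directly, so this uses the first word of each character's Unicode name
(LATIN, CYRILLIC, GREEK, ...) as a stand-in. That is an approximation I'm
assuming is good enough here, and guess_script is just a name made up
for illustration.

```python
import unicodedata
from collections import Counter


def guess_script(text):
    """Guess the dominant script of text.

    Heuristic: take the first word of each letter's Unicode character
    name (e.g. "LATIN SMALL LETTER A" -> "LATIN") as a proxy for its
    script. This is NOT the real Unicode Script property, only an
    approximation of it.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    if not counts:
        return None  # no letters at all
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ["hello world", "Привет мир", "γειά σου"]:
        print(sample, "->", guess_script(sample))
```

Taking the most common label rather than the first one seen means a
stray foreign character or two won't change the answer.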

Which language is less trivial, but not impossible. I think trigram
(or, in general, n-gram) analysis is one method: different languages
have different statistical distributions of the various n-grams, so
once you've built a database of n-gram frequencies for each language
you care about, you can compare new text against each one and see
which fits best.
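
To make the trigram idea concrete, here is a minimal sketch in Python.
The SAMPLES texts are tiny made-up stand-ins for a real per-language
database, and scoring by cosine similarity is just one simple choice of
mine (other comparison schemes exist). All the names are hypothetical.

```python
import math
from collections import Counter


def trigrams(text):
    """Count character trigrams, with whitespace normalised and padded."""
    text = " " + " ".join(text.lower().split()) + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training data (assumption): a real database would be built from
# far more text per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the cat sat on the mat with this and that",
    "de": "der schnelle braune fuchs springt über den faulen hund "
          "und die katze sitzt auf der matte mit dem ding",
}
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def guess_language(text):
    """Return the language whose profile is most similar to text."""
    profile = trigrams(text)
    scores = {lang: cosine(profile, p) for lang, p in PROFILES.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(guess_language("the dog and the cat"))
    print(guess_language("der hund und die katze"))
```

With samples this small the results are only indicative, and something
as short as a tweet will be harder to classify than a paragraph.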

Cheers,
-- 
Philip Newton <philip.newton at gmail.com>

