Language recognition

Kaoru kaoru at
Mon Oct 8 18:23:06 BST 2007

On 10/8/07, Philip Newton <philip.newton at> wrote:
> Identifying which language is not as trivial, but not impossible. I think trigram
> (or, in general, n-gram) analysis is one method; different languages
> have different statistical distributions of the various n-grams, so
> once you've built a database for a given language you can use that to
> analyse data.

If you want to go beyond just Unicode entity recognition, then
check out Lingua::Identify [0], which already uses n-grams to identify
text from a (fairly small, IMO) set of languages.

I haven't used it, but I've found other Lingua modules useful in the past.
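
To make the n-gram idea concrete, here is a rough sketch in Python of how
trigram profiles could be built and compared. It is only a toy under my own
assumptions, not how Lingua::Identify works internally: the function names
(trigram_profile, cosine_similarity, identify), the tiny made-up training
sentences, and the choice of cosine similarity as the comparison measure are
all just for illustration.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Return the relative frequency of each character trigram in text."""
    # Lowercase and collapse whitespace; pad with spaces so that word
    # starts and ends show up as trigrams too.
    cleaned = " " + " ".join(text.lower().split()) + " "
    counts = Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse trigram frequency vectors."""
    dot = sum(weight * q.get(gram, 0.0) for gram, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def identify(text, profiles):
    """Return the key of the stored profile most similar to text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(sample, profiles[lang]))


# Hypothetical, far-too-small training data; a real database would be
# built from a large corpus per language.
TRAINING = {
    "en": "the quick brown fox jumps and then the cat sat on the mat "
          "with the other animals next to the lazy dog",
    "de": "der schnelle braune fuchs springt und dann sitzt die katze "
          "auf der matte mit den anderen tieren neben dem faulen hund",
    "fr": "le renard brun rapide saute et puis le chat est assis sur "
          "le tapis avec les autres animaux du chien paresseux",
}
PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING.items()}

if __name__ == "__main__":
    print(identify("the dog and the cat", PROFILES))
```

With profiles this small the guess is only reliable for text that looks a lot
like the training sentences; the approach depends on having a decent-sized
database per language, as Philip says.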

- Alex / Kaoru

