Language recognition

Kaoru kaoru at slackwise.net
Mon Oct 8 18:23:06 BST 2007


On 10/8/07, Philip Newton <philip.newton at gmail.com> wrote:
> Which language is not as trivial but not impossible. I think trigram
> (or, in general, n-gram) analysis is one method; different languages
> have different statistical distributions of the various n-grams, so
> once you've built a database for a given language you can use that to
> analyse data.
>

If you want to break away from just Unicode entity recognition, then
check out Lingua::Identify [0], which already uses n-grams to identify
text from a (fairly small, imo) set of languages.

I haven't used it, but I've found other Lingua modules useful in the past.
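
For a rough idea of what the trigram approach Philip describes looks like,
here is a minimal sketch in Python. It is not how Lingua::Identify works
internally (I haven't checked), and the function names, the cosine
similarity scoring and the one-sentence training samples are all my own
assumptions for illustration. A real profile would be built from a much
larger corpus per language.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Return the relative frequency of each character trigram in text."""
    # Normalise case and whitespace, then pad so word edges form trigrams too.
    cleaned = " ".join(text.lower().split())
    padded = f" {cleaned} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse trigram frequency profiles."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def guess_language(text, profiles):
    """Return the key of the stored profile most similar to text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(sample, profiles[lang]))


# Toy "database": one sentence per language (an assumption for illustration).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on "
          "the mat with the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund und die "
          "katze sitzt auf der matte",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et le "
          "chat est assis sur le tapis",
}
PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING.items()}

if __name__ == "__main__":
    print(guess_language("the dog sat on the mat", PROFILES))
```

With samples this small the guess is only reliable for text that shares
words with the training sentences, which is why the size of the database
matters so much.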

- Alex / Kaoru

[0] http://search.cpan.org/~cog/Lingua-Identify-0.19/lib/Lingua/Identify.pm

