kaoru at slackwise.net
Mon Oct 8 18:23:06 BST 2007
On 10/8/07, Philip Newton <philip.newton at gmail.com> wrote:
> Which language is not as trivial but not impossible. I think trigram
> (or, in general, n-gram) analysis is one method; different languages
> have different statistical distributions of the various n-grams, so
> once you've built a database for a given language you can use that to
> analyse data.
If you want to go beyond plain Unicode entity recognition, then
check out Lingua::Identify, which already uses n-grams to identify
text from a (fairly small, IMO) set of languages.
I haven't used it, but I've found other Lingua modules useful in the past.
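For anyone who wants to see the trigram idea from the quoted text in miniature, here is a rough sketch in Python. It is not how Lingua::Identify works internally (I haven't checked), just the general technique: build a trigram frequency profile per language, then pick the profile closest to the input. The names trigrams, cosine, SAMPLES, PROFILES and guess_language are my own, and the training sentences are toy data invented for illustration; a real database would need far more text per language.

```python
from collections import Counter
from math import sqrt


def trigrams(text):
    """Count overlapping character trigrams in lower-cased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy training text (an assumption for illustration, not a real corpus).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the cat "
          "sat on the mat with the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "die katze sitzt auf der matte",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et "
          "le chat est assis sur le tapis",
}

# The "database": one trigram profile per language.
PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}


def guess_language(text):
    """Return the language code whose profile is most similar to the text."""
    probe = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(probe, PROFILES[lang]))
```

With these toy profiles, guess_language("the dog and the cat") picks "en" and guess_language("der hund und die katze") picks "de". Short inputs and closely related languages will trip it up quickly, which is presumably why real modules combine several methods.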
- Alex / Kaoru