Language recognition

Adrian Howard adrianh at quietstars.com
Tue Oct 9 09:57:58 BST 2007


On 8 Oct 2007, at 17:48, Philip Newton wrote:

> On 10/8/07, Peter Hickman <peter.hickman at semantico.com> wrote:
>> Looking at the public twitter feeds I note that although they are in
>> UTF8 they do not indicate the language that they are in. I realise
>> that this would be somewhat difficult. But just how difficult?
>>
>> Given the utf8 entities (is that the correct term) is there an easy
>> way to tell which language it might be from, or at least which script?
>
> Which script is easy if you go by Unicode.
>
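
For illustration, a rough sketch of the script check in Python. It assumes
that the first word of each character's Unicode name (LATIN, CYRILLIC,
HIRAGANA, ...) is a good enough stand-in for the script; that is a
heuristic, not the real Unicode Script property, which the standard
library does not expose. The function name guess_script is made up for
this example.

```python
import unicodedata
from collections import Counter


def guess_script(text):
    """Return the most common script-like name prefix among letters in text.

    Assumption: the first word of a character's Unicode name approximates
    its script. Returns None when the text contains no letters.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation and whitespace
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Calling guess_script() on a tweet gives a single label for the dominant
script; mixed-script text would need the full counts rather than just the
winner.
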
> Which language is not as trivial but not impossible. I think trigram
> (or, in general, n-gram) analysis is one method; different languages
> have different statistical distributions of the various n-grams, so
> once you've built a database for a given language you can use that to
> analyse data.
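
A minimal sketch of the trigram idea, in Python for illustration. The
"database" here is just a trigram count per language built from one
made-up sentence each, and cosine similarity is one possible way to
compare profiles; neither is from the thread, and a usable profile would
need far more training text. All names (SAMPLES, PROFILES, guess_language)
are hypothetical.

```python
import math
from collections import Counter


def trigrams(text):
    """Count overlapping character trigrams in lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(n * b[g] for g, n in a.items())
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training data (an assumption for this example, far too small for real use).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then it runs to the house of the king",
    "french": "le renard brun rapide saute par dessus le chien paresseux "
              "et puis il court vers la maison du roi",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und dann läuft er zum haus des königs",
}
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def guess_language(text):
    """Return the language whose trigram profile is most similar to text."""
    profile = trigrams(text)
    scores = {lang: cosine(profile, ref) for lang, ref in PROFILES.items()}
    return max(scores, key=scores.get)
```

Short inputs such as tweets give few trigrams, so the scores for related
languages can end up close together; larger per-language profiles help.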

I have some rules for identifying languages, taken from an (old) expert
system, lying around somewhere... I can probably dig them out. They only
covered a few European languages, though.

Adrian

