adrianh at quietstars.com
Tue Oct 9 09:57:58 BST 2007
On 8 Oct 2007, at 17:48, Philip Newton wrote:
> On 10/8/07, Peter Hickman <peter.hickman at semantico.com> wrote:
>> Looking at the public twitter feeds I note that although they are in
>> UTF8 they do not indicate the language that they are in. I realise
>> this would be somewhat difficult. But just how difficult?
>> Given the utf8 entities (is that the correct term) is there an easy
>> way to tell which language it might be from, or at least which script?
> Which script is easy if you go by Unicode.
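As a rough illustration of the "go by Unicode" point, here is a minimal Python sketch. It is only a heuristic: Python's standard library does not expose the Unicode Script property directly, so this assumes the first word of each character's Unicode name (LATIN, CYRILLIC, GREEK, ...) is a good enough stand-in. The function name guess_script is made up for the example.

```python
import unicodedata
from collections import Counter


def guess_script(text):
    """Return the most common script-like name prefix among the letters in text.

    Heuristic only: uses the first word of the Unicode character name
    (e.g. 'CYRILLIC SMALL LETTER YA' -> 'CYRILLIC') as a proxy for the
    real Script property, which the stdlib does not provide.
    """
    counts = Counter()
    for ch in text:
        # Skip digits, punctuation and whitespace; they are shared across scripts.
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    # None means the text contained no letters at all.
    return counts.most_common(1)[0][0] if counts else None
```

Calling guess_script("Привет, мир") should give "CYRILLIC". A proper implementation would look up the Script property from the Unicode data files instead of relying on character names.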
> Which language is not as trivial but not impossible. I think trigram
> (or, in general, n-gram) analysis is one method; different languages
> have different statistical distributions of the various n-grams, so
> once you've built a database for a given language you can use that to
> analyse data.
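The trigram idea described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the two tiny training sentences, the names trigrams, cosine and guess_language, and the choice of cosine similarity to compare profiles (other distance measures are used in practice). A usable database would need far more text per language.

```python
import math
from collections import Counter


def trigrams(text):
    """Count overlapping character trigrams, with padding at word boundaries."""
    padded = "  " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def guess_language(text, profiles):
    """Return the language whose profile is most similar to the text."""
    sample = trigrams(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))


# Toy training data (hypothetical); real profiles need much larger corpora.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then the dog sleeps in the sun",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und dann schläft der hund in der sonne",
}
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}
```

With these toy profiles, guess_language("der hund und der fuchs", PROFILES) picks "german". Short inputs such as tweets give few trigrams to work with, so results on them will be noisier than on longer text.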
I did have some rules from an (old) expert system for identifying
languages lying around somewhere... I can probably dig them out. They
only covered a few European languages, though...
More information about the london.pm