Language recognition

Peter Hickman peter.hickman at semantico.com
Mon Oct 8 17:14:18 BST 2007


Looking at the public Twitter feeds, I note that although they are in 
UTF-8 they do not indicate which language they are written in. I realise 
that detecting this would be somewhat difficult. But just how difficult?

Given the UTF-8 entities (is that the correct term?), is there an easy 
way to tell which language the text might be in, or at least which script?

I'm sure something could be hacked up, but rather than relying on some 
ad hoc rules it would appear that this could be derived from the Unicode 
data itself.
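
To make the idea concrete, here is a rough sketch in Python of what I 
mean by deriving the script from Unicode. It assumes that the first word 
of a character's official Unicode name (for example "CYRILLIC" in 
"CYRILLIC SMALL LETTER A") is a good enough stand-in for its script. 
That is only a heuristic, not the real Unicode Script property, and the 
function names guess_scripts and dominant_script are made up for 
illustration.

```python
import unicodedata
from collections import Counter


def guess_scripts(text):
    """Count letters per script, using the first word of each
    character's Unicode name as an approximate script label."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation and whitespace
        name = unicodedata.name(ch, "")  # "" if the character is unnamed
        if name:
            counts[name.split()[0]] += 1
    return counts


def dominant_script(text):
    """Return the most frequent script label, or None if no letters."""
    counts = guess_scripts(text)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    print(dominant_script("hello world"))  # LATIN
    print(dominant_script("Привет"))       # CYRILLIC
```

This only gets as far as the script. Telling apart languages that share 
a script (English and French, say) would presumably need something 
statistical on top, such as comparing letter or word frequencies.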

Any pointers?

-- 
Peter Hickman.

Semantico, Lees House, 21-23 Dyke Road, Brighton BN1 3FE
t: 01273 358223
f: 01273 723232
e: peter.hickman at semantico.com
w: www.semantico.com



More information about the london.pm mailing list