deduping mangled UTF8 strings
nick at ccl4.org
Mon Jun 18 10:49:36 BST 2007
On Mon, Jun 18, 2007 at 10:11:31AM +0100, Dirk Koopman wrote:
> Consider all these strings. They are all the same, but have been mangled
> by various pieces of software (that don't understand utf8). The original
> is obviously the last one (shame it didn't arrive first, but that is
> part of the problem).
Yes, I know you didn't mail this, but here's how my setup mangled it:
> Radio H???licopt???re combats
> Radio Hilicopthre combats
> Radio Hélicoptère combats
Do you also need to include these?
Radio Helicoptere combats
Radio H?licopt?re combats
> I would like to deduplicate them. Any version of one of these strings
> can come in in any order. Any suggestions?
If it's just the three forms you suggest, the heuristic seems to be that
you need to:
1: see if the octet stream is valid UTF-8. If so, convert it to ISO-8859-1
2: now strip the top bit from all characters
so the smashed-to-ASCII version is your "canonical" form. But obviously there
could be two or more accented strings that mangle to the same ASCII.
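
A rough sketch of the two steps above, in Python purely for illustration. The
names canonical_form and dedupe are made up for this example, and two choices
here are my own assumptions rather than part of the heuristic: input that is
valid UTF-8 but does not fit in ISO-8859-1 is simply treated as raw octets, and
dedupe keeps whichever variant arrives first.

```python
def canonical_form(octets: bytes) -> bytes:
    """Smash an octet string to 7-bit ASCII (hypothetical helper)."""
    try:
        # Step 1: if the octets are valid UTF-8, convert to ISO-8859-1.
        latin1 = octets.decode("utf-8").encode("iso-8859-1")
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Not valid UTF-8, or contains characters outside ISO-8859-1.
        # Assumption: fall back to treating the input as raw octets.
        latin1 = octets
    # Step 2: strip the top bit from every octet.
    return bytes(b & 0x7F for b in latin1)


def dedupe(strings):
    """Keep the first-seen variant for each canonical form."""
    seen = {}
    for s in strings:
        seen.setdefault(canonical_form(s), s)
    return list(seen.values())


utf8 = "Radio H\u00e9licopt\u00e8re combats".encode("utf-8")
latin1 = "Radio H\u00e9licopt\u00e8re combats".encode("iso-8859-1")
stripped = b"Radio Hilicopthre combats"

# All three variants collapse to the same canonical octets.
print(canonical_form(utf8) == canonical_form(latin1) == stripped)
print(len(dedupe([stripped, latin1, utf8])))
```

The collision problem shows up straight away: e-acute (0xE9) with the top bit
stripped is 0x69, a plain "i", so canonical_form(b"H\xe9") and
canonical_form(b"Hi") come out identical.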
If you need to cope with the other two forms, I can't see a reliable way to
do it. If you have characters outside the ISO-8859-1 range, offhand I can't
see a good way to "canonicalise" them.