deduping mangled UTF8 strings

Nicholas Clark nick at ccl4.org
Mon Jun 18 10:49:36 BST 2007


On Mon, Jun 18, 2007 at 10:11:31AM +0100, Dirk Koopman wrote:
> Consider all these strings. They are all the same, but have been mangled 
> by various pieces of software (that don't understand utf8). The original 
> is obviously the last one (shame it didn't arrive first, but that is 
> part of the problem).
> 

Yes, I know you didn't mail this, but here's how my setup mangled it:

> Radio H???licopt???re combats
> Radio Hilicopthre combats
> Radio Hélicoptère combats

Do you also need to include

Radio Helicoptere combats
Radio H?licopt?re combats

?

> I would like to deduplicate them. Any version of one of these strings 
> can come in in any order. Any suggestions?

If it's just the three forms you suggest, then the heuristic seems to be that
you need to

1: see if the octet stream is valid UTF-8. If so, convert it to ISO-8859-1
2: now strip the top bit from all characters

so the smashed-to-ASCII version is your "canonical" form. But obviously there
could be two or more accented strings that mangle to the same ASCII.
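
A minimal sketch of those two steps, in Python purely for illustration.
The names canonical_key and group_duplicates are made up for this example,
and leaving the octets untouched when the text won't fit in ISO-8859-1 is
my assumption, since the heuristic says nothing about that case:

```python
def canonical_key(raw: bytes) -> bytes:
    """Smash an octet string down to the 7-bit 'canonical' form."""
    try:
        # Step 1: if the octets are valid UTF-8, convert to ISO-8859-1.
        latin1 = raw.decode("utf-8").encode("latin-1")
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Not valid UTF-8: assume it is already ISO-8859-1 (or ASCII).
        # Valid UTF-8 outside ISO-8859-1: no good answer, so (assumption)
        # fall back to the raw octets.
        latin1 = raw
    # Step 2: strip the top bit from every octet.
    return bytes(b & 0x7F for b in latin1)


def group_duplicates(strings):
    """Group octet strings that share the same canonical key."""
    groups = {}
    for raw in strings:
        groups.setdefault(canonical_key(raw), []).append(raw)
    return groups


original = "Radio H\u00e9licopt\u00e8re combats"
variants = [
    original.encode("utf-8"),      # intact UTF-8
    original.encode("latin-1"),    # ISO-8859-1
    b"Radio Hilicopthre combats",  # top bit already stripped
]
for key, members in group_duplicates(variants).items():
    print(key, len(members))
```

All three variants end up under the key "Radio Hilicopthre combats". The
collision problem shows up straight away: a genuine "i" and a stripped
e-acute both become "i", so two different strings can share a key.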


If you need to cope with the other two forms, I can't see a reliable way to
do it. If you have characters outside the ISO-8859-1 range, offhand I can't
see a good way to "canonicalise" them.

Nicholas Clark