deduping mangled UTF8 strings

Mon Jun 18 10:36:45 BST 2007

On Mon, 2007-06-18 at 10:11 +0100, Dirk Koopman wrote:
> Consider all these strings. They are all the same, but have been mangled 
> by various pieces of software (that don't understand utf8). The original 
> is obviously the last one (shame it didn't arrive first, but that is 
> part of the problem).
> 
> Radio H�licopt�re combats
> Radio Hilicopthre combats
> Radio Hélicoptère combats
> 
> I would like to deduplicate them. Any version of one of these strings 
> can come in in any order. Any suggestions?

Depends on the shape of your data, but you could try an approach similar
to the one used for matching typos in search: strip all the vowels
(including y and h in the case I'm thinking of; you may need to lose some
other consonants as well), then look for matches between the resulting
strings.  Obviously false positives might be a problem, hence 'depends on
the shape of your data'.
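
A minimal sketch of that idea in Python, assuming the strings have already
been decoded to text. The names skeleton and group_by_skeleton are made up
for illustration. It also drops every non-ASCII character, which is an
extra assumption: the accented letters are exactly what got mangled here,
so they cannot be part of the key.

```python
import re
from collections import defaultdict

# Keep only ASCII consonants. Vowels, h, y, spaces, punctuation and all
# non-ASCII characters (including U+FFFD) are dropped. Assumption: the
# mangling only ever touches accented letters.
_DROP = re.compile(r"[^bcdfgjklmnpqrstvwxz]")


def skeleton(text):
    """Return a consonant-only key for matching mangled variants."""
    return _DROP.sub("", text.lower())


def group_by_skeleton(strings):
    """Bucket strings that share the same consonant skeleton."""
    groups = defaultdict(list)
    for s in strings:
        groups[skeleton(s)].append(s)
    return dict(groups)


variants = [
    "Radio H\ufffdlicopt\ufffdre combats",
    "Radio Hilicopthre combats",
    "Radio H\u00e9licopt\u00e8re combats",
]
groups = group_by_skeleton(variants)
```

All three example strings reduce to the key "rdlcptrcmbts", so they land in
one bucket. Short strings, or strings that differ only in their vowels, will
collide too, which is the false-positive risk mentioned above.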

Not sure how you're going to pull out the correct version though...  In
the event of duplicates, perhaps look for a string containing valid UTF-8
data?
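
One way that check might look in Python, assuming each duplicate is still
available as raw bytes. The ranking is a guess, not an established rule:
bytes that fail a strict UTF-8 decode rank lowest, then text containing
U+FFFD replacement characters, then pure ASCII (which may have had its
accents stripped), with valid non-ASCII UTF-8 preferred. The names score
and pick_best are hypothetical.

```python
def score(raw):
    """Rank a byte string by how likely it is to be the unmangled original."""
    try:
        text = raw.decode("utf-8")  # strict decoding: raises on invalid bytes
    except UnicodeDecodeError:
        return 0  # not valid UTF-8 at all
    if "\ufffd" in text:
        return 1  # valid, but something was already replaced upstream
    if text.isascii():
        return 2  # valid, but accents may have been lost
    return 3  # valid UTF-8 that still carries non-ASCII characters


def pick_best(candidates):
    """Return the highest-scoring candidate (the first one wins a tie)."""
    return max(candidates, key=score)


candidates = [
    b"Radio H\xe9licopt\xe8re combats",                     # raw Latin-1 bytes
    "Radio H\ufffdlicopt\ufffdre combats".encode("utf-8"),  # replaced chars
    b"Radio Hilicopthre combats",                           # ASCII-only mangling
    "Radio H\u00e9licopt\u00e8re combats".encode("utf-8"),  # intact original
]
best = pick_best(candidates).decode("utf-8")
```

If the original string really was plain ASCII, every candidate in the group
scores the same and the first one seen wins, so this only helps when at
least one intact copy has arrived.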

Regards,
Denny