deduping mangled UTF8 strings
london.pm at metamathics.org
Mon Jun 18 10:36:45 BST 2007
On Mon, 2007-06-18 at 10:11 +0100, Dirk Koopman wrote:
> Consider all these strings. They are all the same, but have been mangled
> by various pieces of software (that don't understand utf8). The original
> is obviously the last one (shame it didn't arrive first, but that is
> part of the problem).
> Radio H�licopt�re combats
> Radio Hilicopthre combats
> Radio Hélicoptère combats
> I would like to deduplicate them. Any version of one of these strings
> can come in in any order. Any suggestions?
Depends on the shape of your data, but you could try an approach similar
to the one used for matching typos in search - strip all the vowels
(including y and h in the case I'm thinking of - you may have some
different consonants that you need to lose) and then match on the
resulting strings. Obviously false positives might be a problem, hence
'depends on the shape of your data'.
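
A rough sketch of the vowel-stripping idea in Python, not a tested
solution. The names dedupe_key and group_duplicates are made up for
illustration. It also assumes that anything outside a-z (accented
letters, replacement characters, punctuation) should be dropped along
with the vowels, which is what makes the three sample strings collapse
to the same key.

```python
import string
from collections import defaultdict

# Letters kept in the key: ASCII consonants minus y and h (an assumption
# taken from the suggestion above; adjust the dropped set for your data).
KEPT = set(string.ascii_lowercase) - set("aeiouyh")


def dedupe_key(text):
    """Reduce a string to a lossy key made of its surviving consonants."""
    words = []
    for word in text.lower().split():
        consonants = "".join(ch for ch in word if ch in KEPT)
        if consonants:
            words.append(consonants)
    return " ".join(words)


def group_duplicates(strings):
    """Group strings that share a key; each group is a set of likely duplicates."""
    groups = defaultdict(list)
    for s in strings:
        groups[dedupe_key(s)].append(s)
    return dict(groups)
```

With the three strings quoted above, every variant reduces to the key
"rd lcptr cmbts". The false-positive risk is real, though: "Radio" and
"Rodeo" both reduce to "rd".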
Not sure how you're going to pull out the correct version though... in
the event of duplicates, look for a string containing valid UTF8 data?
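
One possible way to pick the "correct" version, assuming the candidates
are still available as raw bytes. The function names and the scoring
are hypothetical. Plain ASCII is also valid UTF-8, so this sketch
prefers a candidate that decodes cleanly and actually contains
non-ASCII characters, and it ranks anything holding U+FFFD replacement
characters last.

```python
def utf8_score(raw):
    """Rank a byte string: 2 = clean UTF-8 with non-ASCII, 1 = pure ASCII, 0 = damaged."""
    try:
        text = raw.decode("utf-8")  # strict decoding raises on invalid bytes
    except UnicodeDecodeError:
        return 0
    if "\ufffd" in text:
        # Some earlier step already replaced bytes it could not decode.
        return 0
    if any(ord(ch) > 127 for ch in text):
        return 2
    return 1


def pick_original(candidates):
    """Return the candidate most likely to be the unmangled string."""
    return max(candidates, key=utf8_score)
```

If every candidate in a group is plain ASCII there is nothing to tell
them apart, and this just returns the first one seen.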