Matching algorithms

Mon Jun 23 16:59:42 BST 2008

On Mon, 2008-06-23 at 15:20 +1000, Toby Corkindale wrote:
> I'm looking at comparing sets of data for similarities, and wondering if there
> are any funky algorithms that will speed things up a bit..
> 
> I want to compare @list1 with @list2.. and then with @list3 and @list4, etc.
> Then I'm probably going to compare @list2 with @list3 and @list4.. You can see
> where this is going.
> 
> The thing is, I only care about getting the approximately-closest-matching
> sets. And a lot of these sets are going to checked vs a new set over and over.
> So I'm wondering if I could take some heuristic value from the array,
> and use that for comparison.. kind of like hashing, but an approximate hash?
> 
> Any thoughts?

A few people have suggested (relatively) low-tech solutions including
Levenstein.

The high tech solution would be to generate a hyperspace, probably on
booleans in your case, from your data, and use something like IBk or
spectral clustering. I think I have another algorithm which will
outperform spectral clustering, but I haven't done a test implementation
yet.

These approaches are nice because you get to choose your definition(s)
of "similarity" and generate your dimensions using those definitions.
Also, they work with non-boolean features.

S.

-- 
Shevek
"I have never killed a man, but I have read many obituaries with great
pleasure." - Clarence Darrow

"I didn't attend the funeral, but I sent a nice letter saying I approved
of it." - Mark Twain