Matching algorithms

Tue Jun 24 05:20:42 BST 2008

On Mon, Jun 23, 2008 at 04:59:42PM +0100, Shevek wrote:
> On Mon, 2008-06-23 at 15:20 +1000, Toby Corkindale wrote:
> > I'm looking at comparing sets of data for similarities, and wondering if there
> > are any funky algorithms that will speed things up a bit..
> > 
> > I want to compare @list1 with @list2.. and then with @list3 and @list4, etc.
> > Then I'm probably going to compare @list2 with @list3 and @list4.. You can see
> > where this is going.
> > 
> > The thing is, I only care about getting the approximately-closest-matching
> > sets. And a lot of these sets are going to checked vs a new set over and over.
> > So I'm wondering if I could take some heuristic value from the array,
> > and use that for comparison.. kind of like hashing, but an approximate hash?
> > 
> > Any thoughts?
> 
> A few people have suggested (relatively) low-tech solutions including
> Levenstein.
> 
> The high tech solution would be to generate a hyperspace, probably on
> booleans in your case, from your data, and use something like IBk or
> spectral clustering. I think I have another algorithm which will
> outperform spectral clustering, but I haven't done a test implementation
> yet.
> 
> These approaches are nice because you get to choose your definition(s)
> of "similarity" and generate your dimensions using those definitions.
> Also, they work with non-boolean features.

Thanks; None of these terms (IBk, spectral clustering, etc) mean much to me,
sadly, but the terms are enough for me to go off and read up on them.

There's even a page on Wikipedia for nearest-neighbour algorithms with yet more
links.. I feel like I'm potentially biting off more than I can chew with all
this information, but we'll see how it works out.

thanks again for the pointers,
Toby