Matching algorithms

Mon Jun 23 07:03:42 BST 2008

I don't have to read the entire program in detail right now, but
here are a few leads.

Generally the way to index this is using inverted indexes, where
every element poitns to the lists it's a member in, and that way you
can cut the search space down, because you can exclude everything
that doesn't have that element.

This fails when you have few distinct elements (e.g. in a spell
checker, if every element in the index was a letter you probably
wouldn't gain muc at all).

There are several solutions:

If the lists are all of roughly the same size, you can hash chunks
of the hash, where the key is not only the value but also the
position.

The ideal chunk size is the largest sublist that must be the same.
e.g. for 100 elements with 90% bit similarity this is 9 elements,
because even if the elements are evenly distributed at least one of
the sublists still has to be identical. The edit distance might
actually be smaller (more similar) than the bit difference though.

Then you go through the candidate, testing the chunks in turn, and
doing the expensive comparison on any match.

The rationale is that for a given threshold of similarity, at least
some of the data must be similar, and you can choose a chunk size
based on the desired similarity threshold.

Another way to do a related form matching is using bloom filters (see
http://en.wikipedia.org/wiki/Bloom_filter for a detailed
explanation), which is a similar approach that can be either more or
less efficient depending on the type of data you are dealing with.
There are several implementations of bloom filters on the CPAN.
Bloom filters are like a hash where false positives are allowed, but
false negatives aren't, and you can control the accuracy. It might
be efficient to use bloom filters for the elements instead of
hashes.

The last thing you might want to look at is the Nilsimsa hash
(Digest::Nilsimsa is on the CPAN). This produces a hash that can be
compared, and the bit difference roughly translates to input
difference. You can hash all your lists, and then index the hashes
with the techniques mentioned above, which should be easier than
indexing the whole list of elements, and will probably produce
smaller indices, but more false positives. More on Nilsimsa here:
http://ixazon.dynip.com/~cmeclax/nilsimsa.html

Regards,
Yuval

On Mon, Jun 23, 2008 at 15:20:02 +1000, Toby Corkindale wrote:
> I'm looking at comparing sets of data for similarities, and wondering if there
> are any funky algorithms that will speed things up a bit..
> 
> I want to compare @list1 with @list2.. and then with @list3 and @list4, etc.
> Then I'm probably going to compare @list2 with @list3 and @list4.. You can see
> where this is going.
> 
> The thing is, I only care about getting the approximately-closest-matching
> sets. And a lot of these sets are going to checked vs a new set over and over.
> So I'm wondering if I could take some heuristic value from the array,
> and use that for comparison.. kind of like hashing, but an approximate hash?
> 
> Any thoughts?
> 
> I'll attach an example program that demonstrates what I'm doing, and uses the
> slow loops-inside-loops method :(
> 
> Cheers,
> Toby

-- 
  Yuval Kogman <nothingmuch at woobling.org>
http://nothingmuch.woobling.org  0xEBD27418