Matching algorithms

Tue Jun 24 10:05:37 BST 2008

On Mon, Jun 23, 2008 at 4:59 PM, Shevek <shevek at anarres.org> wrote:
> A few people have suggested (relatively) low-tech solutions including
> Levenstein.
>

I agree, Levenshtein can sometimes produce ridiculously misleading
results. This is because it always looks for the cheapest way
to transform one string into the other by single actions while this
may not be the case.

> The high tech solution would be to generate a hyperspace, probably on
> booleans in your case, from your data, and use something like IBk or
> spectral clustering. I think I have another algorithm which will
> outperform spectral clustering, but I haven't done a test implementation
> yet.
>
> These approaches are nice because you get to choose your definition(s)
> of "similarity" and generate your dimensions using those definitions.
> Also, they work with non-boolean features.
>

Clustering assumes that the classes individual data points belong to are not
predefined and as of yet remain unknown. I think the original problem
specification is not that well defined since it might not be necessary to follow
such a complex solution and one might get away with using a standard stemming
algorithm (such as Porter's) and IDF weighted vector similarity based
on the cosine
or Jaccard coefficient.

Regardless, clustering is definitely a superior solution than
traditional string-manipulation
approaches yet one has to be cautious on what to actually look for
when trying to
interpret the results.

Spiros

> S.
>
> --
> Shevek
> "I have never killed a man, but I have read many obituaries with great
> pleasure." - Clarence Darrow
>
> "I didn't attend the funeral, but I sent a nice letter saying I approved
> of it." - Mark Twain
>
>