Cpanratings etc - Re: Devel::CheckLib: Please try to break our code!
Marvin Humphrey
marvin at rectangular.com
Mon Oct 22 17:45:39 BST 2007
On Oct 20, 2007, at 3:52 AM, Andy Armstrong wrote:
> More specifically why does search.cpan.org have such a hard time
> with one word module names? Try finding CGI.pm - or as I did this
> week in response to Lyle's module - FCGI.pm. You'd think that
> typing FCGI into the search box would do the trick, right? Nope.
> FCGI.pm? Nope. FastCGI? Nope. It's effectively completely broken
> for that case.
Since I can't snoop the code that search.cpan.org uses, I can't go
and fix this as I would like to. However, in principle, biasing
towards short module names certainly ought to be possible, and I
agree that it's desirable. I experienced the same frustration a
while back searching for the exact same module.
Length normalization is a standard component of the classic TF/IDF
weighting model, and once you have a document corpus of significant
size, tuning your engine to account for field length yields
significant dividends. Biasing towards fewer tokens is usually
desirable for fields like "title", but undesirable for "body" and the
like, because then you end up with insignificant short docs rising to
the top. I've pasted an explanation below my sig, taken from the
documentation for KSx::Search::LongFieldSim.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
DESCRIPTION
KinoSearch's default Similarity implementation produces a bias towards
extremely short fields.
KinoSearch::Search::Similarity

| more weight
| *
|  **
|    ***
|       **********
|                 ********************
|                                     *******************************
| less weight                                                        ****
|------------------------------------------------------------------------
  fewer tokens                                                more tokens
LongFieldSim eliminates this bias.
KSx::Search::LongFieldSim

| more weight
|
|
|
|*****************
|                 ********************
|                                     *******************************
| less weight                                                        ****
|------------------------------------------------------------------------
  fewer tokens                                                more tokens
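
The shape of the two curves above can be sketched numerically. The
following is a minimal Python illustration, not KinoSearch's API. It
assumes a Lucene-style length norm of 1/sqrt(num_tokens) for the default
curve, and models the flattened curve by clamping the token count to a
floor. The function names and the floor of 100 tokens are assumptions
chosen for illustration.

```python
import math


def default_length_norm(num_tokens: int) -> float:
    """Assumed default norm: weight falls off as 1/sqrt(field length)."""
    return 1.0 / math.sqrt(max(num_tokens, 1))


def flattened_length_norm(num_tokens: int, floor: int = 100) -> float:
    """Hypothetical flattened norm: any field shorter than `floor` tokens
    is weighted as if it were exactly `floor` tokens long."""
    return 1.0 / math.sqrt(max(num_tokens, floor))


if __name__ == "__main__":
    # Print both norms side by side for a range of field lengths.
    for n in (1, 2, 3, 10, 100, 1000):
        print(f"{n:>5} tokens  default={default_length_norm(n):.3f}  "
              f"flattened={flattened_length_norm(n):.3f}")
```

Under these assumptions the default norm keeps rising as fields get
shorter, while the flattened norm is constant below the floor and matches
the default above it.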
In most cases, the default bias towards short fields is desirable. For
instance, say you have two documents:
o "George Washington"
o "George Washington Carver"
If a user searches for "george washington", we want the exact title
match to appear first. Under the default Similarity implementation it
will, because the "Carver" in "George Washington Carver" dilutes the
impact of the other two tokens.
However, under LongFieldSim, the two titles will yield equal scores.
That would be bad in this particular case, but it could be good in
another.
o "George Washington Carver is cool."
o "George Washington Carver was born on the eve of the US Civil War, in
  1864. His exact date of birth is unknown... Carver's research in crop
  rotation revolutionized agriculture..."
The first document is succinct, but useless. Unfortunately, the default
similarity will assess it as extremely relevant to a query of
"george washington carver". However, under LongFieldSim, the short-field
bias is eliminated, and the additional mentions of Carver's name in the
second document yield a higher score and a higher rank.
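
Both examples can be checked with a toy scorer. This is a rough Python
sketch under stated assumptions, not the KinoSearch implementation: the
score is the sum of sqrt(term frequency) over the query terms, multiplied
by a 1/sqrt(length) norm, with IDF left out. The `floor` parameter is a
hypothetical stand-in for LongFieldSim's behaviour (floor=1 plays the
role of the default, floor=100 the flattened variant), and the tokenizer
is a simple regex.

```python
import math
import re

SHORT_DOC = "George Washington Carver is cool."
LONG_DOC = ("George Washington Carver was born on the eve of the US "
            "Civil War, in 1864. His exact date of birth is unknown... "
            "Carver's research in crop rotation revolutionized "
            "agriculture...")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, doc: str, floor: int = 1) -> float:
    """Toy TF score with a length norm; IDF is deliberately omitted.

    floor=1 approximates the assumed default short-field bias;
    a larger floor flattens the norm for short fields.
    """
    tokens = tokenize(doc)
    norm = 1.0 / math.sqrt(max(len(tokens), floor))
    tf_sum = sum(math.sqrt(tokens.count(term)) for term in tokenize(query))
    return tf_sum * norm


if __name__ == "__main__":
    for floor in (1, 100):
        print(f"floor={floor}")
        for title in ("George Washington", "George Washington Carver"):
            print(f"  title {title!r}: "
                  f"{score('george washington', title, floor):.3f}")
        for name, body in (("short", SHORT_DOC), ("long", LONG_DOC)):
            print(f"  body ({name}): "
                  f"{score('george washington carver', body, floor):.3f}")
```

With floor=1 the exact title match and the short "is cool" document score
highest. With floor=100 the two titles tie, and the longer Carver document
overtakes the short one because its second mention of "Carver" is no
longer outweighed by the length penalty.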