Cpanratings etc - Re: Devel::CheckLib: Please try to break our code!

Mon Oct 22 17:45:39 BST 2007

On Oct 20, 2007, at 3:52 AM, Andy Armstrong wrote:

> More specifically why does search.cpan.org have such a hard time  
> with one word module names? Try finding CGI.pm - or as I did this  
> week in response to Lyle's module - FCGI.pm. You'd think that  
> typing FCGI into the search box would do the trick, right? Nope.  
> FCGI.pm? Nope. FastCGI? Nope. It's effectively completely broken  
> for that case.

Since I can't snoop the code that search.cpan.org uses, I can't go
and fix this as I would like to.  However, in principle, biasing
towards short module names certainly ought to be possible, and I
agree that it's desirable.  I experienced the same frustration a
while back searching for the exact same module.

Length normalization is a standard component of the classic TF/IDF
weighting model, and once you have a document corpus of significant
size, tuning your engine to account for field length pays off.
Biasing towards fewer tokens is usually desirable for fields like
"title", but undesirable for "body" and the like, because then you
end up with insignificant short docs rising to the top.  I've pasted
an explanation below my sig, taken from the documentation for
KSx::Search::LongFieldSim.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

DESCRIPTION

        KinoSearch's default Similarity implementation produces a bias
        towards extremely short fields.

            KinoSearch::Search::Similarity

            | more weight
            | *
            |  **
            |    ***
            |       **********
            |                 ********************
            |                                      *******************************
            | less weight                                                         ****
            |------------------------------------------------------------------------
              fewer tokens                                              more tokens

        LongFieldSim eliminates this bias.

            KSx::Search::LongFieldSim

            | more weight
            |
            |
            |
            |*****************
            |                 ********************
            |                                      *******************************
            | less weight                                                         ****
            |------------------------------------------------------------------------
              fewer tokens                                              more tokens

        In most cases, the default bias towards short fields is
        desirable.  For instance, say you have two documents:

        o   "George Washington"

        o   "George Washington Carver"

        If a user searches for "george washington", we want the exact
        title match to appear first.  Under the default Similarity
        implementation it will, because the "Carver" in "George
        Washington Carver" dilutes the impact of the other two tokens.

        However, under LongFieldSim, the two titles will yield equal
        scores.  That would be bad in this particular case, but it
        could be good in another.
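
The title example can be sketched numerically.  This is only an
illustration, not KinoSearch code: it assumes the default length
normalization is the Lucene-style 1/sqrt(num_tokens), and it models the
flat left-hand part of the LongFieldSim curve with a hypothetical floor
of 100 tokens.  The function names and the floor value are assumptions
made for this sketch.

```python
import math


def default_length_norm(num_tokens: int) -> float:
    # Assumed default: weight falls off as 1/sqrt(field length),
    # so shorter fields always get a larger multiplier.
    return 1.0 / math.sqrt(num_tokens)


def long_field_length_norm(num_tokens: int, floor: int = 100) -> float:
    # Hypothetical variant: any field shorter than `floor` tokens is
    # treated as if it were `floor` tokens long, which flattens the
    # left-hand side of the curve.
    return 1.0 / math.sqrt(max(num_tokens, floor))


titles = ["George Washington", "George Washington Carver"]
for title in titles:
    n = len(title.split())
    print(
        f"{title!r}: default={default_length_norm(n):.3f} "
        f"long_field={long_field_length_norm(n):.3f}"
    )
```

Under these assumptions the default norm is larger for the two-token
title, so the exact match wins, while the floored variant gives both
titles the same multiplier.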

             "George Washington Carver is cool."

             "George Washington Carver was born on the eve of the US
             Civil War, in 1864.  His exact date of birth is unknown...
             Carver's research in crop rotation revolutionized
             agriculture..."

        The first document is succinct, but useless.  Unfortunately,
        the default similarity will assess it as extremely relevant to
        a query of "george washington carver".  However, under
        LongFieldSim, the short-field bias is eliminated, and the
        additional mentions of Carver's name in the second document
        yield a higher score and a higher rank.
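
To make the body-text case concrete, here is a toy scorer.  It is a
simplified sketch rather than KinoSearch's actual formula: it assumes
each query term contributes sqrt(term frequency), ignores IDF entirely,
and multiplies the sum by a length norm.  The tokenizer, the `score`
function, and the 100-token floor are all hypothetical, and the second
document is just the excerpt quoted above, ellipses included.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    # Crude tokenizer for illustration: lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def score(query: str, doc: str, floor: int = 1) -> float:
    # Toy score: sum of sqrt(tf) over query terms, times a length norm.
    # floor=1 gives the plain 1/sqrt(length) norm; a larger floor
    # (hypothetical) removes the bias towards short fields.
    tokens = tokenize(doc)
    norm = 1.0 / math.sqrt(max(len(tokens), floor))
    return norm * sum(math.sqrt(tokens.count(t)) for t in tokenize(query))


short_doc = "George Washington Carver is cool."
long_doc = (
    "George Washington Carver was born on the eve of the US Civil War, "
    "in 1864.  His exact date of birth is unknown... Carver's research "
    "in crop rotation revolutionized agriculture..."
)
query = "george washington carver"

for label, floor in (("default", 1), ("long-field", 100)):
    print(
        f"{label}: short={score(query, short_doc, floor):.3f} "
        f"long={score(query, long_doc, floor):.3f}"
    )
```

With the plain norm the five-token document scores higher.  With the
floor in place both documents share the same norm, so the second
mention of "Carver" in the longer text is what decides the ranking.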