Bayesian Classification of CPAN Module Failures (Re: Module dependencies and test results)

Andy Wardley abw at wardley.org
Sun Aug 5 12:29:39 BST 2007


Marvin Humphrey wrote:
> Any math wizards on this list that can come up with a formula expressing 
> the likelihood that an attempted installation will fail, based on how 
> numerous and dubious the dependencies are?

I'm no math wizard, but I have written a document classifier or two in my 
time.  It looks like this is essentially the same kind of problem.

For combining probabilities, the Reverend Bayes[1] is your man:

   P(x|y) = P(y|x) P(x)
            -----------
               P(y)

It says "The probability of x given y, is the probability of y given x,
multiplied by the probability of x, divided by the probability of y".
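To make that concrete, here's a tiny numeric check of the theorem (the numbers are made up for illustration, not taken from CPAN):

```python
# Suppose 20% of distributions fail overall, 60% of failing
# distributions depend on some module D, and 30% of all
# distributions depend on D.  (Invented numbers.)
p_fail = 0.20             # P(x): prior probability of failure
p_dep_given_fail = 0.60   # P(y|x)
p_dep = 0.30              # P(y)

# Bayes: P(x|y) = P(y|x) * P(x) / P(y)
p_fail_given_dep = p_dep_given_fail * p_fail / p_dep
print(p_fail_given_dep)   # 0.4 -- depending on D doubles the failure risk
```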

Paul Graham famously applied this to the problem of spam detection[2]. 
Something like this:

   P(spam|words) = P(words|spam) P(spam)
                   ---------------------
                         P(words)

Or in other words, "The probability of a message being spam given the list of 
words it contains, is the combined probability of each of those words 
appearing in any spam message, multiplied by the probability of any message 
being spam, divided by the combined probability of those words appearing in 
any document."

The nice thing is that you don't have to think just in terms of binary 
spam/ham, but can have as many different categories as you like.  Thus, it's a 
good basis for a simple word-based document classification system.  But I digress.

To apply it to the problem at hand, which _is_ a binary pass/fail classification:

   P(fail|deps) = P(deps|fail) P(fail)
                  --------------------
                         P(deps)

So the probability of a distribution failing to install given a list of 
modules it depends on, is the combined probability of each of those dependent 
modules being dependencies in distributions that fail, multiplied by the 
probability of any distribution failing to install, divided by the combined 
probability of those modules being dependencies in any distribution.

P(deps|fail) is the product of P(dep|fail) for dep over deps — assuming, as 
naive Bayes does, that the dependencies occur independently of each other. i.e.

   P(deps|fail) = P(dep1|fail) * P(dep2|fail) * P(dep3|fail) ... P(depn|fail)

P(dep|fail) is the probability of a module appearing as a dependency in a 
failing distribution (counted over all failures of all modules), i.e.

   P(dep|fail) = total_number_of_failures_for_module_dep
                 ----------------------------------------
                 total_number_of_failures_for_all_modules

P(deps) is the product of P(dep) for dep over deps, and P(dep) is the 
probability of a module appearing as a dependency.

   P(deps) = P(dep1) * P(dep2) * P(dep3) ... P(depn)

   P(dep)  = total_number_of_times_module_dep_is_a_dependency
             ------------------------------------------------
               total_number_of_dependencies_for_all_modules

You can pre-compute P(fail), P(dep) and P(dep|fail) from your a priori corpus 
(i.e. CPAN as it stands now).  Then to compute the probability of failure for 
any set of dependencies, you look up the pre-computed P(dep) and P(dep|fail) 
for each dep, take the products, and glue the totals into the main equation.
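The whole recipe can be sketched in a few lines of Python.  The corpus below is invented purely for illustration (five pretend distributions with pass/fail results), so the numbers mean nothing — it just shows the plumbing:

```python
from collections import Counter

# (set_of_dependencies, installed_ok) per distribution -- made-up data.
corpus = [
    ({"A", "B"}, False),
    ({"A", "C"}, True),
    ({"B"},      False),
    ({"C"},      True),
    ({"A"},      True),
]

failures = [deps for deps, ok in corpus if not ok]
p_fail = len(failures) / len(corpus)                    # P(fail)

dep_count      = Counter(d for deps, _ in corpus for d in deps)
dep_fail_count = Counter(d for deps in failures for d in deps)
total_deps      = sum(dep_count.values())
total_fail_deps = sum(dep_fail_count.values())

def p_dep(d):
    """P(dep): how often d turns up as a dependency anywhere."""
    return dep_count[d] / total_deps

def p_dep_fail(d):
    """P(dep|fail): how often d turns up as a dependency in failures."""
    return dep_fail_count[d] / total_fail_deps

def p_fail_given(deps):
    """P(fail|deps) = P(deps|fail) P(fail) / P(deps), deps independent."""
    p = p_fail
    for d in deps:
        p *= p_dep_fail(d) / p_dep(d)
    return p

print(p_fail_given({"B"}))   # high: B only appears in failing dists
print(p_fail_given({"C"}))   # low:  C only appears in passing dists
```

Note that because the independence assumption is only an approximation, the raw score isn't a calibrated probability — it can even nudge past 1 — but it ranks risky dependency sets against safe ones perfectly well.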

[ NOTE: there's more you can do to optimise classification performance,
   particularly when dealing with large documents.  e.g. storing the log
   probability of P(dep|fail)/P(dep) so you can run a query like:

     SELECT EXP(SUM(log_prob)) FROM dep_probs WHERE dep IN (@deps)

   But that's probably outside the scope of this simple case where the number
   of dependencies is small (say, a few dozen at most, compared to a document
   which might have many thousands of distinct words in it) ]
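For the curious, here's what that log-probability trick looks like outside SQL.  The module names and numbers below are invented stand-ins for the pre-computed values:

```python
import math

# Hypothetical precomputed log(P(dep|fail) / P(dep)) per module.
log_ratio = {
    "DBI":         math.log(0.8 / 0.5),
    "XML::Parser": math.log(0.4 / 0.3),
}
p_fail = 0.25   # prior, also precomputed (invented)

def score(deps):
    # Sum the logs, exponentiate once at the end.
    return p_fail * math.exp(sum(log_ratio[d] for d in deps))

# Identical to multiplying the ratios directly, but a sum of logs
# stays numerically stable when there are many small terms.
direct = p_fail * (0.8 / 0.5) * (0.4 / 0.3)
assert abs(score(["DBI", "XML::Parser"]) - direct) < 1e-12
```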

I still find it all rather mind-bogglingly clever.  But like I say, I'm no 
math wiz, so take everything I say with a pinch of salt.

A

[1] http://en.wikipedia.org/wiki/Bayes'_theorem
[2] http://www.paulgraham.com/spam.html
