Bayesian Classification of CPAN Module Failures (Re: Module dependencies and test results)
Andy Wardley
abw at wardley.org
Sun Aug 5 12:29:39 BST 2007
Marvin Humphrey wrote:
> Any math wizards on this list that can come up with a formula expressing
> the likelihood that an attempted installation will fail, based on how
> numerous and dubious the dependencies are?
I'm no math wizard, but I have written a document classifier or two in my
time. It looks like this is essentially the same kind of problem.
For combining probabilities, the Reverend Bayes[1] is your man:
    P(x|y) = P(y|x) P(x)
             -----------
                P(y)
It says "The probability of x given y, is the probability of y given x,
multiplied by the probability of x, divided by the probability of y".
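To see the machinery in action with some made-up numbers: if P(y|x) = 0.5,
P(x) = 0.2 and P(y) = 0.25, then:

    P(x|y) = 0.5 * 0.2 / 0.25 = 0.4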
Paul Graham famously applied this to the problem of spam detection[2].
Something like this:
    P(spam|words) = P(words|spam) P(spam)
                    ---------------------
                          P(words)
Or in other words, "The probability of a message being spam given the list of
words it contains, is the combined probability of each of those words
appearing in any spam message, multiplied by the probability of any message
being spam, divided by the combined probability of those words appearing in
any document."
The nice thing is that you don't have to think just in terms of binary
spam/ham, but can have as many different categories as you like. Thus, it's a
good basis for a simple word-based document classification system. But I digress.
To apply it to the problem at hand, which _is_ a binary pass/fail classification:
    P(fail|deps) = P(deps|fail) P(fail)
                   --------------------
                         P(deps)
So the probability of a distribution failing to install given a list of
modules it depends on, is the combined probability of each of those dependent
modules being dependencies in distributions that fail, multiplied by the
probability of any distribution failing to install, divided by the combined
probability of those modules being dependencies in any distribution.
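For example, with made-up numbers: suppose 10% of all distributions fail to
install (P(fail) = 0.1), a quarter of failing distributions depend on some
hypothetical module Acme::Flaky (P(dep|fail) = 0.25), and 5% of all
distributions depend on it (P(dep) = 0.05). Then for a distribution whose
only dependency is Acme::Flaky:

    P(fail|deps) = 0.25 * 0.1 / 0.05 = 0.5

i.e. we'd expect it to fail about half the time.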
P(deps|fail) is the product of P(dep|fail) for dep over deps, treating each
dependency as independent of the others (the usual "naive" Bayes
simplification), i.e.

    P(deps|fail) = P(dep1|fail) * P(dep2|fail) * P(dep3|fail) ... P(depn|fail)
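In code that's just a running product. A minimal sketch in Perl, with a
hypothetical %p_dep_fail hash standing in for values you'd estimate from
real failure reports:

    # P(deps|fail) under the naive independence assumption: multiply
    # the per-module probabilities together.  These numbers are made up.
    my %p_dep_fail = ( 'Acme::Flaky' => 0.25, 'Foo::Solid' => 0.01 );

    sub p_deps_given_fail {
        my $product = 1;
        # skip modules we have no data for
        $product *= $p_dep_fail{$_} for grep { exists $p_dep_fail{$_} } @_;
        return $product;
    }

    print p_deps_given_fail('Acme::Flaky', 'Foo::Solid');   # 0.0025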
P(dep|fail) is the probability that a failing distribution has dep among its
dependencies, taken over all failures of all modules, i.e.

    P(dep|fail) = number_of_failures_involving_module_dep
                  -----------------------------------------
                  total_number_of_failures_for_all_modules
P(deps) is the product of P(dep) for dep over deps, where P(dep) is the
probability of a module appearing as a dependency:

    P(deps) = P(dep1) * P(dep2) * P(dep3) ... P(depn)

    P(dep) = number_of_times_module_dep_is_a_dependency
             ---------------------------------------------
             total_number_of_dependencies_for_all_modules
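The counting is straightforward. A sketch, assuming a made-up @corpus of
records holding a dependency list and a pass/fail flag (real data would come
from something like CPAN Testers):

    # Tally dependency counts, overall and for failing distributions.
    my @corpus = (
        { deps => [ 'Acme::Flaky', 'Foo::Solid' ], failed => 1 },
        { deps => [ 'Foo::Solid' ],                failed => 0 },
        # ... many more records ...
    );

    my ( %dep_count, %dep_fail_count );
    my ( $total_deps, $total_dists, $total_fails ) = ( 0, 0, 0 );

    for my $dist (@corpus) {
        $total_dists++;
        $total_fails++ if $dist->{failed};
        for my $dep ( @{ $dist->{deps} } ) {
            $dep_count{$dep}++;                          # dep is a dependency
            $total_deps++;
            $dep_fail_count{$dep}++ if $dist->{failed};  # ...in a failure
        }
    }

    # Simple frequency estimates of P(fail), P(dep) and P(dep|fail)
    my $p_fail     = $total_fails / $total_dists;
    my %p_dep      = map { $_ => $dep_count{$_} / $total_deps } keys %dep_count;
    my %p_dep_fail = map { $_ => ( $dep_fail_count{$_} || 0 ) / $total_fails }
                     keys %dep_count;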
You can pre-compute P(fail), P(dep) and P(dep|fail) from your a priori corpus
(i.e. CPAN as it stands now). Then, to compute the probability of failure for
any set of dependencies, you take the product of P(dep) and the product of
P(dep|fail) over all the deps, using your pre-computed values, and glue the
totals into the main equation.
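Gluing it together, carrying on from the counting sketch above (all the names
are still hypothetical):

    # Estimate P(fail|deps) from the pre-computed values.  Unknown
    # modules are skipped here; a real classifier would probably want
    # some kind of smoothing instead.
    sub p_fail_given_deps {
        my @deps = grep { exists $p_dep{$_} } @_;
        my ( $p_deps_fail, $p_deps ) = ( 1, 1 );
        for my $dep (@deps) {
            $p_deps_fail *= $p_dep_fail{$dep};
            $p_deps      *= $p_dep{$dep};
        }
        return $p_deps_fail * $p_fail / $p_deps;
    }

    printf "P(fail|deps) = %.3f\n", p_fail_given_deps('Acme::Flaky');

One thing to watch: because the independence assumption is applied to the
numerator and denominator separately, the result isn't guaranteed to stay
below 1, so you may want to clamp it or just treat it as a relative score.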
[ NOTE: there's more you can do to optimise classification performance,
  particularly when dealing with large documents, e.g. storing the log
  probability of P(dep|fail)/P(dep) in a table so you can run a query like:

    SELECT EXP(SUM(log_prob)) FROM dep_prob WHERE dep IN (@deps)

  But that's probably outside the scope of this simple case where the number
  of dependencies is small (say, a few dozen at most, compared to a document
  which might have many thousands of distinct words in it) ]
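The same trick works in plain Perl, and also guards against floating-point
underflow when the products get long. A sketch, reusing the hypothetical
hashes from the earlier snippets:

    use List::Util qw( sum );

    # Work in log space: sum the pre-computed logs, exponentiate once.
    my %log_ratio = map  { $_ => log( $p_dep_fail{$_} / $p_dep{$_} ) }
                    grep { $p_dep_fail{$_} }     # avoid log(0)
                    keys %p_dep;

    sub p_fail_given_deps_log {
        my @deps = grep { exists $log_ratio{$_} } @_;
        return $p_fail * exp( sum( 0, @log_ratio{@deps} ) );
    }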
I still find it all rather mind-bogglingly clever. But like I say, I'm no math
wizard, so take everything I say with a pinch of salt.
A
[1] http://en.wikipedia.org/wiki/Bayes'_theorem
[2] http://www.paulgraham.com/spam.html