# Bayesian Classification of CPAN Module Failures (Re: Module dependencies and test results)

Andy Wardley abw at wardley.org
Mon Aug 6 09:25:44 BST 2007

```
David Cantrell wrote:
> The data I'm spitting out are insufficient for calculating this.  If
> module A depends on B and C, both of which depend on D, then D appears
> as a dependency twice, but I only list it once.

Hmm... I'm not sure that it matters.  If we were classifying documents then we
would need to be counting word frequency to determine how often "D appears in
A".  By analogy, the more times "Viagra" appears in a message, the more likely
it is to be spam.

But in this case, I think the number of different ways in which D is a
dependency of A is immaterial.  D only needs to be a dependency once for it to
cause a failure.  Adding more or fewer dependencies from A to D won't make A
any more or less broken (where n >= 1).  One is enough.

> Also I'm not convinced
> that the rest of the sums are right, given that if D fails when you try
> to install it as a dependency of B, then the probability of it failing
> when you try to install it as a dependency of C is 1 - these are not
> independent failures.

Remember that we're dealing with probabilities here, not inductive logic.

So rather than looking at what happens when *you* install modules A, B, C and
D, we're looking at what happens when *lots* of different people install these
modules on different systems.

Your failure with module D gives *you* a probability for failure of 1 in that
very small sample set (1 failure from 1 test).  But when you consider the
other 99 people who installed module D without a hitch it becomes clear that
the *overall* probability for failure is 1/100.

A

```
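The two points in the reply can be put into a small Python sketch. It is only an illustration under stated assumptions: the `DEPENDS_ON` graph, the helper names `all_dependencies` and `failure_rate`, and the list of test reports are all hypothetical, built from the A/B/C/D example and the "1 failure, 99 passes" figures in the message, not taken from any real CPAN testers data.

```python
from fractions import Fraction

# Hypothetical graph from the thread: A needs B and C, and both need D.
DEPENDS_ON = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}


def all_dependencies(module, graph):
    """Return the set of direct and indirect dependencies of `module`.

    A set records presence only, so D is counted once no matter how
    many paths lead to it.
    """
    seen = set()
    stack = list(graph[module])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph[dep])
    return seen


def failure_rate(reports):
    """Pooled failure probability: failed reports / all reports."""
    failures = sum(1 for passed in reports if not passed)
    return Fraction(failures, len(reports))


deps_of_a = all_dependencies("A", DEPENDS_ON)

# Made-up reports for D: one tester fails (False), 99 others pass (True).
d_reports = [False] + [True] * 99
one_tester = failure_rate(d_reports[:1])  # the single failing tester alone
everyone = failure_rate(d_reports)        # all 100 testers pooled

print(sorted(deps_of_a), one_tester, everyone)
```

Using a set for the dependencies matches the argument that D only has to appear once to matter, and pooling the reports shows how a single tester's rate of 1 becomes 1/100 across everyone who tried the module.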