[REVIEW] Programming Collective Intelligence

Simon Wistow simon at thegestalt.org
Thu Dec 6 08:29:39 GMT 2007

Author: Toby Segaran
ISBN: 0-596-52932-5
Publisher: O'Reilly Media

The field of data mining is a tricky one to write about. For a start, 
what you're mining depends on the nature of your business and the shape 
of the data - there is no one-size-fits-all technique, no off-the-shelf, 
drag-and-drop solution.

Secondly, some of the techniques require some pretty tricksy maths, and 
even if you do understand them, once they're applied you still have 
to interpret the results and tweak the multitude of input variables. 
Building a data mining tool - from a search engine to a collaborative 
filter to a genetic algorithm - is as much an art as a science or 
engineering problem.

So all that said, you should buy this book.

Reading it will help you understand why I just said all that. But it 
will also put a bunch more techniques in your mental toolbox so 
that when you're looking at a problem you can think "Ooooh! I 
remember reading about some problem like that" and then you can go 
pick up the book again and use it as a reference manual rather than 
reading it from cover to cover.

And there's a goodly number of techniques to pick up - there are 
chapters on collaborative filtering and recommendation systems, 
clustering and group discovery, search and ranking techniques, 
document filtering, Bayesian classification, kernel methods and 
support-vector machines, and genetic algorithms, amongst others.

Each chapter gives an overview of the problem domain, gives an example 
problem and then walks the reader through a simple solution. The 
problems with the solution are then highlighted and various enhancements 
are shown.

The techniques are demonstrated in Python, but the examples are all 
clear, understandable and perfectly legible to any competent programmer, 
especially a scripting language programmer. Just enough detail is 
covered to give you a solid grounding without getting you bogged down.
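
For a flavour of the kind of thing involved, here's a minimal sketch of 
a user-based collaborative filter. It is not lifted from the book: the 
ratings data, the function names and the choice of a Euclidean-distance 
similarity are all made up for illustration, and a real system would 
need rather more care than this.

```python
from math import sqrt

# Made-up ratings: user -> {item: score}. Purely illustrative data.
ratings = {
    "alice": {"perl": 5.0, "python": 3.0, "ruby": 4.0},
    "bob": {"perl": 4.5, "python": 3.5, "ruby": 4.0, "haskell": 5.0},
    "carol": {"perl": 1.0, "python": 5.0, "haskell": 2.0},
}


def similarity(prefs, a, b):
    """Euclidean-distance similarity in (0, 1]; 0.0 if no shared items."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    if not shared:
        return 0.0
    dist = sqrt(sum((prefs[a][i] - prefs[b][i]) ** 2 for i in shared))
    return 1.0 / (1.0 + dist)


def recommend(prefs, user):
    """Rank items the user hasn't rated by similarity-weighted mean score."""
    totals, weights = {}, {}
    for other in prefs:
        if other == user:
            continue
        sim = similarity(prefs, user, other)
        if sim <= 0:
            continue
        for item, score in prefs[other].items():
            if item not in prefs[user]:
                # Scores from more similar users count for more.
                totals[item] = totals.get(item, 0.0) + score * sim
                weights[item] = weights.get(item, 0.0) + sim
    ranked = [(totals[i] / weights[i], i) for i in totals]
    return sorted(ranked, reverse=True)


if __name__ == "__main__":
    print(recommend(ratings, "alice"))
```

Here alice hasn't rated "haskell", so it's the only candidate, and its 
predicted score leans towards bob's rating because bob's tastes are 
closer to hers than carol's are.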

In summary - this is well worth your 20 quid, even more so if you can 
get your company to pay for it. If you're working with existing data 
this may spark off an inspiration that will let you add some new 
features or up your accuracy. Or if you're presented with a problem this 
book may give you techniques that will help you solve it without having 
to work everything out from first principles. It's a well-written manual 
that'll handily expand your repertoire.
