[OT] benchmarking "typical" programs

Fri Sep 21 10:52:09 BST 2012

On 21/09/2012, at 19:22, Nicholas Clark <nick at ccl4.org> wrote:

> On Fri, Sep 21, 2012 at 08:56:34AM +0100, Simon Wistow wrote:
>> On Thu, Sep 20, 2012 at 12:35:18PM +0100, Nicholas Clark said:
>>> Lots of "one trick pony" type benchmarks exist, but very few that actually
>>> try to look like they are doing typical things typical programs do, at the
>>> typical scales real programs work out, so
>> 
>> As a search engineer (recovering) I'm inclined to say - get a corpus of 
>> docs, build an inverted index out of it and then do some searches. This 
>> will test:
>> 
>> 1) File/IO Performance (Reading in the corpus)
>> 2) Text manipulation (Tokenizing, Stop word removal, Stemming)
>> 3) Data structure performance (Building the index)
>> 4) Maths Calculation (performing TF/IDF searches)
>> 
>> All in pretty good, discrete steps. Plus by tweaking the size of the 
>> corpus you can stress memory as well.
> 
> Thanks, this is a useful suggestion, but...
> 
> I'm not a search engineer (recovering or otherwise), so this represents
> rather more work than I wanted to do. In that I first have to learn enough
> of how to *be* a search engineer to figure out how to write the above code
> to do something useful, and *then* how to turn such code into a reasonably
> performant production version, and then to turn working code into something
> sufficiently stand-alone to be a benchmark.
> 
> I don't want to be spending my time figuring out the right way to do all the
> above algorithms in Perl. I want to get as fast as possible to the point of
> figuring out how the perl interpreter (mis)behaves when presented with
> extant decent code to do the above.
> 
> Unless there's a CPAN-in-a-box for doing most of the four steps.
> (which doesn't depend on external C libraries. That was one of my
> "preferably" criteria)
> 
> So, next question - if I wanted to be as lazy as possible and write a search
> engine (as described above) using as much of CPAN as possible, which modules
> are recommended? :-)
> 

I think you want Plucene. But please let someone else correct me if I'm wrong. 
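
For scale, the four steps quoted above fit in a few dozen lines. What follows is only a rough sketch, written in Python for brevity rather than in Perl, and not taken from Plucene or any other module. The corpus, the stop word list, and the helper names (stem, tokenize, build_index, search) are all made up for illustration, and the "stemmer" is a crude suffix stripper, not a real algorithm such as Porter's. A real benchmark would read a large corpus from disk instead of a hard-coded dict.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stand-in corpus; step 1 (file I/O) would read these from disk.
CORPUS = {
    "doc1": "The cat sat on the mat",
    "doc2": "Dogs and cats are chasing the ball",
    "doc3": "The stock market crashed today",
}

# Illustrative stop words only, not a standard list.
STOP_WORDS = {"the", "on", "and", "are", "a", "of"}


def stem(word):
    """Crude suffix stripping (assumption: stands in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Step 2: lowercase, split into words, drop stop words, stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(corpus):
    """Step 3: inverted index mapping term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, n_docs, query):
    """Step 4: rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = build_index(CORPUS)
results = search(index, len(CORPUS), "cats chasing")
```

Here results ranks doc2 ahead of doc1, since doc2 matches both query terms. Growing the corpus is what turns this from a toy into something that stresses memory and the hash implementation.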

> Nicholas Clark