[OT] benchmarking "typical" programs

Rafiq Gemmail rafiq at dreamthought.com
Thu Sep 20 00:28:20 BST 2012


On 19 Sep 2012, at 12:09, Nicholas Clark wrote:

> Needs to do realistic things on a big enough scale to stress a typical system.
> Needs to avoid external library dependencies, or particular system specifics.
> Preferably needs to avoid being too Perl version specific.
> Preferably needs to avoid being a maintenance headache itself.

On a previous web-servicey $work project, I had some positive experiences using Splunk (splunk.com) to extract real malperformant/normal sample data to throw at benchmarking, profiling and load-testing code. :-)  Splunk also let me use real live behaviour as a gauge in itself and deeply analyse the performance of classes of production request over time.  It also has a query API on CPAN.
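For what it's worth, pulling the sample requests out is scriptable too. The rough sketch below goes straight at Splunk's REST search interface from Perl rather than via the CPAN module; the host, port, credentials, index and field names are all made-up placeholders, not anything from the project above.

    use strict;
    use warnings;
    use LWP::UserAgent;
    use MIME::Base64 qw(encode_base64);

    # Hedged sketch: stream request timings out of Splunk via its REST
    # "export" search endpoint (default management port 8089).
    my $ua = LWP::UserAgent->new( ssl_opts => { verify_hostname => 0 } );

    my $res = $ua->post(
        'https://splunk.example.com:8089/services/search/jobs/export',
        Authorization => 'Basic ' . encode_base64( 'admin:changeme', '' ),
        Content       => {
            # Field names (uri, response_time) are assumed extractions.
            search      => 'search index=web sourcetype=access_combined'
                         . ' | table _time, uri, response_time',
            output_mode => 'csv',
        },
    );
    die 'Splunk search failed: ' . $res->status_line unless $res->is_success;

    # Each CSV row is a real production request to feed the sampling
    # and benchmark scripts further down.
    open my $fh, '>', 'sampled_requests.csv' or die $!;
    print {$fh} $res->decoded_content;
    close $fh;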

With regard to the 'controlled tests', benchmarking used the Benchmark module (and JMeter), profiling was NYTProf and load testing was JMeter (plus ab, whatever).  The tools did not matter so much as the sample data, and the fact that I was able to compare runs against a fairly consistent architecture, datasets and execution paths (a function of the test data).
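As a rough illustration of what the Benchmark side looked like, here's a minimal driver; the handler subs and the request loader are invented stand-ins for the real application code, and the NYTProf invocation in the trailing comment is just the standard one.

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Invented stand-ins for the real application code and the replay
    # sample extracted from production logs.
    sub load_sample_requests { return map { "/search?q=item$_" } 1 .. 500 }
    sub handle_request_v1 { my ($uri) = @_; my $x = 0; $x += length $uri for 1 .. 200; return $x }
    sub handle_request_v2 { my ($uri) = @_; return 200 * length $uri }

    my @requests = load_sample_requests();

    # Compare both implementations over the same realistic request mix.
    cmpthese( -5, {    # negative count = run each for at least 5 CPU seconds
        old_handler => sub { handle_request_v1($_) for @requests },
        new_handler => sub { handle_request_v2($_) for @requests },
    } );

    # Profiling the same driver under Devel::NYTProf is just:
    #   perl -d:NYTProf bench_driver.pl && nytprofhtml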

Splunk is a really great tool for mining, joining and performing statistical time-series analysis on loosely structured data (different daemon logs, network activity, custom instrumented output).  It's not free, but the free licence gives you a reasonable daily volume of analytics to play with.  You could probably knock up your own scripts to extract sample input data, but I found it quite painless and powerful (with great visualisation tools too).  With respect to the data, my criterion was to pick the worst outliers (in terms of response time/occurrence) and increase their incidence, mixed in with sample requests spread around the mean response time and various other 'known requests'.
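To make that selection concrete, building such a replay mix in Perl might look something like the sketch below; the CSV layout matches the hypothetical export above, and the 3x/0.25x/5x weighting factors are arbitrary illustrations rather than anything prescriptive.

    use strict;
    use warnings;
    use List::Util qw(sum);

    # Read (uri, response_time) rows exported from the log analysis.
    my @samples;
    open my $fh, '<', 'sampled_requests.csv' or die $!;
    while (<$fh>) {
        chomp;
        s/"//g;                                   # strip CSV quoting
        my ( undef, $uri, $ms ) = split /,/;      # columns: _time, uri, response_time
        push @samples, { uri => $uri, ms => $ms }
            if defined $ms && $ms =~ /^\d+(\.\d+)?$/;    # also skips the header row
    }
    close $fh;
    die "no samples\n" unless @samples;

    my $mean = sum( map { $_->{ms} } @samples ) / @samples;

    # Over-weight the worst outliers, pad with requests near the mean.
    my @outliers = grep { $_->{ms} > 3 * $mean } @samples;
    my @typical  = grep { abs( $_->{ms} - $mean ) < 0.25 * $mean } @samples;
    my @replay_mix = ( ( map { ($_) x 5 } @outliers ), @typical );

    print scalar(@replay_mix), " requests in the replay mix\n";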

This let me hammer the application with likely runtime scenarios taken from real user behaviour (rather than fabricated, contrived imaginings that only exercise the code you're already thinking about).  "Real" runtime scenarios would obviously differ because of caching, load balancing and all the usual production gaff, but it did give me metrics to _compare between releases_, so that one could see whether certain (or all) classes of query had improved or degraded under these control conditions.
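A minimal sketch of that between-releases comparison, assuming each benchmark run dumps a "class,mean_ms" CSV of per-request-class timings (the file names and the 10% threshold are invented for illustration):

    use strict;
    use warnings;

    # Load "class,mean_ms" rows produced by one benchmark run (format assumed).
    sub load_run {
        my ($file) = @_;
        my %mean;
        open my $fh, '<', $file or die "$file: $!";
        while (<$fh>) {
            chomp;
            my ( $class, $ms ) = split /,/;
            $mean{$class} = $ms if defined $ms && $ms =~ /^\d+(\.\d+)?$/;
        }
        close $fh;
        return \%mean;
    }

    my $before = load_run('run_before.csv');    # previous release
    my $after  = load_run('run_after.csv');     # candidate release

    # Flag any class of request that regressed by more than 10% (arbitrary).
    for my $class ( sort keys %$before ) {
        next unless exists $after->{$class};
        my $delta = ( $after->{$class} - $before->{$class} ) / $before->{$class} * 100;
        printf "%-30s %+6.1f%%%s\n", $class, $delta, $delta > 10 ? '  <-- regression' : '';
    }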

As stated, I was also able to use Splunk to watch trends in different classes of request over time.  That let me proactively home in on and investigate areas which had degraded in performance, or had always been generally rubbish.  One could even fire alerts off these analytics and be warned during exceptionally ill-performing periods.

This is just my experience, and it's not a "hammer to death and load-test every potential execution path" approach, but I think that matters less than ensuring you haven't regressed in your ability to cover likely runtime scenarios.

Not sure if that helps.

Splunk Fanboy

