Schwartzian transform

Abigail abigail at abigail.be
Wed Aug 13 14:43:11 BST 2014


On Wed, Aug 13, 2014 at 01:45:24PM +0100, Dermot wrote:
> On 13 August 2014 13:24, Abigail <abigail at abigail.be> wrote:
> 
> > On Wed, Aug 13, 2014 at 09:45:59PM +1000, Damian Conway wrote:
> > > Mark and Aaron are entirely correct that the ST is not required for your
> > > example, but if your actual application uses a significantly larger dataset
> > > than just four hash entries, then the ST may still be preferable...as array
> > > lookups are around twice as fast as hash lookups. That might be a significant
> > > performance enhancement if you have to do O(N log N) key look-ups for
> > > a sufficiently large value of N.
> > >
> > > As always, only benchmarking on real(istic) data can determine whether
> > > you will actually benefit from using the ST or not.
> >
> >
> > But if your dataset is so large that you will benefit from halving
> > the lookup time, you can save twice as much time by not having
> > any lookup at all -- use the Guttman-Rosler Transform.
> >
> >
> Does anyone want to venture a guess - highly unscientific I know, but I am,
> after all, an instinctive guy - at what the sweet spot for N would be? The
> entire data set is about 500,000 records, but I doubt that my data structure
> would stretch to more than 90 records, as these are user-created selections.
> If N is < 90, I'll opt for 'for'.
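
For readers who haven't met the two transforms side by side, here is a rough
sketch of both, sorting hash references on a numeric field. The records and
the 'price' field are made up purely for illustration:

  use strict;
  use warnings;

  # Illustrative records only: each has a numeric 'price' field.
  my @records = map { +{ name => "item$_", price => int rand 1000 } } 1 .. 10;

  # Schwartzian Transform: decorate with array refs, sort with a
  # comparator, undecorate.
  my @by_st = map  { $_->[1] }
              sort { $a->[0] <=> $b->[0] }
              map  { [ $_->{price}, $_ ] } @records;

  # Guttman-Rosler Transform: pack the key (plus an index back into
  # @records) into a fixed-width string, let the default string sort
  # do the work with no comparator and no lookups, then unpack.
  # Assumes non-negative integer keys that fit in 32 bits.
  my @by_grt = map  { $records[ (unpack "NN", $_)[1] ] }
               sort
               map  { pack "NN", $records[$_]{price}, $_ } 0 .. $#records;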


90 is a very low number. If your dataset is 90 records or fewer, it's
unlikely to matter whether you wrap your sort in a transform or not --
unless you're running on some hardware from the early 70s.


But where the flip-over point lies depends on many, many factors,
and even a ballpark figure can only be obtained by testing with
real data, in the *same environment* where your production code
will run. Just a handful of the factors that matter:

  - Perl version
  - Hash function used
  - Compiler/compiler options
  - CPU architecture
  - Memory (amount of free memory, speed of memory/caches)
  - Total size of program
  - Sort function used by Perl (mergesort, quicksort)
  - How sorted the data is to begin with (matters a lot for mergesort,
    less for quicksort -- although if it comes from a hash, it's
    pretty random, and, in the most recent versions of Perl, will be
    different from run to run).
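
(On that last point: a quick, purely illustrative way to see the effect is to
run something like this a few times -- on Perl 5.18 and later, each run will
usually print the keys in a different order.)

  use strict;
  use warnings;
  use feature 'say';

  my %h;
  $h{$_} = 1 for 'a' .. 'j';
  say join ',', keys %h;   # key order changes from run to run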

If you want to know, *measure*. And all that measurement gives you, at
best, is a ballpark figure. Do remember that performing proper measurements
is a third art, a third science, a third engineering, and a third magic.
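
If you do want to measure, a sketch along these lines, using the core
Benchmark module, will at least give relative numbers on your own machine.
The records below are synthetic stand-ins; only your real data, in your
real environment, produces figures worth acting on:

  use strict;
  use warnings;
  use Benchmark qw(cmpthese);

  # Synthetic stand-in data; swap in (a sample of) the real records.
  my @records = map { +{ name => "item$_", price => int rand 1_000 } }
                1 .. 10_000;

  # Run each variant for at least 3 CPU seconds and print a comparison.
  cmpthese(-3, {
      plain => sub {
          my @s = sort { $a->{price} <=> $b->{price} } @records;
      },
      st    => sub {
          my @s = map  { $_->[1] }
                  sort { $a->[0] <=> $b->[0] }
                  map  { [ $_->{price}, $_ ] } @records;
      },
      grt   => sub {
          my @s = map  { $records[ (unpack "NN", $_)[1] ] }
                  sort
                  map  { pack "NN", $records[$_]{price}, $_ } 0 .. $#records;
      },
  });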


And that's not even factoring in that the time spent sorting (regardless
of how you sort) may be dwarfed by whatever else the program is doing.


Abigail

