Schwartzian transform

Abigail abigail at abigail.be
Wed Aug 13 14:43:11 BST 2014


On Wed, Aug 13, 2014 at 01:45:24PM +0100, Dermot wrote:
> On 13 August 2014 13:24, Abigail <abigail at abigail.be> wrote:
> 
> > On Wed, Aug 13, 2014 at 09:45:59PM +1000, Damian Conway wrote:
> > > Mark and Aaron are entirely correct that the ST is not required for your
> > > example, but if your actual application uses a significantly larger dataset
> > > than just four hash entries, then the ST may still be preferable...as array
> > > lookups are around twice as fast as hash lookups. That might be a significant
> > > performance enhancement if you have to do O(N log N) key look-ups for
> > > a sufficiently large value of N.
> > >
> > > As always, only benchmarking on real(istic) data can determine whether
> > > you will actually benefit from using the ST or not.
> >
> >
> > But if your dataset is so large that you will benefit from halving
> > the lookup time, you can save twice as much time by not having
> > any lookup at all -- use the Guttman-Rosler Transform.
> >
> >
> Does anyone want to venture a guess - highly unscientific I know, but I am,
> after all, an instinctive guy - at what the sweet spot for N would be? The
> entire data set is about 500,000 records, but I doubt that my data structure
> would stretch to more than 90 records, as these are user-created selections.
> If N is < 90, I'll opt for 'for'.
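
For readers who haven't met the two transforms side by side, here is a rough
sketch of both, sorting hash references on a numeric field. The records and
the 'price' field are made up purely for illustration:

  use strict;
  use warnings;

  # Illustrative records only: each has a numeric 'price' field.
  my @records = map { +{ name => "item$_", price => int rand 1000 } } 1 .. 10;

  # Schwartzian Transform: decorate with array refs, sort with a
  # comparator, undecorate.
  my @by_st = map  { $_->[1] }
              sort { $a->[0] <=> $b->[0] }
              map  { [ $_->{price}, $_ ] } @records;

  # Guttman-Rosler Transform: pack the key (plus an index back into
  # @records) into a fixed-width string, let the default string sort
  # do the work with no comparator and no lookups, then unpack.
  # Assumes non-negative integer keys that fit in 32 bits.
  my @by_grt = map  { $records[ (unpack "NN", $_)[1] ] }
               sort
               map  { pack "NN", $records[$_]{price}, $_ } 0 .. $#records;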


90 is a very low number. If your dataset is 90 records or fewer, it's
unlikely to matter whether you wrap your sort in a transform or not --
unless you're running on some hardware from the early 70s.


But where the flip-over point lies depends on many, many factors,
and even a ballpark figure can only be obtained by testing with
real data, in the *same environment* where your production code
will run. Just a handful of the factors that matter:

  - Perl version
  - Hash function used
  - Compiler/compiler options
  - CPU architecture
  - Memory (amount of free memory, speed of memory/caches)
  - Total size of program
  - Sort function used by Perl (mergesort, quicksort)
  - How sorted the data is to begin with (matters a lot for mergesort,
    less for quicksort -- although if it comes from a hash, it's
    pretty random, and, in the most recent versions of Perl, will be
    different from run to run).
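
(On that last point: a quick, purely illustrative way to see the effect is to
run something like this a few times -- on Perl 5.18 and later, each run will
usually print the keys in a different order.)

  use strict;
  use warnings;
  use feature 'say';

  my %h;
  $h{$_} = 1 for 'a' .. 'j';
  say join ',', keys %h;   # key order changes from run to run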

If you want to know, *measure*. And all that measurement gives you, at
best, is a ballpark figure. Do remember that performing proper measurements
is a third art, a third science, a third engineering, and a third magic.
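
If you do want to measure, a sketch along these lines, using the core
Benchmark module, will at least give relative numbers on your own machine.
The records below are synthetic stand-ins; only your real data, in your
real environment, produces figures worth acting on:

  use strict;
  use warnings;
  use Benchmark qw(cmpthese);

  # Synthetic stand-in data; swap in (a sample of) the real records.
  my @records = map { +{ name => "item$_", price => int rand 1_000 } }
                1 .. 10_000;

  # Run each variant for at least 3 CPU seconds and print a comparison.
  cmpthese(-3, {
      plain => sub {
          my @s = sort { $a->{price} <=> $b->{price} } @records;
      },
      st    => sub {
          my @s = map  { $_->[1] }
                  sort { $a->[0] <=> $b->[0] }
                  map  { [ $_->{price}, $_ ] } @records;
      },
      grt   => sub {
          my @s = map  { $records[ (unpack "NN", $_)[1] ] }
                  sort
                  map  { pack "NN", $records[$_]{price}, $_ } 0 .. $#records;
      },
  });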


And that's not even factoring in that the time spent sorting (regardless
of how you sort) may be dwarfed by whatever else the program is doing.


Abigail

