consolidate regexes

Aaron Crane perl at aaroncrane.co.uk
Tue Feb 25 14:59:41 GMT 2014


Alex Balhatchet <kaoru at slackwise.net> wrote:
> The POD of Dan Kogai's Regexp::Trie lists some alternatives and the
> description compares them a little.

The regexes generated by Regexp::Trie will almost certainly be slower
to execute than the regex you'd get with the naive approach:

my $pattern = join '|', map quotemeta, @fixed_strings;
my $rx = qr/$pattern/;

That's because, thanks to demerphq++, Perl 5.10 and above have a
built-in trie optimisation which is defeated by the cleverness of the
Regexp::Trie regexes. To take the example from the Regexp::Trie
documentation:

$ perl -Mre=debug -wE 'qr/foo(?:bar|xar|zap?)/; qr/foobar|fooxar|foozap|fooza/'
Compiling REx "foo(?:bar|xar|zap?)"
Final program:
   1: EXACT <foo> (3)
   3: TRIE-EXACT[bxz] (17)
      <bar> (17)
      <xar> (17)
      <za> (12)
  12:   CURLY {0,1} (17)
  14:     EXACT <p> (0)
  17: END (0)
anchored "foo" at 0 (checking anchored) minlen 5
Compiling REx "foobar|fooxar|foozap|fooza"
Final program:
   1: EXACT <foo> (3)
   3: TRIEC-EXACT[bxz] (17)
      <bar>
      <xar>
      <zap>
      <za>
  17: END (0)
anchored "foo" at 0 (checking anchored) minlen 5
Freeing REx: "foo(?:bar|xar|zap?)"
Freeing REx: "foobar|fooxar|foozap|fooza"

Note that the second regex puts the tail of all four subpatterns into
a single TRIEC-EXACT node.

AFAICT, this also applies to Regexp::Assemble, Regexp::Optimizer, and
Regexp::PreSuf; they all apply a trie optimisation manually, so they
all defeat the built-in trie optimisation. (The documentation for
Regexp::Assemble says that it relies on the built-in optimisation
where available, but I don't observe that behaviour when I try it.)

Smylers++ mentioned my Text::Match::FastAlternatives module upthread.
My measurements suggest it can be quite a lot faster than the trie
support in Perl's regexes, but the difference is small until the
number of strings gets large. If you care deeply about speed for this
sort of thing, it's probably worth running your own benchmarks on a
representative workload.

-- 
Aaron Crane ** http://aaroncrane.co.uk/


More information about the london.pm mailing list