Why Perl needs a VM
Ismail, Rafiq (IT)
Rafiq.Ismail at MorganStanley.com
Wed Sep 5 12:21:25 BST 2007
<!-- outlook.. Apologies for toppost -->
For large feeds, depending on the structure and semantics of your
documents, one approach you may want to consider is a combination of
SAX/DOM parsing, where at runtime you DOM-parse a 'reconstructed
subtree' (built at SAX time) of your main document. This would constrain
the depth of your XPaths and the overall size of the DOM tree. For large
trees which require DOM parsing, it can be quite performant. You could
also potentially parallelise your processing of a single document.
Depending on your requirements, you could also hold some state between
subtree parses and inject nodes into the reconstructed tree - where
there is a dependency on some previously parsed artifact.
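A minimal sketch of that hybrid approach, in stdlib Python for illustration (the same pattern applies in Perl via XML::LibXML's SAX handlers or XML::Twig); the element name 'record' and the inline sample feed are assumptions, not anything from the thread:

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# A tiny stand-in for a large feed of repeated records.
doc = BytesIO(b"<feed>"
              b"<record id='1'><v>10</v></record>"
              b"<record id='2'><v>20</v></record>"
              b"</feed>")

totals = []
# Stream the document; only materialise a tree for each <record>
# subtree, so queries run against a small tree rather than a DOM of
# the whole feed.
for event, elem in ET.iterparse(doc, events=("end",)):
    if elem.tag == "record":
        totals.append(int(elem.findtext("v")))
        elem.clear()  # release the finished subtree to bound memory

print(totals)  # -> [10, 20]
```

Because each subtree is self-contained once its end event fires, this is also the natural unit for handing work to parallel workers, as suggested above.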
The likes of XML::Twig might also be worth looking at, but I don't know
much about its underlying implementation or performance.
Just a thought.
> -----Original Message-----
> From: london.pm-bounces at london.pm.org
> [mailto:london.pm-bounces at london.pm.org] On Behalf Of ben at bpfh.net
> Sent: 04 September 2007 23:00
> To: London.pm Perl M[ou]ngers
> Subject: Re: Why Perl needs a VM
> On Tue, Sep 04, 2007 at 03:34:00PM -0400, Matt Sergeant wrote:
> >On 4-Sep-07, at 2:20 PM, ben at bpfh.net wrote:
> >>I've had a poke at the code and sure enough, we're using a module
> >>which I'd thought was a pure Perl piece on top of XML::LibXML. It
> >>isn't - it's got a C implementation at its heart.
> >>The Java implementation is still substantially quicker.
> >Then you're doing something wrong. Or it's not the XPath part that's
> >slow. XML::LibXML is significantly faster than any Java implementation.
> Matt, these benchmarks are very interesting - thanks for posting them.
> Our typical use case is a document size of 2-10M, so these
> results go some way to explaining what we're seeing - as
> that's the range where the results you pointed at show Java
> 1.5 or JDOM to start being faster than libxml2.
> Of course, we should also remember that these benchmarks are
> strictly for libxml2, rather than XML::LibXML. I would expect
> only a trivial additive constant time adjustment from Perl's
> string handling overhead, which would be lost in the noise of
> a 4M document, but it's probably worth checking that assumption.
> I'll have a proper look when I get some extra tuits - I'm
> particularly interested in how sensitive these numbers are to
> the ratio of number of nodes to size of document, but this is
> a great signpost.
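A rough sketch of the nodes-to-size experiment mentioned above, again in stdlib Python: two documents of comparable byte size, one with many small elements and one with few large ones, timed through a single parse each. The absolute numbers will differ from libxml2/XML::LibXML; only the shape of the comparison is the point, and the node counts and payload sizes are arbitrary choices.

```python
import time
import xml.etree.ElementTree as ET

def make_doc(n_nodes, payload):
    """Build a flat document with n_nodes children of fixed text size."""
    body = "".join(f"<n>{payload}</n>" for _ in range(n_nodes))
    return f"<root>{body}</root>"

many_small = make_doc(10_000, "x" * 10)    # many nodes, small text each
few_large = make_doc(100, "x" * 1_693)     # few nodes, large text each

results = {}
for label, doc in (("many small", many_small), ("few large", few_large)):
    t0 = time.perf_counter()
    tree = ET.fromstring(doc)
    dt = time.perf_counter() - t0
    results[label] = (len(doc), len(tree), dt)
    print(f"{label}: {len(doc)} bytes, {len(tree)} nodes, {dt:.4f}s")
```

If parse time tracks node count rather than byte count, that would support the idea that the 2-10M document results depend heavily on how node-dense those documents are.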