XML::LibXML and HTML (in >=v1.67)

Wed Apr 1 10:37:28 BST 2009

On Wed, Apr 01, 2009 at 04:45:28PM +1100, Toby Wintermute wrote:
[...]
> I know that really one should escape the ampersand in those circumstances,
> however real-world web-pages rarely do this.. And this behaviour was
> tolerated in XML::LibXML 1.66, just not subsequent versions.. but eh,
> maybe it's just the way I'm calling the parser?

Possibly, but I have the same problem and never figured out how to get
XML::LibXML to directly parse such documents.

> Alternatively.. what do YOU use to parse real-world websites that are
> often not totally valid?

I clean up the source document before parsing it, escaping every
ampersand that does not already start an entity, like so:

  $string =~ s/&(?!(?:\w+|#\d+|#x[0-9a-fA-F]+);)/&amp;/g;
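
For anyone who wants to try this out, here is a minimal sketch of the same
substitution wrapped in a helper. The sub name escape_bare_ampersands and the
sample markup are made up for illustration. The pattern here also treats
hexadecimal references such as &#xA9; as already escaped, so they are left
alone.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): escape every "&"
# that does not already start a named, decimal, or hexadecimal reference.
sub escape_bare_ampersands {
    my ($html) = @_;

    # The negative lookahead (?!...) refuses to match when the "&" is
    # followed by "name;", "#digits;", or "#xhex;", so existing
    # references stay untouched.
    $html =~ s/&(?!(?:\w+|#\d+|#x[0-9a-fA-F]+);)/&amp;/g;

    return $html;
}

# Made-up sample input with two bare ampersands and three real references.
my $dirty = '<a href="/search?q=perl&lang=en">Q&A</a> &copy; &#169; &#xA9;';
my $clean = escape_bare_ampersands($dirty);

print "$clean\n";
# prints: <a href="/search?q=perl&amp;lang=en">Q&amp;A</a> &copy; &#169; &#xA9;
```

The cleaned string can then be handed to the parser, for instance with
XML::LibXML->new->parse_html_string($clean). Two caveats: the lookahead only
checks the shape of a reference, so an unknown name like &foo; is passed
through as-is, and the substitution is applied everywhere in the string,
including inside any inline scripts.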