XML::LibXML and HTML (in >=v1.67)
abuse at cabal.org.uk
Wed Apr 1 10:37:28 BST 2009
On Wed, Apr 01, 2009 at 04:45:28PM +1100, Toby Wintermute wrote:
> I know that really one should escape the ampersand in those circumstances,
> however real-world web-pages rarely do this.. And this behaviour was
> tolerated in XML::LibXML 1.66, just not subsequent versions.. but eh,
> maybe it's just the way I'm calling the parser?
Possibly, but I have the same problem and never figured out how to get
XML::LibXML to directly parse such documents.
> Alternatively.. what do YOU use to parse real-world websites that are
> often not totally valid?
I clean up the source document like so:
$string =~ s/&(?!(?:\w+|#\d+);)/&amp;/g;
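
For illustration, here is a minimal, self-contained sketch of that clean-up step. The helper name escape_bare_ampersands and the sample markup are assumptions made for the example, not part of the original post. This variant also accepts hexadecimal references such as &#xA9;, which a pattern that only allows #\d+ would treat as a bare ampersand and escape.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for this example): escape every "&"
# that does not begin a named, decimal, or hexadecimal entity reference.
sub escape_bare_ampersands {
    my ($string) = @_;
    # The negative lookahead leaves "&name;", "&#123;" and "&#x7B;" alone;
    # any other "&" is rewritten to "&amp;".
    $string =~ s/&(?!(?:\w+|#\d+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $string;
}

# Sample input (an assumption): one unescaped "&" in a URL, plus three
# references that are already valid and must not be double-escaped.
my $html = '<a href="/search?q=cats&page=2">Tom &amp; Jerry &#169; &#xA9;</a>';
print escape_bare_ampersands($html), "\n";
# prints: <a href="/search?q=cats&amp;page=2">Tom &amp; Jerry &#169; &#xA9;</a>
```

The cleaned string can then be handed to whichever parser is in use. Note that the pattern only checks the shape of a reference, so an unknown name such as &bogus; passes through unchanged.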