XML::LibXML and HTML (in >=v1.67)

Dave Cross dave at dave.org.uk
Wed Apr 1 10:53:28 BST 2009


Toby Wintermute wrote:

> I know that really one should escape the ampersand in those
> circumstances, however real-world web-pages rarely do this.. And this
> behaviour was tolerated in XML::LibXML 1.66, just not subsequent
> versions.. but eh, maybe it's just the way I'm calling the parser?

Sounds like XML::LibXML has fixed a bug. XML parsers are supposed to
throw an exception when they encounter input that is not well-formed
XML, and an unescaped ampersand is not well-formed.

What you're trying to parse isn't XML. Therefore you shouldn't expect to 
be able to parse it with an XML parser.
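
To illustrate, here is a minimal sketch of what I'd expect a strict
parse to do with that kind of input. The markup string is made up for
the example, and the exact error text will depend on your libxml2
version.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input: an href with a bare "&", as seen on many web pages.
my $markup = '<html><body><a href="page?a=1&b=2">link</a></body></html>';

my $parser = XML::LibXML->new;

# parse_string() dies on input that is not well-formed XML,
# so wrap it in eval to catch the exception.
my $doc = eval { $parser->parse_string($markup) };

if ($doc) {
    print "Parsed OK\n";
}
else {
    print "Not well-formed XML: $@\n";
}
```

Escaping the ampersand as &amp;amp; in the source would make the same
call succeed.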

> Alternatively.. what do YOU use to parse real-world websites that are
> often not totally valid?

Two options spring to mind.

Firstly, you could use something like HTML Tidy to clean up the input 
before you try to parse it. I assume that will fix problems like the one 
you're seeing.
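
A rough sketch of that first option, assuming the HTML::Tidy module
from CPAN (a wrapper around the tidy library) is installed. The input
string is invented for the example, and I haven't checked this against
your actual pages.

```perl
use strict;
use warnings;
use HTML::Tidy;
use XML::LibXML;

# Hypothetical input with an unescaped ampersand in an attribute.
my $html = '<html><head><title>t</title></head>'
         . '<body><a href="page?a=1&b=2">link</a></body></html>';

# Ask tidy for XHTML output so the result is meant to be well-formed XML.
my $tidy  = HTML::Tidy->new({ output_xhtml => 1, tidy_mark => 0 });
my $clean = $tidy->clean($html);

my $parser = XML::LibXML->new;
# Don't fetch the XHTML DTD named in the DOCTYPE that tidy adds.
$parser->load_ext_dtd(0);

my $doc = $parser->parse_string($clean);
print $doc->documentElement->nodeName, "\n";
```

Note that the cleaned document is in the XHTML namespace, so any XPath
you run against it will need a namespace prefix registered.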

Alternatively, you could try the (badly named) XML::Liberal, which
parses stuff that isn't really XML.
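
Something along these lines, going by the XML::Liberal synopsis. The
input string is hypothetical, and whether this particular kind of
breakage gets repaired depends on the version of the module, so treat
it as a sketch rather than a promise.

```perl
use strict;
use warnings;
use XML::Liberal;

# Hypothetical input: otherwise fine, but with a bare "&" in an attribute.
my $markup = '<doc><a href="page?a=1&b=2">link</a></doc>';

# XML::Liberal wraps XML::LibXML and tries to repair known errors.
my $parser = XML::Liberal->new('LibXML');

# Still use eval: input it can't repair will die as usual.
my $doc = eval { $parser->parse_string($markup) };

if ($doc) {
    print $doc->toString, "\n";
}
else {
    print "Still could not parse: $@\n";
}
```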

Cheers,

Dave...

