XML::LibXML and HTML (in >=v1.67)
dave at dave.org.uk
Wed Apr 1 10:53:28 BST 2009
Toby Wintermute wrote:
> I know that really one should escape the ampersand in those
> circumstances, however real-world web-pages rarely do this.. And this
> behaviour was tolerated in XML::LibXML 1.66, just not subsequent
> versions.. but eh, maybe it's just the way I'm calling the parser?
Sounds like XML::LibXML has fixed a bug. XML parsers are supposed to
throw an exception when they encounter XML that isn't well-formed, and
an unescaped ampersand is exactly that.
What you're trying to parse isn't XML. Therefore you shouldn't expect to
be able to parse it with an XML parser.
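
To make that concrete, here's a minimal sketch of the strict behaviour. The
input strings are made up for illustration (they aren't from the page in
question), and it only assumes the parser's default, non-recovering settings:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input: the '&' in the query string should be written '&amp;'.
my $broken = '<p><a href="page?a=1&b=2">link</a></p>';
my $fixed  = '<p><a href="page?a=1&amp;b=2">link</a></p>';

my $parser = XML::LibXML->new;

# parse_string dies on input that is not well-formed, so trap it with eval.
my $doc = eval { $parser->parse_string($broken) };
print "broken input rejected: $@\n" unless $doc;

# The escaped version parses, and the attribute value comes back unescaped.
my $ok = $parser->parse_string($fixed);
print $ok->documentElement->findvalue('a/@href'), "\n";
```

The eval is there because the failure arrives as an exception rather than as
a return value.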
> Alternatively.. what do YOU use to parse real-world websites that are
> often not totally valid?
Two options spring to mind.
Firstly, you could use something like HTML Tidy to clean up the input
before you try to parse it. I assume that will fix problems like the one
Alternatively, you could try the (badly named) XML::Liberal, which
parses stuff that isn't really XML.
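
A rough sketch of the XML::Liberal route, assuming the module is installed
alongside XML::LibXML. The input string is hypothetical, and whether a given
kind of breakage gets repaired depends on the XML::Liberal version, so the
parse is still wrapped in eval:

```perl
use strict;
use warnings;
use XML::Liberal;

# Hypothetical input with an unescaped ampersand.
my $broken = '<p><a href="page?a=1&b=2">link</a></p>';

# XML::Liberal wraps XML::LibXML and tries to repair the input
# when a parse fails, then parses again.
my $parser = XML::Liberal->new('LibXML');
my $doc    = eval { $parser->parse_string($broken) };

if ($doc) {
    print $doc->toString, "\n";
}
else {
    print "still could not parse: $@\n";
}
```

If it works you get back an ordinary XML::LibXML document, so the rest of
your code shouldn't need to change.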