XML::LibXML and HTML (in >=v1.67)

Wed Apr 1 11:11:26 BST 2009

On Wed, Apr 1, 2009 at 2:53 AM, Dave Cross <dave at dave.org.uk> wrote:
>> I know that really one should escape the ampersand in those
>> circumstances, however real-world web-pages rarely do this.. And this
>> behaviour was tolerated in XML::LibXML 1.66, just not subsequent
>> versions.. but eh, maybe it's just the way I'm calling the parser?
>
> Sounds like XML::LibXML has fixed a bug. XML parsers are supposed to throw
> an exception when they encounter invalid XML.

The method we're talking about here is parse_*html*, though. libxml2
keeps parsing HTML that contains errors like this, and XML::LibXML has
an option (recover=>1) that stops it from choking on them:

perldoc XML::LibXML::Parser

       Parsing HTML may cause problems, especially if the ampersand ('&') is
       used. .... Such links cause the parser to throw errors. In
       such cases libxml2 still parses the entire document as there was no
       error ...  Such HTML documents should be parsed using the recover flag.
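
In practice that looks roughly like the sketch below. I haven't tested
it against 1.67 specifically. The sample URL and the helper name
parse_lenient are made up for illustration, and I'm assuming the
recover() setter on the parser object, which the same perldoc describes
alongside the recover option.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample: the '&' in the query string is not escaped as '&amp;'.
my $html = <<'HTML';
<html><body>
<a href="/cgi-bin/search?q=perl&page=2">search</a>
</body></html>
HTML

# Hypothetical helper: parse an HTML string with recovery switched on.
sub parse_lenient {
    my ($string) = @_;
    my $parser = XML::LibXML->new;
    $parser->recover(1);    # keep the document despite parse errors
    return $parser->parse_html_string($string);
}

my $doc = parse_lenient($html);
print $doc->findvalue('//a'), "\n";          # text of the link
print $doc->findvalue('//a/@href'), "\n";    # href as libxml2 recovered it
```

With recover(1) the parser's complaints may still show up as warnings
on STDERR. If that is too noisy, recover_silently(1) should quieten
them, assuming your version provides it.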


-- 
Tatsuhiko Miyagawa