XML::LibXML and HTML (in >=v1.67)

Thu Apr 2 02:13:12 BST 2009

2009/4/1 Tatsuhiko Miyagawa <miyagawa at gmail.com>:
> On Wed, Apr 1, 2009 at 2:53 AM, Dave Cross <dave at dave.org.uk> wrote:
>>> I know that really one should escape the ampersand in those
>>> circumstances, however real-world web-pages rarely do this.. And this
>>> behaviour was tolerated in XML::LibXML 1.66, just not subsequent
>>> versions.. but eh, maybe it's just the way I'm calling the parser?
>>
>> Sounds like XML::LibXML has fixed a bug. XML parsers are supposed to throw
>> an exception when they encounter invalid XML.
>
> The method we're talking about here is parse_*html*, and libxml2
> continues parsing HTML with errors like this, and XML::LibXML has an
> option (recover=>1) not to choke on that:
>
> perldoc XML::LibXML::Parser
>
>       Parsing HTML may cause problems, especially if the ampersand ('&') is
>       used. .... Such links cause the parser to throw errors. In
>       such cases libxml2 still parses the entire document as there was no
>       error ...  Such HTML documents should be parsed using the recover flag.

That is indeed what the POD docs say, and in version 1.66, the
behaviour matched the documentation.

In 1.67 to 1.69_2, the behaviour appears to differ, i.e. such errors
are fatal even when the recover flag is set.
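
For anyone wanting to check this against their own install, here is a
minimal sketch of the kind of call under discussion. It assumes the flag
is set through the parser's recover() accessor before calling
parse_html_string(). The helper name try_parse_html and the sample markup
are made up for illustration; they are not taken from the original report.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse an HTML string with the recover flag set
# and return (document, error) instead of dying.
sub try_parse_html {
    my ($html) = @_;
    my $parser = XML::LibXML->new;
    $parser->recover(1);    # ask the parser to carry on past markup errors
    my $doc = eval { $parser->parse_html_string($html) };
    return ($doc, $@);
}

# Illustrative input: an href containing an unescaped ampersand.
my $html = '<html><body><a href="/search?q=perl&lang=en">link</a></body></html>';

my ($doc, $err) = try_parse_html($html);

print "XML::LibXML version: $XML::LibXML::VERSION\n";
if (defined $doc) {
    print "parsed; href = ", $doc->findvalue('//a/@href'), "\n";
}
else {
    print "parse failed: $err\n";
}
```

Per the POD quoted above, the expectation is that the "parsed" branch is
taken; a "parse failed" line on 1.67 to 1.69_2 would reproduce the
behaviour described here.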