XML::LibXML and HTML (in >=v1.67)

Wed Apr 1 06:45:28 BST 2009

Hi,
I've been using XML::LibXML in the back-end of rea-toys[1] to scrape a
certain website for a while now, but noticed it all broke down when I
upgraded XML::LibXML from 1.66 to 1.69, and after some quick testing I
narrowed the change down to being between version 1.66 and 1.67.
My first instinct was to write a test and create an RT ticket [2], but
it occurs to me now that maybe I'm just Doing It Wrong..
If you have a moment.. Does this look right to you?

use XML::LibXML;

# Assuming $html contains a fairly typical HTML or XHTML webpage..
my $parser = XML::LibXML->new;
my $doc = $parser->parse_html_string(
    $html => { recover => 1, suppress_errors => 1 }
);

The problem occurs when the HTML contains the (commonly used) & symbol
unescaped within attributes, such as:
<a href="/foo?a=b&c=d">

I know that one should really escape the ampersand (as &amp;) in those
circumstances, however real-world web pages rarely do this.. And this
behaviour was tolerated in XML::LibXML 1.66, just not in subsequent
versions.. but eh, maybe it's just the way I'm calling the parser?
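
To make that concrete, here is a minimal sketch of the sort of check I
mean. It is not the test attached to the ticket; the sample markup and
the variable names are made up for illustration, and what it prints
will depend on which XML::LibXML version is installed.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample input: an unescaped & inside an href attribute.
my $html = '<html><body><a href="/foo?a=b&c=d">link</a></body></html>';

my $parser = XML::LibXML->new;

# Same call as above, wrapped in eval so a parse failure is reported
# instead of killing the script.
my $doc = eval {
    $parser->parse_html_string(
        $html => { recover => 1, suppress_errors => 1 }
    );
};
my $error = $@;

my @hrefs;
if ($doc) {
    # Collect every href the parser managed to recover.
    @hrefs = map { $_->getAttribute('href') } $doc->findnodes('//a[@href]');
    print "parsed OK, href: $_\n" for @hrefs;
}
else {
    print "parse failed: $error\n";
}
```

Running the same script under 1.66 and under a later release should
show whether the difference really is in the parser or in how I call it.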

Alternatively.. what do YOU use to parse real-world websites that are
often not totally valid?

If you'd like to see the small standalone test, click through to the
RT ticket and it is attached there.
Cheers,
Toby

[1: git://github.com/TJC/rea-toys.git ]
[2: http://rt.cpan.org/Public/Bug/Display.html?id=44715 ]