XML::LibXML and HTML (in >=v1.67)

Wed Apr 1 14:11:07 BST 2009

Tatsuhiko Miyagawa wrote:
> On Tue, Mar 31, 2009 at 10:45 PM, Toby Wintermute <tjc at wintrmute.net> wrote:
>> The problem occurs when the html contains (the commonly used) & symbol
>> within attributes, such as:
>> <a href="/foo?a=b&c=d">
>>
>> I know that really one should escape the ampersand in those
>> circumstances, however real-world web-pages rarely do this.. And this
>> behaviour was tolerated in XML::LibXML 1.66, just not subsequent
>> versions.. but eh, maybe it's just the way I'm calling the parser?
> 
> XML::Liberal [1] exactly addresses issues like this, and it also got
> broken with XML::LibXML 1.67 with its error format change but works
> with 1.69_2 on CPAN.
> 
>> Alternatively.. what do YOU use to parse real-world websites that are
>> often not totally valid?
> 
> I use my own Web::Scraper [2,3] to scrape stuff and it uses
> HTML::TreeBuilder (and ::XPath) to build a DOM tree and runs XPath or
> CSS selector against it. It's definitely slower than LibXML but can
> deal with such broken HTML documents very well. If you really care
> about performance there's also HTML::TreeBuilder::LibXML on github [4]
> that is a drop-in replacement for H::TB::XPath but uses LibXML under
> the hood.

Indeed, when I tested the various ways to get XML from HTML a couple of years 
ago, I found that the best way was to go through HTML::TreeBuilder. It managed 
to make sense, without choking, of more random web pages than either tidy or 
XML::LibXML.

The only problem I found was with tags like '<table 1>', which the as_XML 
method outputs as '<table 1="1">'. That is not well-formed XML, since an 
attribute name cannot start with a digit. This doesn't prevent you from using 
XPath on it with HTML::TreeBuilder::XPath though.
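
For what it's worth, a minimal sketch of the HTML-to-XML route through 
HTML::TreeBuilder and as_XML. The sample markup is made up for illustration 
(it reuses the unescaped ampersand from the original question), and I would 
expect the ampersand to come out escaped in the serialised attribute, but 
check the output on your own version:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative input: an unescaped ampersand inside an attribute value.
my $html = '<html><body><a href="/foo?a=b&c=d">link</a></body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);    # feed the markup to the parser
$tree->eof;             # signal end of document so the tree is finalised

# Serialise the repaired tree as XML.
my $xml = $tree->as_XML;
print $xml;

$tree->delete;          # release the tree explicitly
```

Note that the serialised string is only as well-formed as the attribute names 
in the source allow, as with the '<table 1>' case.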

So HTML::TreeBuilder::XPath, beyond being a shameless plug, is my preferred way 
to process HTML while still being able to use XPath.
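
A short sketch of what that looks like in practice, assuming a made-up page 
containing the kind of link from the original question. The XPath expression 
and variable names are only for illustration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Illustrative input with an unescaped ampersand in the href.
my $html = '<html><body><a href="/foo?a=b&c=d">link</a></body></html>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);    # tolerant HTML parsing, no XML well-formedness needed
$tree->eof;

# Run XPath directly against the HTML tree and collect the href values.
my @hrefs = map { $_->attr('href') } $tree->findnodes('//a[@href]');
print "$_\n" for @hrefs;

$tree->delete;          # release the tree explicitly
```

The XPath queries run against the in-memory tree, so nothing has to be 
serialised to XML first.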

-- 
mirod