XML::LibXML and HTML (in >=v1.67)
tjc at wintrmute.net
Thu Apr 2 02:21:46 BST 2009
2009/4/1 Tatsuhiko Miyagawa <miyagawa at gmail.com>:
> On Tue, Mar 31, 2009 at 10:45 PM, Toby Wintermute <tjc at wintrmute.net> wrote:
>> The problem occurs when the html contains (the commonly used) & symbol
>> within attributes, such as:
>> <a href="/foo?a=b&c=d">
>> I know that one really should escape the ampersand in those
>> circumstances; however, real-world web pages rarely do this. This
>> behaviour was tolerated in XML::LibXML 1.66, just not in subsequent
>> versions. But maybe it's just the way I'm calling the parser?
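
For illustration only, here is a minimal sketch of what "escaping the ampersand" means for the href above. The helper name escape_bare_ampersands is hypothetical and does not come from any module mentioned in this thread. It blindly rewrites the whole string, so it would also touch ampersands inside script blocks or comments; treat it as a demonstration of the problem, not a robust fix.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::LibXML or XML::Liberal):
# escape every '&' that does not already begin a named, decimal, or
# hexadecimal character reference such as &amp; &#38; or &#x26;.
sub escape_bare_ampersands {
    my ($html) = @_;
    $html =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/&amp;/g;
    return $html;
}

# The bare '&' between the query parameters becomes '&amp;'.
print escape_bare_ampersands('<a href="/foo?a=b&c=d">'), "\n";
```

References that are already escaped are left alone, so running the helper twice should not double-escape anything.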
> XML::Liberal addresses exactly this kind of issue. It also got
> broken with XML::LibXML 1.67 by its error format change, but works
> with 1.69_2 on CPAN.
>> Alternatively, what do YOU use to parse real-world websites that are
>> often not totally valid?
> I use my own Web::Scraper [2,3] to scrape stuff and it uses
> HTML::TreeBuilder (and ::XPath) to build a DOM tree and runs XPath or
> CSS selector against it. It's definitely slower than LibXML but can
> deal with such broken HTML documents very well. If you really care
> about performance there's also HTML::TreeBuilder::LibXML on GitHub
> that is a drop-in replacement for H::TB::XPath but uses LibXML under
> the hood.
Thanks, Web::Scraper looks quite neat.
However I want to avoid applications breaking on random CPAN module
upgrades (as just happened with the XML::LibXML upgrade yesterday), so
I might steer clear of it until it loses the big, bold warning about
the API still being unstable.
I'm sure you understand :)
> Another option would be to filter out such XHTML errors with
> HTML::Tidy before passing it to LibXML. It would be neat if you do
> that cleanup only if libxml parsing fails even with recover_errors
> etc. set.
Hmm, that is an interesting idea. This particular company's website
tends to break other aspects of its HTML as well, such as repeated
id values, so maybe HTML::Tidy will help with that too.
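
A rough sketch of the "tidy only if libxml fails" idea might look like the following. The helper name parse_html_with_fallback is made up for this example. It assumes that the "recover_errors" setting mentioned above corresponds to XML::LibXML's recover option, and that HTML::Tidy's output_xhtml setting produces markup libxml will accept. Neither assumption has been checked against the affected XML::LibXML releases.

```perl
use strict;
use warnings;
use XML::LibXML;
use HTML::Tidy;

# Hypothetical helper: try a lenient libxml parse first, and only run
# the (slower) HTML::Tidy cleanup if that parse still dies.
sub parse_html_with_fallback {
    my ($html) = @_;

    my $parser = XML::LibXML->new;
    $parser->recover(1);    # keep going past recoverable parse errors

    # parse_html_string dies on errors it cannot recover from.
    my $doc = eval { $parser->parse_html_string($html) };
    return $doc if $doc;

    # Assumption: these tidy options yield well-formed XHTML output
    # without the "generator" meta tag that tidy normally adds.
    my $tidy  = HTML::Tidy->new({ output_xhtml => 1, tidy_mark => 0 });
    my $clean = $tidy->clean($html);

    # If this second parse also fails, let the exception propagate.
    return $parser->parse_html_string($clean);
}
```

With recover(1) set, libxml may still print warnings for the errors it skips over. Whether the repeated id values survive tidying unchanged is something to verify against the real pages.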