XML::LibXML and HTML (in >=v1.67)
mirod
mirod at xmltwig.com
Wed Apr 1 14:11:07 BST 2009
Tatsuhiko Miyagawa wrote:
> On Tue, Mar 31, 2009 at 10:45 PM, Toby Wintermute <tjc at wintrmute.net> wrote:
>> The problem occurs when the html contains (the commonly used) & symbol
>> within attributes, such as:
>> <a href="/foo?a=b&c=d">
>>
>> I know that really one should escape the ampersand in those
>> circumstances, however real-world web-pages rarely do this.. And this
>> behaviour was tolerated in XML::LibXML 1.66, just not subsequent
>> versions.. but eh, maybe it's just the way I'm calling the parser?
>
> XML::Liberal [1] exactly addresses issues like this, and it also got
> broken with XML::LibXML 1.67 with its error format change but works
> with 1.69_2 on CPAN.
>
>> Alternatively.. what do YOU use to parse real-world websites that are
>> often not totally valid?
>
> I use my own Web::Scraper [2,3] to scrape stuff and it uses
> HTML::TreeBuilder (and ::XPath) to build a DOM tree and runs XPath or
> CSS selector against it. It's definitely slower than LibXML but can
> deal with such broken HTML documents very well. If you really care
> about performance there's also HTML::TreeBuilder::LibXML on github [4]
> that is a drop-in replacement for H::TB::XPath but uses LibXML under
> the hood.
Indeed, when I tested the various ways to get XML from HTML a couple of years
ago, I found that the best way was to go through HTML::TreeBuilder. It managed
to make sense of more random web pages, without choking, than either tidy or
XML::LibXML.
The only problem I found was with tags like '<table 1>', which the as_XML
method outputs as '<table 1="1">'. That is not well-formed XML, since an
attribute name cannot start with a digit. It doesn't prevent you from using
XPath on the tree with HTML::TreeBuilder::XPath, though.
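
To illustrate the point, here is a minimal sketch, assuming a made-up snippet
of HTML with a '<table 1>' tag. The sample markup and variable names are mine,
purely for illustration. The XPath query runs on the in-memory tree, so it does
not depend on the serialization being well-formed. The as_XML call is wrapped
in eval because I would not rely on every HTML::Element version handling the
odd attribute name the same way.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input: a tag carrying a bare, numeric attribute.
my $html = '<html><body><table 1><tr><td>cell</td></tr></table></body></html>';

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# The query is evaluated against the parsed tree, not against serialized XML.
my $cell = $tree->findvalue('//table//td');
print "cell text: $cell\n";

# Serialize to inspect how the odd attribute comes out. Guarded with eval,
# on the assumption that behaviour may vary between HTML::Element versions.
my $xml = eval { $tree->as_XML };
print defined $xml ? $xml : "as_XML failed: $@";

$tree->delete;    # release the tree's memory
```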
So HTML::TreeBuilder::XPath, beyond being a shameless plug, is my preferred way
to process HTML while still being able to use XPath.
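
For the case that started this thread, a sketch along these lines should do,
assuming the page is already in a string. The sample markup is made up, with
the unescaped '&' in the href left as it was in the original question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up sample: an unescaped ampersand inside an attribute value.
my $html = '<html><body><a href="/foo?a=b&c=d">link</a></body></html>';

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# findnodes returns HTML::Element objects; attr() reads an attribute value.
my @hrefs = map { $_->attr('href') } $tree->findnodes('//a[@href]');

print "$_\n" for @hrefs;

$tree->delete;    # release the tree's memory
```

For a page fetched from the web you would get the content first (with LWP or
similar) and hand it to new_from_content in the same way.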
--
mirod
More information about the london.pm mailing list