XML::LibXML and HTML (in >=v1.67)

Pedro Figueiredo me at pedrofigueiredo.org
Wed Apr 1 13:18:39 BST 2009


On 1 Apr 2009, at 06:45, Toby Wintermute wrote:

>
> Alternatively.. what do YOU use to parse real-world websites that are
> often not totally valid?

If it's a quick hack I'll use HTML::Tidy like so:

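use HTML::Tidy;    # also provides the TIDY_WARNING constant

my $tidy = HTML::Tidy->new({
     output_xhtml     => 1,    # emit XHTML so an XML parser can read it
     numeric_entities => 1,    # numeric entities instead of named ones like &nbsp;
});
$tidy->ignore( type => TIDY_WARNING );    # don't collect warning-level messages
$html = $tidy->clean( $html );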

which I then feed to XML::XPath.
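
For illustration, the XML::XPath step might look roughly like the sketch below. It is not taken from any real script: the sample XHTML string stands in for whatever $tidy->clean() returns, and the choice to collect link targets into @links is made up for the example. The local-name() test is used so the query does not depend on how the XHTML default namespace is handled.

```perl
use strict;
use warnings;
use XML::XPath;

# Hypothetical stand-in for the cleaned XHTML that HTML::Tidy would produce.
my $xhtml = <<'XHTML';
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body>
    <p><a href="http://example.com/one">One</a></p>
    <p><a href="http://example.com/two">Two</a></p>
  </body>
</html>
XHTML

# Parse the string rather than a file.
my $xp = XML::XPath->new( xml => $xhtml );

# Collect the href of every <a> element, matching on the local name only.
my @links;
for my $node ( $xp->findnodes(q{//*[local-name()='a']}) ) {
    push @links, $node->getAttribute('href');
}

print "$_\n" for @links;
```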

If it's something long-term-ish, I use Web::Scraper.
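
A minimal Web::Scraper sketch, assuming you already have the HTML in a string (scrape() also accepts a URI object if you want it to do the fetching). The markup and the "titles" key are invented for the example.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical input; note the unclosed <li> tags, which the
# underlying HTML parser is expected to cope with.
my $html = <<'HTML';
<ul>
  <li><a href="http://example.com/one">One</a>
  <li><a href="http://example.com/two">Two</a>
</ul>
HTML

# Declare what to extract: the text of every <a>, gathered into an array.
my $links = scraper {
    process "a", "titles[]" => 'TEXT';
};

# scrape() returns a hash reference keyed by the names declared above.
my $result = $links->scrape($html);

print "$_\n" for @{ $result->{titles} };
```

The declarative block is the appealing part for longer-lived code: when the site's layout changes, usually only the selectors need touching.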

Cheers,

Pedro

