XML::LibXML and HTML (in >=v1.67)
Pedro Figueiredo
me at pedrofigueiredo.org
Wed Apr 1 13:18:39 BST 2009
On 1 Apr 2009, at 06:45, Toby Wintermute wrote:
>
> Alternatively.. what do YOU use to parse real-world websites that are
> often not totally valid?
If it's a quick hack I'll use HTML::Tidy like so:
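    use HTML::Tidy;    # also exports the TIDY_WARNING constant

    my $tidy = HTML::Tidy->new({
        output_xhtml     => 1,    # emit well-formed XHTML
        numeric_entities => 1,    # write named entities such as &nbsp; in numeric form
    });
    $tidy->ignore( type => TIDY_WARNING );    # skip warning messages
    $html = $tidy->clean( $html );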
which I then feed to XML::XPath.
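
For the XML::XPath step, something along these lines is a rough sketch rather than my exact code. The $html string here is a simplified stand-in for what $tidy->clean() returns (real Tidy output also carries a DOCTYPE and a namespace declaration, which this sketch does not exercise), and the link-extraction query is only an illustrative assumption about what you might want to pull out:

```perl
use strict;
use warnings;
use XML::XPath;

# Hypothetical stand-in for the string returned by $tidy->clean( $html ).
my $html = <<'XHTML';
<html>
  <head><title>Example</title></head>
  <body>
    <p><a href="http://example.org/one">One</a></p>
    <p><a href="http://example.org/two">Two</a></p>
  </body>
</html>
XHTML

# Parse the cleaned markup from a string.
my $xp = XML::XPath->new( xml => $html );

# Collect the href attribute of every link in the document.
my @links = map { $_->getData } $xp->findnodes('//a/@href');

print "$_\n" for @links;
```

The point of running Tidy first is that XML::XPath needs well-formed input, so the query only has to deal with clean markup.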
If it's something long-term-ish, I use Web::Scraper.
Cheers,
Pedro