XML::LibXML and HTML (in >=v1.67)

Pedro Figueiredo me at pedrofigueiredo.org
Wed Apr 1 13:18:39 BST 2009

On 1 Apr 2009, at 06:45, Toby Wintermute wrote:

> Alternatively.. what do YOU use to parse real-world websites that are
> often not totally valid?

If it's a quick hack I'll use HTML::Tidy like so:

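use HTML::Tidy;    # also exports TIDY_WARNING

my $tidy = HTML::Tidy->new({
     output_xhtml     => 1,    # emit XHTML so an XML parser can read it
     numeric_entities => 1,    # numeric rather than named entities
});
$tidy->ignore( type => TIDY_WARNING );
$html = $tidy->clean( $html );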

which I then feed to XML::XPath.
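
A minimal sketch of that last step, for illustration only. It assumes the cleaned-up markup is already in a string; the sample document, the `$xhtml` variable and the `//a` query are made up for the example and are not from the original script. Real tidy output in XHTML mode also carries a DOCTYPE and an xmlns declaration, which this sample leaves out to keep things short.

```perl
use strict;
use warnings;
use XML::XPath;

# Hypothetical stand-in for what $tidy->clean( $html ) hands back.
my $xhtml = <<'XHTML';
<html>
  <head><title>Example</title></head>
  <body>
    <p><a href="/one">First</a> and <a href="/two">second</a></p>
  </body>
</html>
XHTML

# Parse the string and query it with XPath.
my $xp = XML::XPath->new( xml => $xhtml );

# In list context findnodes() returns the matching nodes.
for my $link ( $xp->findnodes('//a') ) {
    printf "%s -> %s\n", $link->string_value, $link->getAttribute('href');
}
```

Markup that has not been through tidy first will usually make the XML parser die on the first unclosed tag or undefined named entity, which is the reason for the cleaning pass.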

If it's something long-term-ish, I use Web::Scraper.



