XML::LibXML and HTML (in >=v1.67)

Thu Apr 2 02:29:11 BST 2009

2009/4/2 mirod <mirod at xmltwig.com>:
> Tatsuhiko Miyagawa wrote:
>>
>> On Tue, Mar 31, 2009 at 10:45 PM, Toby Wintermute <tjc at wintrmute.net>
>> wrote:
>>>
>>> The problem occurs when the HTML contains the (commonly used) & symbol
>>> within attributes, such as:
>>> <a href="/foo?a=b&c=d">
[snip]
>
> Indeed when I tested the various ways to get XML from HTML, a couple of
> years ago, I found that the best way was to go through HTML::TreeBuilder. It
> managed to make sense, without choking, of more random web pages than both
> tidy and XML::LibXML.
>
> The only problem I found was with tags like '<table 1>', which get output by
> the as_XML method as '<table 1="1">'. That is not well-formed XML, since an
> attribute name cannot start with a digit. This doesn't prevent you from
> using XPath on it with HTML::TreeBuilder::XPath though.
>
> So HTML::TreeBuilder::XPath, beyond being a shameless plug, is my preferred
> way to process HTML while still being able to use XPath.

Ah, I was hoping I wouldn't have to go and rewrite the XPath queries
as HTML::Tree look_down() queries... So HTML::TreeBuilder::XPath looks great :)
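
For the archive, a minimal sketch of what that could look like, assuming
HTML::TreeBuilder::XPath is installed from CPAN. The sample markup and the
variable names are made up for illustration; only the href value comes from
the example earlier in the thread.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample page with an unescaped & inside an attribute value,
# the case that caused trouble earlier in this thread.
my $html = '<html><body><a href="/foo?a=b&c=d">link</a></body></html>';

# Build the tree with the forgiving HTML parser instead of an XML parser.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Existing XPath queries can be reused as they are; in list context
# findnodes returns the matching HTML::Element objects.
my @hrefs = map { $_->attr('href') } $tree->findnodes('//a[@href]');
print "$_\n" for @hrefs;

# HTML::TreeBuilder trees should be freed explicitly when done.
$tree->delete;
```

The hrefs are copied out before the tree is deleted, since the element
objects are not usable afterwards.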

Thanks!
Toby