XML::LibXML and HTML (in >=v1.67)
tjc at wintrmute.net
Thu Apr 2 02:29:11 BST 2009
2009/4/2 mirod <mirod at xmltwig.com>:
> Tatsuhiko Miyagawa wrote:
>> On Tue, Mar 31, 2009 at 10:45 PM, Toby Wintermute <tjc at wintrmute.net>
>>> The problem occurs when the html contains (the commonly used) & symbol
>>> within attributes, such as:
>>> <a href="/foo?a=b&c=d">
> Indeed when I tested the various ways to get XML from HTML, a couple of
> years ago, I found that the best way was to go through HTML::TreeBuilder. It
> managed to make sense, without choking, of more random web pages than both
> tidy and XML::LibXML.
> The only problem I found was with tags like '<table 1>', which get output by
> the as_XML method as '<table 1="1">', which is not well-formed XML.
> This doesn't prevent you from using XPath on it with
> HTML::TreeBuilder::XPath though.
> So HTML::TreeBuilder::XPath, beyond being a shameless plug, is my preferred
> way to process HTML while still being able to use XPath.
Ah, I was hoping I wouldn't have to go and rewrite the XPath queries
as HTML::Tree look_down() queries. So HTML::TreeBuilder::XPath looks great :)
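
For the archive, here is a minimal sketch of what that might look like. It
assumes HTML::TreeBuilder::XPath is installed from CPAN; the HTML string is
made up for illustration and reuses the bare-'&' href from the original
report. It parses the HTML leniently and then runs an XPath query over the
resulting tree instead of a look_down() call.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Illustrative input: a bare '&' inside an attribute value, which a strict
# XML parser would reject.
my $html = '<html><body><a href="/foo?a=b&c=d">link</a></body></html>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# findnodes() takes an XPath expression and returns HTML::Element objects,
# so the usual HTML::Element accessors such as attr() are available.
my @hrefs = map { $_->attr('href') } $tree->findnodes('//a[@href]');

print "$_\n" for @hrefs;

# HTML::TreeBuilder trees contain circular references; free them explicitly.
$tree->delete;
```

The existing XPath expressions should carry over more or less unchanged,
though that is untested against real pages here.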