Parse-text-from-HTML CPAN module ?
Ovid
publiustemp-londonpm at yahoo.com
Fri Dec 9 18:45:41 GMT 2005
--- Stephen Collyer <scollyer at netspinner.co.uk> wrote:
> > my $parser = HTML::TokeParser::Simple->new( file => $file );
> > my $text = '';
> > while (my $token = $parser->get_token) {
> >     $text .= $token->as_is if $token->is_text;
> > }
>
> BTW, I can't figure out from the POD what I get back from as_is.
> Is it something a la SAX characters method where the amount of
> text returned is not defined, or is it a single w/s separated
> word, or what ?
> I guess this is covered in the HTML::TokeParser docs ?
"as_is" is phenomenally misnamed; it should really have been called
"as_string".
Basically, whenever you have a token, you can print $token->as_is to
get the text of the token:
    while (my $token = $parser->get_token) {
        print $token->as_is;
    }
That should print the entire HTML document, unchanged, right down to
the newlines. You will want to read the "CAVEATS" section of the docs
to understand a rare edge case:
http://search.cpan.org/dist/HTML-TokeParser-Simple/lib/HTML/TokeParser/Simple.pm#CAVEATS
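To make the difference between "everything" and "just the text" concrete, here is a minimal sketch. It assumes the parser is built from an in-memory string (new( string => ... )) rather than a file, and the sample HTML is made up for illustration. It collects the as_is output of every token in one variable and of the text tokens only in another:

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample input, not taken from the original thread.
my $html = qq{<p>Hello <b>world</b></p>\n};

my $parser = HTML::TokeParser::Simple->new( string => $html );

my $text = '';    # as_is of text tokens only
my $copy = '';    # as_is of every token, in document order

while ( my $token = $parser->get_token ) {
    $copy .= $token->as_is;
    $text .= $token->as_is if $token->is_text;
}

print "text only: $text";
print "round trip matches input\n" if $copy eq $html;
```

Note that the text-only string still ends with the newline that followed the closing tag, since whitespace between tags is text too. How the parser splits a long run of text across tokens is not something this sketch relies on, because it only ever looks at the concatenated result.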
As a side note, for a quick-n-dirty cleanup of HTML:
    while (my $token = $parser->get_token) {
        $token->rewrite_tag;
        print $token->as_is;
    }
That will lower-case tags and attributes and properly quote attribute
values.
Cheers,
Ovid
--
If this message is a response to a question on a mailing list, please send
follow up questions to the list.
Web Programming with Perl -- http://users.easystreet.com/ovid/cgi_course/