Parse-text-from-HTML CPAN module ?

Fri Dec 9 18:45:41 GMT 2005

--- Stephen Collyer <scollyer at netspinner.co.uk> wrote:

> >   my $parser = HTML::TokeParser::Simple->new( file => $file );
> >   my $text   = '';
> >   while (my $token = $parser->get_token) {
> >       $text .= $token->as_is if $token->is_text;
> >   }
> 
> BTW, I can't figure out from the POD what I get back from as_is.
> Is it something a la SAX characters method where the amount of
> text returned is not defined, or is it a single w/s separated
> word, or what ?
> I guess this is covered in the HTML::TokeParser docs ?

"as_is" is my phenomenally misnamed "as_string" method.

Basically, whenever you have a token, you can print $token->as_is to
get the text of the token:

   while (my $token = $parser->get_token) {
       print $token->as_is;
   }

That should print the entire HTML document, unchanged, right down to
the newlines.  You will want to read the "CAVEATS" section of the docs
to understand a rare edge case: 
http://search.cpan.org/dist/HTML-TokeParser-Simple/lib/HTML/TokeParser/Simple.pm#CAVEATS
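
If you want to check the round-trip behaviour yourself, here is a minimal sketch. It assumes HTML::TokeParser::Simple is installed from CPAN and hands the parser an in-memory string via the string => constructor argument instead of a file. The sample markup and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Made-up sample markup, deliberately sloppy (upper-case tags, unquoted attribute).
my $html = qq{<P CLASS=intro>Hello, <B>world</B>!</P>\n};

# Parse from an in-memory string rather than a file.
my $parser = HTML::TokeParser::Simple->new( string => $html );

my $copy = '';    # every token's as_is, concatenated back together
my $text = '';    # as_is of the text tokens only
while ( my $token = $parser->get_token ) {
    $copy .= $token->as_is;
    $text .= $token->as_is if $token->is_text;
}

# Compare the reassembled document with the input.
print $copy eq $html ? "round trip matches\n" : "round trip differs\n";
print "text only: $text";
```

Collecting only the text tokens, as in the loop quoted at the top, should leave just the character data between the tags.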

As a side note:  for a quick-n-dirty cleanup of HTML:

   while (my $token = $parser->get_token) {
       $token->rewrite_tag;
       print $token->as_is;
   }

That will lowercase tag and attribute names and properly quote attribute
values.
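
A self-contained sketch of the same cleanup, again assuming HTML::TokeParser::Simple is installed and using a made-up input string (the names $messy, $tidy_parser and $clean are only for illustration):

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Made-up input with upper-case tags and an unquoted attribute value.
my $messy = qq{<P CLASS=intro>Hello, <B>world</B>!</P>\n};

my $tidy_parser = HTML::TokeParser::Simple->new( string => $messy );

my $clean = '';
while ( my $token = $tidy_parser->get_token ) {
    $token->rewrite_tag;        # called on every token, as in the loop above
    $clean .= $token->as_is;    # as_is now returns the rewritten tag text
}

print $clean;
```

This only normalises the tags it sees. It does not balance or repair broken markup, hence "quick-n-dirty".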

Cheers,
Ovid
