Parse-text-from-HTML CPAN module ?

Ovid publiustemp-londonpm at
Fri Dec 9 18:45:41 GMT 2005

--- Stephen Collyer <scollyer at> wrote:

> >   my $parser = HTML::TokeParser::Simple->new( file => $file );
> >   my $text   = '';
> >   while (my $token = $parser->get_token) {
> >       $text .= $token->as_is if $token->is_text;
> >   }
> BTW, I can't figure out from the POD what I get back from as_is.
> Is it something a la SAX characters method where the amount of
> text returned is not defined, or is it a single w/s separated
> word, or what ?
> I guess this is covered in the HTML::TokeParser docs ?

"as_is" is my phenomenally misnamed "as_string" method.

Basically, whenever you have a token, you can print $token->as_is to
get the text of the token:

   while (my $token = $parser->get_token) {
       print $token->as_is;
   }

That should print the entire HTML document, unchanged, right down to
the newlines.  You will want to read the "CAVEATS" section of the docs
to understand a rare edge case.
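
If you want to convince yourself of that, here is a minimal sketch (not
taken from the docs).  It assumes a made-up sample string in $html and the
"string => ..." form of the constructor; $copy and $text are just
illustrative names.  It glues every token's as_is back together and
compares the result with the input, and also collects the text-only
tokens the way your loop does:

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample document; any HTML string would do.
my $html = qq{<HTML><BODY>\n<P CLASS=intro>Hello, <B>world</B>!</P>\n</BODY></HTML>\n};

my $parser = HTML::TokeParser::Simple->new( string => $html );

my ( $copy, $text ) = ( '', '' );
while ( my $token = $parser->get_token ) {
    $copy .= $token->as_is;                       # every token, verbatim
    $text .= $token->as_is if $token->is_text;    # text tokens only
}

# Report whether the concatenated tokens reproduce the input exactly.
print $copy eq $html ? "round trip ok\n" : "round trip differs\n";
print $text;
```

Note that a run of text may come back as more than one text token, so
don't rely on one token per "word" or per line; concatenating them is the
safe approach.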

As a side note:  for a quick-n-dirty cleanup of HTML:

   while (my $token = $parser->get_token) {
       $token->rewrite_tag if $token->is_tag;
       print $token->as_is;
   }

That will lower-case tags and attributes and properly quote attribute
values.


