Parse-text-from-HTML CPAN module ?

Stephen Collyer scollyer at
Fri Dec 9 18:00:45 GMT 2005

Ovid wrote:
> --- Stephen Collyer <scollyer at> wrote:
>
>>Thanks. Still rather more low level than what I'd like ideally.
>>Maybe I should stop looking and start coding - it may be quicker.
> Agreed that it's lower level than what you want, but it does make
> extracting text pretty quick:
>   my $parser = HTML::TokeParser::Simple->new( file => $file );
>   my $text   = '';
>   while (my $token = $parser->get_token) {
>       $text .= $token->as_is if $token->is_text;
>   }

Right. It doesn't look like a bad place to start; I guess processing
the HTML via a lexer-like interface gives lots of scope for
building up any required data structure on-the-fly.
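
To make that concrete, here is a minimal sketch of building a structure
while walking the token stream. The sample HTML and the @links / $current
names are just made up for illustration, and I'm assuming the string =>
constructor argument plus the is_start_tag, is_end_tag and get_attr
methods behave as the HTML::TokeParser::Simple POD describes. Text is
appended rather than assigned, so it shouldn't matter whether the text
inside a tag arrives as one token or several.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample input, purely for illustration.
my $html = '<p>See <a href="/a">first</a> and <a href="/b">second</a>.</p>';

my $parser = HTML::TokeParser::Simple->new( string => $html );

my @links;      # finished records: { href => ..., text => ... }
my $current;    # record being filled while we are inside an <a> element

while ( my $token = $parser->get_token ) {
    if ( $token->is_start_tag('a') ) {
        # Open a new record when an <a> start tag is seen.
        $current = { href => $token->get_attr('href'), text => '' };
    }
    elsif ( $token->is_end_tag('a') ) {
        # Close the record on </a>.
        push @links, $current if $current;
        undef $current;
    }
    elsif ( $token->is_text && $current ) {
        # Append, so text split across several tokens is still collected.
        $current->{text} .= $token->as_is;
    }
}

printf "%s -> %s\n", $_->{text}, $_->{href} for @links;
```

The same loop shape should work for other structures (headings, table
cells and so on) by changing which tags open and close a record.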

BTW, I can't figure out from the POD what I get back from as_is.
Is it something like the SAX characters method, where the amount of text
returned per call is not defined, or is it a single whitespace-separated
word, or what? I guess this is covered in the HTML::TokeParser docs?


Stephen Collyer
Netspinner Ltd
