Parse-text-from-HTML CPAN module ?
Stephen Collyer
scollyer at netspinner.co.uk
Fri Dec 9 18:00:45 GMT 2005
Ovid wrote:
> --- Stephen Collyer <scollyer at netspinner.co.uk> wrote:
>
>>> <http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/lib/HTML/TokeParser/Simple/Token/Text.pm>
>>>
>>
>>Thanks. Still rather more low level than what I'd like ideally.
>>Maybe I should stop looking and start coding - it may be quicker.
>
>
> Agreed that it's lower level than what you want, but it does make
> extracting text pretty quick:
>
>   my $parser = HTML::TokeParser::Simple->new( file => $file );
>   my $text   = '';
>   while ( my $token = $parser->get_token ) {
>       $text .= $token->as_is if $token->is_text;
>   }

Right. It doesn't look like a bad place to start; I guess processing
the HTML via a lexer-like interface gives plenty of scope for
building up whatever data structure is required on the fly.

BTW, I can't figure out from the POD what I get back from as_is.
Is it something like the SAX characters() method, where the amount of
text returned per call is not defined, or is it a single
whitespace-separated word, or something else? I guess this is covered
in the HTML::TokeParser docs?
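
One way to answer the as_is question empirically, and to try the
build-it-as-you-go idea at the same time, is a small sketch like the
one below. It is only an illustration under stated assumptions: the
sample HTML string, the @chunks list and the @open tag stack are made
up for this example, and the stack handling is deliberately naive (it
assumes well-nested markup with no void elements such as <br>). Each
text token is stored separately and printed between brackets, so the
token boundaries are visible rather than assumed.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample input; substitute any HTML of interest.
my $html = '<h1>Title</h1><p>First <b>bold</b> para</p>';

my $parser = HTML::TokeParser::Simple->new( string => $html );

my @chunks;    # one entry per text token, with its enclosing tag
my @open;      # stack of currently open tag names (naive)

while ( my $token = $parser->get_token ) {
    if ( $token->is_start_tag ) {
        push @open, $token->get_tag;
    }
    elsif ( $token->is_end_tag ) {
        pop @open;
    }
    elsif ( $token->is_text ) {
        # Record the raw text exactly as as_is returns it.
        push @chunks, { in => $open[-1], text => $token->as_is };
    }
}

# Brackets make leading/trailing whitespace and token boundaries visible.
for my $chunk (@chunks) {
    my $tag = defined $chunk->{in} ? $chunk->{in} : '-';
    printf "%-4s [%s]\n", $tag, $chunk->{text};
}
```

Running this against a larger file (new( file => $file ) instead of
the string form) should show whether a long run of text arrives as one
token or as several, without relying on a reading of the POD.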
--
Regards
Stephen Collyer
Netspinner Ltd