Parse-text-from-HTML CPAN module ?

Jonathan Stowe jns at
Fri Dec 9 11:52:57 GMT 2005

On Fri, 2005-12-09 at 11:10, Stephen Collyer wrote:
> I have a search-related requirement to take some arbitrary HTML,
> parse out the text and stem it/apply stop words and so on. Now,
> I can cook something up myself with the usual set of modules, but
> this sounds like such a common requirement that someone will
> already have done it and packaged it up, in a nice reusable form.
> Does anyone know if there's a nice, Pure Perl implementation of
> this that I can pick up and use with no further brain-power required ?
> (I'm wondering if there's something in the WWW::Mechanize area that
> is suitable, as that seems to have grown a lot since I last looked).

Getting just the text is a piece of piss with HTML::Parser:

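use strict;
use warnings;
use HTML::Parser;

# Sample document to parse
my $the_file = <<EOH;
<h1>Test Title</h1>
<p>This is a test</p></body></html>
EOH

# "dtext" passes each text chunk to the handler with entities decoded
my $parser = HTML::Parser->new(
    text_h           => [ \&text_handler, "self,dtext" ],
    start_document_h => [ \&init, "self" ],
);
$parser->parse($the_file);
$parser->eof;
print @{ $parser->{_private}->{text} };

# Runs once before parsing starts: set up somewhere to collect the text
sub init {
    my ($self) = @_;
    $self->{_private}->{text} = [];
}

# Runs for every chunk of text found between tags
sub text_handler {
    my ( $self, $text ) = @_;
    push @{ $self->{_private}->{text} }, $text;
}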

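For the stop-word half of the question, something along these lines might do as a starting point. It is only a minimal sketch: the stop list is a tiny made-up one and keywords() is a hypothetical helper, neither comes from any module, and stemming is left out entirely.

```perl
use strict;
use warnings;

# Hypothetical stop list for illustration; a real one would be much longer.
my %stop = map { $_ => 1 } qw(a an and is the this of to);

# Split each chunk into words, lowercase them, and drop the stop words.
sub keywords {
    my (@chunks) = @_;
    my @words;
    for my $chunk (@chunks) {
        push @words, map { lc($_) } $chunk =~ /([A-Za-z']+)/g;
    }
    return grep { !$stop{$_} } @words;
}

# Chunks shaped like those the text handler collects from the sample HTML.
my @found = keywords( "Test Title", "\n", "This is a test", "\n" );
print join( " ", @found ), "\n";
```

Run on its own this prints "test title test". In practice the list passed to keywords() would be the contents of @{ $parser->{_private}->{text} } once parsing has finished.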
