Parse-text-from-HTML CPAN module ?
Andy Wardley
abw at wardley.org
Sat Dec 10 10:03:55 GMT 2005
Stephen Collyer wrote:
> I have a search-related requirement to take some arbitrary HTML,
> parse out the text and stem it/apply stop words and so on. Now,
> I can cook something up myself with the usual set of modules, but
> this sounds like such a common requirement that someone will
> already have done it and packaged it up, in a nice reusable form.
Not in a nice reusable form, but I have code you can cut-n-paste.
http://wardley.org/perl/Search.pm
It's a hacked-up module I wrote as part of a project for a customer.
It's based on code I gleaned from Advanced Perl Programming.
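For the general shape of the job you describe, here is a rough sketch using core Perl only. It is not the code in Search.pm. The stop list, the regex tag stripper and the toy suffix-stripping stem() are all stand-ins I've assumed for illustration; for arbitrary HTML you'd want a real parser such as HTML::Parser and a proper stemmer.

```perl
use strict;
use warnings;

# Tiny illustrative stop list (assumed); a real one would be much longer.
my %stop = map { $_ => 1 } qw(a an and the of to in is it);

# Crude tag stripper. Assumes simple, well-formed markup only.
sub html_to_text {
    my ($html) = @_;
    $html =~ s/<(script|style)\b.*?<\/\1>/ /gis;   # drop script/style bodies
    $html =~ s/<[^>]*>/ /g;                         # drop remaining tags
    return $html;
}

# Toy suffix stripper standing in for a real stemming algorithm.
sub stem {
    my ($word) = @_;
    $word =~ s/(?:ing|ed|s)$// if length($word) > 4;
    return $word;
}

# HTML in, list of lower-cased, stop-filtered, stemmed terms out.
sub index_terms {
    my ($html) = @_;
    my @words = map { lc } html_to_text($html) =~ /([A-Za-z]+)/g;
    return map { stem($_) } grep { !$stop{$_} } @words;
}

print join(' ', index_terms('<p>The badgers are <b>ringing</b> bells</p>')), "\n";
```

The order matters: stop words are removed before stemming so that the stop list can be written as plain words rather than stems.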
You can see it working here:
http://wardray-premise.com/
You'll need to tweak it a bit to get it working. Change 'WP::Base' to
'Class::Base', provide your own config values instead of 'WP::Config',
and remove any user-specific search tweaks I may have added (unless
you happen to be indexing many documents that contain the word "x-ray").
Usage is something like this:
    my $search = WP::Search->new();

    $search->index_file($path, { title    => "The Badger's Bell End",
                                 keywords => "Badger, bell, machine gun" });

    my $results = $search->search("badger rabbit bell ringing");

    # $results->{ query }      # original query
    # $results->{ words }      # words in query
    # $results->{ stems }      # stems of words in query
    # $results->{ results }    # list of results, each is hash containing
    #                          # document, relevance and percent items.
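Walking that structure might look something like the sketch below. The $results hash here is hand-built with made-up values so the snippet runs on its own. I'm assuming that 'results' is an array reference of hash references and that 'words' and 'stems' are array references; check the module source before relying on that. format_results() is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Hand-built stand-in for the return value of search(); values are invented.
my $results = {
    query   => 'badger rabbit bell ringing',
    words   => [qw(badger rabbit bell ringing)],   # assumed to be an array ref
    stems   => [qw(badger rabbit bell ring)],      # assumed to be an array ref
    results => [
        { document => 'a.html', relevance => 3, percent => 100 },
        { document => 'b.html', relevance => 1, percent => 33  },
    ],
};

# Hypothetical helper: one display line per hit, in the order returned.
sub format_results {
    my ($results) = @_;
    my @lines;
    for my $hit (@{ $results->{results} }) {
        push @lines, sprintf "%3d%%  %s", $hit->{percent}, $hit->{document};
    }
    return @lines;
}

print "Query: $results->{query}\n";
print "$_\n" for format_results($results);
```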
HTH
A