Boolean-style text searching
Stephen Collyer
scollyer at netspinner.co.uk
Fri Dec 9 16:05:39 GMT 2005
Paul Makepeace wrote:
> Are there any CPAN goodies that will effect text searching à la "(escort
> OR ford) AND NOT (estuary OR brook OR erotic services)"? Or at least
> some of the way there?
>
> Paul
>
Do you want to search unindexed text, or do you intend to
index it first? I have a Parse::RecDescent (P::RD) grammar
that does this; it relies on a binary-coded inverted index
that lives in a MySQL DB, thus:
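> my $BooleanExpr = q{
>
>     # Top level: the whole input must parse, then the result is listed.
>     boolean : expr /\Z/
>         {
>             blob2list($item[1]);
>         }
>
>     expr : disj
>         {
>             $item[1];
>         }
>
>     # Lowest precedence: or. The \b stops "or" matching inside "orange".
>     disj : <leftop: conj /(?:or\b|\|)/ conj>
>         {
>             or_indices($item[1]);
>         }
>
>     # Next: and. It binds tighter than or.
>     conj : <leftop: unary /(?:and\b|\&)/ unary>
>         {
>             and_indices($item[1]);
>         }
>
>     # Highest precedence: not, a bare word, or a parenthesised expr.
>     unary : /(?:not\b|\!)/ unary
>         {
>             get_not_raw_word_index($item[2]);
>         }
>       |
>         atom
>         {
>             $item[1];
>         }
>       |
>         '(' expr ')'
>         {
>             $item[2];
>         }
>
>     atom : /[\w+#-]+/
>         {
>             get_raw_word_index($item[1])
>         }
>
> };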
The blob2list/and_indices/etc functions perform the
DB-related index operations, which I can dig out
if you're sufficiently interested, but this may be enough
to get you going. I hacked the binary index logic myself,
but ISTR there are some CPAN modules that will do this for
you now.
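As a rough illustration of what helpers like these might do, here is a minimal sketch that swaps the MySQL-backed index for an in-memory hash of bit strings, one bit per document. Everything about the storage is an assumption made for the example (`%INDEX`, `$NUM_DOCS`, `add_doc` and `empty_blob` are hypothetical), and it assumes every grammar action hands back an index blob; the real functions may well look different.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the DB: word => bit string, where bit N is set
# when document N contains the word.
my $NUM_DOCS = 8;    # assumed corpus size, for illustration only
my %INDEX;

# All-zero bit string wide enough for $NUM_DOCS documents.
sub empty_blob { return "\0" x int( ( $NUM_DOCS + 7 ) / 8 ) }

# Hypothetical indexing step: record that $doc_id contains @words.
sub add_doc {
    my ( $doc_id, @words ) = @_;
    for my $w (@words) {
        my $key = lc $w;
        $INDEX{$key} //= empty_blob();
        vec( $INDEX{$key}, $doc_id, 1 ) = 1;
    }
}

# Look up one word; unknown words match no documents.
sub get_raw_word_index {
    my ($word) = @_;
    return $INDEX{ lc $word } // empty_blob();
}

# Complement a blob, masking off padding bits beyond $NUM_DOCS.
sub get_not_raw_word_index {
    my ($blob) = @_;
    my $all = empty_blob();
    vec( $all, $_, 1 ) = 1 for 0 .. $NUM_DOCS - 1;
    return ( ~$blob ) & $all;    # string-wise bitwise ops
}

# Both take an array ref of blobs, which is what <leftop:> produces.
sub and_indices {
    my ( $acc, @rest ) = @{ $_[0] };
    $acc &= $_ for @rest;
    return $acc;
}

sub or_indices {
    my ( $acc, @rest ) = @{ $_[0] };
    $acc |= $_ for @rest;
    return $acc;
}

# Turn a blob back into a list of matching document numbers.
sub blob2list {
    my ($blob) = @_;
    return [ grep { vec( $blob, $_, 1 ) } 0 .. $NUM_DOCS - 1 ];
}

add_doc( 0, qw(fish salmon) );
add_doc( 1, qw(fish carp) );
add_doc( 2, qw(fish trout) );
add_doc( 3, qw(bird) );

# Hand-built equivalent of "fish and not (salmon or carp)".
my $hits = blob2list(
    and_indices( [
        get_raw_word_index('fish'),
        get_not_raw_word_index(
            or_indices( [ map { get_raw_word_index($_) } qw(salmon carp) ] )
        ),
    ] )
);
print "@$hits\n";    # prints "2"
```

Keeping each posting list as a bit string means AND, OR and NOT become single string operations, at the cost of one bit per document per word.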
You can run the grammar above with something like:
> use Parse::RecDescent;
>
> my $parser = Parse::RecDescent->new($BooleanExpr);
>
> my $result = $parser->boolean("fish and not(salmon or carp)");
and so on.
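For the unindexed case, the same precedence (not binds tighter than and, which binds tighter than or) can be applied straight to a string. The following is a hand-rolled sketch in core Perl, not the P::RD grammar itself: `matches` is a hypothetical helper, and whole-word, case-insensitive matching is an assumption made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: does $text satisfy the boolean $query?
sub matches {
    my ( $query, $text ) = @_;

    # Words present in the text, using the same character class as "atom".
    my %words = map { ( lc($_) => 1 ) } $text =~ /[\w+#-]+/g;

    # Split the query into operators, parentheses and words.
    my @tokens = map { lc($_) } $query =~ /[()|&!]|[\w+#-]+/g;
    my $pos    = 0;

    my ( $disj, $conj, $unary );

    # disj : conj ( (or | '|') conj )*
    $disj = sub {
        my $v = $conj->();
        while ( $pos < @tokens && $tokens[$pos] =~ /^(?:or|\|)$/ ) {
            $pos++;
            my $rhs = $conj->();    # always consume the right-hand side
            $v ||= $rhs;
        }
        return $v;
    };

    # conj : unary ( (and | '&') unary )*
    $conj = sub {
        my $v = $unary->();
        while ( $pos < @tokens && $tokens[$pos] =~ /^(?:and|&)$/ ) {
            $pos++;
            my $rhs = $unary->();
            $v &&= $rhs;
        }
        return $v;
    };

    # unary : (not | '!') unary | '(' disj ')' | word
    $unary = sub {
        my $tok = $tokens[ $pos++ ] // die "unexpected end of query\n";
        return !$unary->() if $tok =~ /^(?:not|!)$/;
        if ( $tok eq '(' ) {
            my $v = $disj->();
            ( $tokens[ $pos++ ] // '' ) eq ')' or die "missing ')'\n";
            return $v;
        }
        return $words{$tok} ? 1 : 0;
    };

    my $result = $disj->();
    die "trailing input in query\n" if $pos < @tokens;
    return $result ? 1 : 0;
}

my $query = "fish and not(salmon or carp)";
print matches( $query, "fresh fish and trout" ) ? "match\n" : "no match\n";
print matches( $query, "fish with salmon" )     ? "match\n" : "no match\n";
```

This rescans the text for every query, so it only suits small inputs; an index pays off once the corpus grows.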
--
Regards
Stephen Collyer
Netspinner Ltd