Search::QueryParser(3) parses a query string into a data structure

SYNOPSIS


my $qp = new Search::QueryParser;
my $s = '+mandatoryWord -excludedWord +field:word "exact phrase"';
my $query = $qp->parse($s) or die "Error in query : " . $qp->err;
$someIndexer->search($query);
# query with comparison operators and implicit plus (second arg is true)
$query = $qp->parse("txt~'^foo.*' date>='01.01.2001' date<='02.02.2002'", 1);
# boolean operators (example below is equivalent to "+a +(b c) -d")
$query = $qp->parse("a AND (b OR c) AND NOT d");
# subset of rows
$query = $qp->parse("Id#123,444,555,666 AND (b OR c)");

DESCRIPTION

This module parses a query string into a data structure to be handled by external search engines. For examples of such engines, see File::Tabular and Search::Indexer.

The query string can contain simple terms, ``exact phrases'', field names and comparison operators, '+/-' prefixes, parentheses, and boolean connectors.

The parser can be parameterized by regular expressions for specific notions of ``term'', ``field name'' or ``operator'' ; see the new method. The parser has no support for lemmatization or other term transformations : these should be done externally, before passing the query data structure to the search engine.

The data structure resulting from a parsed query is a tree of terms and operators, as described below in the parse method. The interpretation of the structure is up to the external search engine that will receive the parsed query ; the present module does not make any assumption about what it means to be ``equal'' or to ``contain'' a term.

QUERY STRING

The query string is decomposed into ``items'', where each item has an optional sign prefix, an optional field name and comparison operator, and a mandatory value.

Sign prefix

Prefix '+' means that the item is mandatory. Prefix '-' means that the item must be excluded. No prefix means that the item will be searched for, but is not mandatory.

As far as the result set is concerned, "+a +b c" is strictly equivalent to "+a +b" : the search engine will return documents containing both terms 'a' and 'b', and possibly also term 'c'. However, if the search engine also returns relevance scores, query "+a +b c" might give a better score to documents containing also term 'c'.

See also section ``Boolean connectors'' below, which is another way to combine items into a query.

Field name and comparison operator

Internally, each query item has a field name and comparison operator; if not written explicitly in the query, these take default values '' (empty field name) and ':' (colon operator).

Operators have a left operand (the field name) and a right operand (the value to be compared with); for example, "foo:bar" means ``search documents containing term 'bar' in field 'foo''', whereas "foo=bar" means ``search documents where field 'foo' has exact value 'bar'''.

Here is the list of admitted operators with their intended meaning :

":"
treat value as a term to be searched within field. This is the default operator.
"~" or "=~"
treat value as a regex; match field against the regex.
"!~"
negation of above
"==" or "=", "<=", ">=", "!=", "<", ">"
classical relational operators
"#"
Inclusion in the set of comma-separated integers supplied on the right-hand side.

Operators ":", "~", "=~", "!~" and "#" admit an empty left operand (so the field name will be ''). Search engines will usually interpret this as ``any field'' or ``the whole data record''.

Value

A value (right operand to a comparison operator) can be
  • just a term (as recognized by regex "rxTerm", see new method below)
  • A quoted phrase, i.e. a collection of terms within single or double quotes.

    Quotes can be used not only for ``exact phrases'', but also to prevent misinterpretation of some values : for example "-2" would mean ``value '2' with prefix '-''', in other words ``exclude term '2''', so if you want to search for value -2, you should write "-2" instead. In the last example of the synopsis, quotes were used to prevent splitting of dates into several search terms.

  • a subquery within parentheses. Field names and operators distribute over parentheses, so for example "foo:(bar bie)" is equivalent to "foo:bar foo:bie". Nested field names such as "foo:(bar:bie)" are not allowed. Sign prefixes do not distribute : "+(foo bar) +bie" is not equivalent to "+foo +bar +bie".

Boolean connectors

Queries can contain boolean connectors 'AND', 'OR', 'NOT' (or their equivalent in some other languages). This is mere syntactic sugar for the '+' and '-' prefixes : "a AND b" is translated into "+a +b"; "a OR b" is translated into "(a b)"; "NOT a" is translated into "-a". "+a OR b" does not make sense, but it is translated into "(a b)", under the assumption that the user understands ``OR'' better than a '+' prefix. "-a OR b" does not make sense either, but has no meaningful approximation, so it is rejected.

Combinations of AND/OR clauses must be surrounded by parentheses, i.e. "(a AND b) OR c" or "a AND (b OR c)" are allowed, but "a AND b OR c" is not.

METHODS

new
  new(rxTerm   => qr/.../, rxOp => qr/.../, ...)

Creates a new query parser, initialized with (optional) regular expressions :

rxTerm
Regular expression for matching a term. Of course it should not match the empty string. Default value is "qr/[^\s()]+/". A term should not be allowed to include parenthesis, otherwise the parser might get into trouble.
rxField
Regular expression for matching a field name. Default value is "qr/\w+/" (meaning of "\w" according to "use locale").
rxOp
Regular expression for matching an operator. Default value is "qr/==|<=|>=|!=|=~|!~|:|=|<|>|~/". Note that the longest operators come first in the regex, because ``alternatives are tried from left to right'' (see ``Version 8 Regular Expressions'' in perlre) : this is to avoid "a<=3" being parsed as "a < '=3'".
rxOpNoField
Regular expression for a subset of the operators which admit an empty left operand (no field name). Default value is "qr/=~|!~|~|:/". Such operators can be meaningful for comparisons with ``any field'' or with ``the whole record'' ; the precise interpretation depends on the search engine.
rxAnd
Regular expression for boolean connector AND. Default value is "qr/AND|ET|UND|E/".
rxOr
Regular expression for boolean connector OR. Default value is "qr/OR|OU|ODER|O/".
rxNot
Regular expression for boolean connector NOT. Default value is "qr/NOT|PAS|NICHT|NON/".
defField
If no field is specified in the query, use defField. The default is the empty string "".
parse
  $q = $queryParser->parse($queryString, $implicitPlus);

Returns a data structure corresponding to the parsed string. The second argument is optional; if true, it adds an implicit '+' in front of each term without prefix, so "parse("+a b c -d", 1)" is equivalent to "parse("+a +b +c -d")". This is often seen in common WWW search engines as an option ``match all words''.

The return value has following structure :

  { '+' => [{field=>'f1', op=>':', value=>'v1', quote=>'q1'}, 
            {field=>'f2', op=>':', value=>'v2', quote=>'q2'}, ...],
    ''  => [...],
    '-' => [...]
  }

In other words, it is a hash ref with 3 keys '+', '' and '-', corresponding to the 3 sign prefixes (mandatory, ordinary or excluded items). Each key holds either a ref to an array of items, or "undef" (no items with this prefix in the query).

An item is a hash ref containing

"field"
scalar, field name (may be the empty string)
"op"
scalar, operator
"quote"
scalar, character that was used for quoting the value ('``', '''" or undef)
"value"
Either
  • a scalar (simple term), or
  • a recursive ref to another query structure. In that case, "op" is necessarily '()' ; this corresponds to a subquery in parentheses.

In case of a parsing error, "parse" returns "undef"; method err can be called to get an explanatory message.

err
  $msg = $queryParser->err;

Message describing the last parse error

unparse
  $s = $queryParser->unparse($query);

Returns a string representation of the $query data structure.

AUTHOR

Laurent Dami, <laurent.dami AT etat ge ch>

COPYRIGHT AND LICENSE

Copyright (C) 2005, 2007 by Laurent Dami.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.