KinoSearch::Docs::DevGuide(3) hacking/debugging KinoSearch

SYNOPSIS

Developer-only documentation. If you just want to build a search engine, you probably don't need to read this.

Fundamental Classes

Most of the classes in KinoSearch rely on KinoSearch::Util::Class and KinoSearch::Util::ToolSet, so you'll probably want to familiarize yourself with them.

Object Oriented Design

No public member variables.

Multiple classes defined within a single source-code file, e.g. TermQuery and TermWeight, may use direct access to get at each others member variables. Everybody else has to use accessor methods.

C-struct based classes such as TermInfo allow direct access to their members, but only from C (of course).

Subroutine/method access levels

There are three access levels in KinoSearch.
1.
public: documented in ``visible'' pod.
2.
private: subs which are prepended with an _underscore may only be used within the package in which they reside --- as per perlstyle guidelines --- and in only one source file.
3.
distro: any sub which doesn't fall into either category above may be used anywhere within the KinoSearch distribution.

Documentation Conventions

KinoSearch's public API is defined by what you get when you run the suite through a well-behaved pod-to-whatever converter. Developer-only documentation is limited to comments and ``invisible'' =for/=begin POD blocks.

Integration of XS and C code

XS and C code in KinoSearch is stored faux-Inline-style, after an "__END__" token, and delimited by either "__XS__", "__H__", or "__C__". A heavily customized Build.PL detects these code blocks and writes out hard files at install-time, so the inlining is mostly for convenience while editing: the XS code is often tightly coupled to the Perl code in a given module, and having everything in one place makes it easier to see what's going on and move things back and forth.

Build.PL writes out separate .h and .c files for each block it finds, but all the XS blocks are concatenated into a single file --- KinoSearch.xs. The content of KinoSearch.xs consists of the XS block from KinoSearch.pm, followed by all the other XS blocks in an undetermined order. Ultimately, only a single compiled library gets installed along with the Perl modules.

At runtime, the only module which calls XSLoader::load is KinoSearch. Because the KinoSearch "MODULE" has many "PACKAGE"s, "use KinoSearch;" loads all of the XS routines in the entire KinoSearch suite. A pure-Perl version of KinoSearch.pm which did the same thing might look like this...

    package KinoSearch;
    our $VERSION = 1.0;
    package KinoSearch::Index::TermInfo;
    
    sub get_doc_freq {
        # ...
    }
    
    package KinoSearch::Store::InStream;
    
    sub lu_read {
        # ...
    }
    
    # ...

Since KinoSearch.xs is only generated/modified when Build.PL is run, an extra command line call to Build.PL has to be integrated into the development workflow when working on XS or C material.

    % perl Build.PL; ./Build code; perl -Mblib t/some_test.t

Build.PL tracks modification times, using them to determine whether it needs to recompile anything. If only pure Perl modules have been edited, it won't force needless recompilation, and if only a limited number of .pm files containing XS/C/H code have been edited, it will recompile as little as it can.

Divison of labor between Perl, XS, and C

Technically, most or all of the C code in KinoSearch could be stuffed into XS functions. However, the mechanics of moving data across the Perl/C boundary are complicated, and the added cruft tends to make what's going on in the C code more difficult to grok.

To maximize clarity, when possible XS in KinoSearch is limited to ``glue'' code, while Perl and C do the heavy lifting. Exceptions occur when XS functions need to manipulate the Perl stack, for instance when returning more than one value.

Relationship to Lucene

API Differences, both public and private

Since there is no method overloading by signature in Perl, it's impossible for KinoSearch to mimic Lucene's API exactly. For a variety of other reasons, most of them performance-related, it's difficult or inadvisable to even try. The most crucial API distinction between KinoSearch and Lucene relates to indexing: KinoSearch, like its predecessor, Search::Kinosearch, forces you to specify field definitions in advance; in contrast, Lucene employs an elaborate apparatus behind the scenes which adapts field definitions on the fly. If KinoSearch were to implement that apparatus, indexing would be substantially slower than it is.

Since KinoSearch could only muster a poor imitation of Lucene's API at best, it doesn't work very hard at imitating it --- it just tries to be a good, idiomatic Perl search API. Here's a sampling of other ways in which KinoSearch and Lucene differ:

  • In Lucene both IndexWriter and IndexReader can be used to modify an index, while in KinoSearch all index modifications are performed via the InvIndexer class.
  • Since Perl's filehandles can be used to write to scalars, there's little benefit to having separate classes like RAMIndexOutput and FSIndexOutput, and KinoSearch only has two IO classes: InStream and OutStream.
  • Certain classes have had their names transformed in KinoSearch, usually by shortening (SegmentTermDocs => SegTermDocs, Document => Doc), but sometimes arbitrarily (Directory => InvIndex).
  • Other classes are specific to KinoSearch, e.g. PostingsWriter, PolyAnalyzer. These are either scavenged from Search::Kinosearch, or written from scratch.
  • A lot of the common method names in Lucene are core keywords in Perl: length, close, seek, read, write. Where it seemed important to disabiguate those, they have been changed --- e.g., for a Perl programmer accustomed to the passive behavior of CORE::close, having a close() method that triggers file-writes is counter-intutive, so such behaviors have typically been shunted into a method called finish().

Nevertheless, developers who speak both Perl and Java shouldn't find it too difficult to move back and forth between Lucene and KinoSearch --- as long as what they want to do is within the scope of KinoSearch's leaner API. At the time of this writing KinoSearch's public API is artificially small because some pieces aren't done, and it's a young library so preserving the flexibility to change the internal implementation is paramount. But KinoSearch does not aspire to do all the things that Lucene does --- only the things that can be done well, and maybe a few tricks of its own.

Thematic architectural differences

The relatively low object-oriented overhead of Java makes possible some architectures in Lucene which do not translate well to Perl. For instance, arrays are often accumulated in Lucene by calling an iterator's next() method over and over. Method calls in Perl, which are always dynamic, are more costly, making this technique impractical in areas which are both data-intensive and performance-critical. Most often, KinoSearch solves the problem by moving the algorithm to a for/while loop in either Perl or C. In a minority of cases, frequently called methods are emulated using either a globally exposed C function or C pointer-to-function.

Lucene is also a profligate wastrel when it comes to creating and destroying objects --- it can afford to be, because Java objects come cheap. For example, during the indexing phase Lucene creates a mini-inverted-index for each document, complete with its own FieldsWriter, DocumentWriter, and so on. Having Perl objects do the same thing in the same way yields disappointing execution speed. Therefore, KinoSearch does the same thing, but in a different way: it uses a completely different merge model, based around serialization and external sorting, which makes it possible to write indexes by-the-segment rather than by-the document --- slashing the OO requirements and largely offsetting the extra overhead.

Similar architectural differences exist throughout KinoSearch. While the file-format dictates certain algorithms, when idiomatic alternatives presented themselves or were already known, they were adopted.

Search Logic

KinoSearch's search logic differs from Lucene's. In particular, very short fields do not contribute as much to a score, making it essential to assign a <boost> of greater than 1 to short but important fields, e.g. ``title''.

File Format

Lucene's file format is a beautiful thing. KinoSearch was retooled to use it because it didn't seem possible to improve on it more than superficially. It's somewhat hard to work with, and impossible to work with efficiently using only native Perl, but it has both data-compression and search-time-execution-speed nailed.

Unfortunately, the way Lucene defines a String has a Java-centric quirk that deep-sixes prospects for 100% index compatibility, and the only realistic solutions involve changes to Lucene. In theory, these changes (moving to bytecount-based Strings) should actually improve Lucene's performance, and the Lucene development community has expressed openness to them. At some point, time may be found to write these win-win patches, but at present working on KinoSearch itself takes higher priority.

Here's a list of the ways in which KinoSearch's file format differ's from Lucene's:

  • Strings - Lucene's file format specifies that a String is a VInt count of the number of java chars a string occupies, followed by the character data encoded in Sun's ``modified UTF-8'' format (a misnamed, illegal variant, unsanctioned by the Unicode Consortium). Dealing with this format in Perl has proven to be so fiddly and performance-killing as to be not worth the effort. At present, KinoSearch's String format is: a byte count, followed by arbitrary data.
  • Field Numbers - KinoSearch requires that field numbers correspond to lexically sorted order of field names; Lucene indexes are unlikely to have this property. KinoSearch 0.05 had compatibility code which dealt with unordered field numbers, but that code was removed in 0.06. It can be restored if need be.
  • Term Vectors - KinoSearch groups term vector data with stored field data; Lucene uses separate files. Keeping term vector data and field data together minimizes disk activity for common usage patterns. It's also a little simpler to maintain. However, if the String issue is resolved in a future version of Lucene, KinoSearch will likely switch to Lucene's format.

Coding style

The gist: when possible, KinoSearch's Perl code follows the recommendations set out in Damian Conway's book, ``Perl Best Practices'', and its XS/C code more or less (no tab characters allowed) emulates the style of the source code for Perl itself.

Perl code is auto-formatted using a PerlTidy-based helper app called kinotidy, which is basically perltidy with a profile set up to use the PBP settings.

It would be nice if there were a formatter for XS and C code that was as good as PerlTidy. Since there isn't, the code is manually set to look as though it was, with one important difference: a bias towards maximum parenthetical tightness.

In both Perl and XS/C, code is organized into commented paragraphs a few lines in length, as per PBP recommendations. Strong efforts are made to keep the comment to a single line. Stupefyingly obvious ``code narration'' comments are used when something more literate doesn't present itself --- the goal is to be able to grok the intended flow of a function by scanning the first line of each ``paragraph''.

COPYRIGHT

Copyright 2005-2009 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.

See KinoSearch version 0.165.