Zerg(3) a lexical scanner for BLAST reports.

SYNOPSIS

use Zerg;

DESCRIPTION

This manpage describes the Zerg library and its interface for use with Perl.

The Zerg library contains a C/flex lexical scanner for BLAST reports and a set of supporting functions. It is centered on a ``get_token'' function that scans the input for specified lexical elements and, when one is found, returns its code and value to the user.

It is intended to be fast: the scanner is generated with flex, which provides simple regular expression matching and input buffering in the generated C code. It is also intended to be simple: it provides only a lexical scanner, with no features whose support could slow down its main function.

FUNCTIONS

zerg_get_token() is the core function of this module. Each time it is called, it scans the input BLAST report for the next ``interesting'' lexical element and returns its code and value. Codes are listed in the section ``EXPORTED CONSTANTS (TOKEN CODES)''. Code zero (not listed) means end of file.

  ($code, $value) = Zerg::zerg_get_token();

zerg_open_file($filename) opens $filename in read-only mode and sets it as the input to the scanner. If this function is not called, the standard input is used.

  Zerg::zerg_open_file($filename);

zerg_close_file() closes the file opened with zerg_open_file().
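The following is a minimal sketch of the open, scan, close sequence. It counts how many times each token code occurs in one report file. The command-line handling and the %count hash are illustrative only. The readability check uses Perl's own file test, because the return value of zerg_open_file() is not described in this manpage.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Zerg;

# Illustrative usage: perl count_tokens.pl report.blast
my $filename = shift @ARGV
    or die "usage: $0 blast_report\n";

# Perl's own file test; the return value of zerg_open_file()
# is not documented here, so this sketch does not rely on it.
-r $filename or die "cannot read $filename\n";

Zerg::zerg_open_file($filename);

my %count;    # token code => number of occurrences
while (1) {
    my ($code, $value) = Zerg::zerg_get_token();
    last unless $code;    # code zero means end of file
    $count{$code}++;
}

Zerg::zerg_close_file();

# Assumes token codes are integers, as the "code zero" convention suggests.
print "$_\t$count{$_}\n" for sort { $a <=> $b } keys %count;
```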

zerg_get_token_offset() returns the byte offset (relative to the beginning of file) of the last token read. (See section BUGS).
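As a sketch of one possible use, the program below prints each query name together with the byte offset of its token, which could serve as a simple index into a file holding many reports. It calls zerg_get_token_offset() immediately after zerg_get_token(), since the offset refers to the last token read. The command-line handling is illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Zerg;

my $filename = shift @ARGV
    or die "usage: $0 blast_report\n";

Zerg::zerg_open_file($filename);

# Only QUERY_NAME tokens are needed here.
Zerg::zerg_ignore_all();
Zerg::zerg_unignore(QUERY_NAME);

while (1) {
    my ($code, $value) = Zerg::zerg_get_token();
    last unless $code;    # code zero means end of file

    # Offset of the token just returned, in bytes from the start of the file.
    my $offset = Zerg::zerg_get_token_offset();
    print "$value\t$offset\n";
}

Zerg::zerg_close_file();
```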

zerg_ignore($code) instructs zerg_get_token not to return when it finds a token with code $code.

zerg_ignore_all() does zerg_ignore on all token codes.

zerg_unignore($code) instructs zerg_get_token to return when it finds a token with code $code.

zerg_unignore_all() does zerg_unignore on all token codes.

  Example:
  Zerg::zerg_ignore_all();
  Zerg::zerg_unignore(QUERY_NAME);
  Zerg::zerg_unignore(SUBJECT_NAME);

EXPORTED CONSTANTS (TOKEN CODES)

    ALIGNMENT_LENGTH    
    BLAST_VERSION               
    CONVERGED           
    DATABASE            
    DESCRIPTION_ANNOTATION      
    DESCRIPTION_EVALUE  
    DESCRIPTION_HITNAME         
    DESCRIPTION_SCORE   
    END_OF_REPORT               
    EVALUE                      
    GAPS                        
    HSP_METHOD          
    IDENTITIES          
    NOHITS                      
    PERCENT_IDENTITIES  
    PERCENT_POSITIVES   
    POSITIVES           
    QUERY_ALI           
    QUERY_ANNOTATION    
    QUERY_END           
    QUERY_FRAME                 
    QUERY_LENGTH                
    QUERY_NAME          
    QUERY_ORIENTATION   
    QUERY_START                 
    REFERENCE           
    ROUND_NUMBER                
    ROUND_SEQ_FOUND     
    ROUND_SEQ_NEW               
    SCORE                       
    SCORE_BITS          
    SEARCHING           
    SUBJECT_ALI                 
    SUBJECT_ANNOTATION  
    SUBJECT_END                 
    SUBJECT_FRAME               
    SUBJECT_LENGTH              
    SUBJECT_NAME                
    SUBJECT_ORIENTATION         
    SUBJECT_START               
    TAIL_OF_REPORT              
    UNMATCHED
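
Token codes are returned as numbers, so output is easier to read if codes are mapped back to their names. The sketch below builds such a table by hand for a few codes; the %name_of hash is illustrative and is not part of the module. Note the parentheses after each constant: without them, the => operator would quote the name as a string instead of calling the constant.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Zerg;

# Reverse lookup for a few codes; extend with further constants as needed.
# The parentheses stop "=>" from treating the constant name as a string.
my %name_of = (
    QUERY_NAME()   => 'QUERY_NAME',
    SUBJECT_NAME() => 'SUBJECT_NAME',
    EVALUE()       => 'EVALUE',
    UNMATCHED()    => 'UNMATCHED',
);

# Reads the report from standard input, since no file is opened.
while (1) {
    my ($code, $value) = Zerg::zerg_get_token();
    last unless $code;    # code zero means end of file
    my $name = exists $name_of{$code} ? $name_of{$code} : "code $code";
    print "$name\t$value\n";
}
```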

NOTES ON THE SCANNER

Some BLAST parsers rely on simple regular expression matches to determine token types and values. For example, an input line matching /^Query=\s(\S+)/ would lead such a ``loose'' parser to infer that a token was found, that it is a query name, and that its value is $1. Although improbable, it is perfectly possible for an annotation field to match /^Query=\s(\S+)/. Worse, those parsers are often unable to detect corrupt or truncated BLAST reports, possibly producing inaccurate information.

The scanner provided by this library is much more stringent: for a token to match, it must be in its proper place in the context of a BLAST report. For example, in a single BLAST report, a QUERY_NAME cannot follow another QUERY_NAME. The scanner can be thought of as, and in fact is, a big regular expression that matches an entire BLAST report.

A special token code (UNMATCHED) is provided for cases in which the input text does not match any other lexical rule of the scanner. When an unmatched character is found, either the report is corrupt or the scanner has a bug.

If you are interested in only a few token codes, try to zerg_ignore() as many codes as you can. This avoids unnecessary function calls that consume a lot of CPU time.
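
As an illustration of working with only two token codes, the sketch below groups subject names under the query they belong to. The helper subjects_by_query() is hypothetical and not part of the module. It takes the token source as a code reference, so the same loop can be fed from zerg_get_token() or from any other source of ($code, $value) pairs. Passing \&Zerg::zerg_get_token as that reference is an assumption about how the module defines its functions.

```perl
use strict;
use warnings;

# Hypothetical helper: collect subject names under their query name.
# $next_token is a code reference that returns ($code, $value);
# $query_code and $subject_code are the two token codes of interest.
sub subjects_by_query {
    my ($next_token, $query_code, $subject_code) = @_;
    my (%subjects, $query);
    while (1) {
        my ($code, $value) = $next_token->();
        last unless $code;    # code zero means end of file
        if ($code == $query_code) {
            $query = $value;
            $subjects{$query} ||= [];    # keep queries that have no hits
        }
        elsif ($code == $subject_code && defined $query) {
            push @{ $subjects{$query} }, $value;
        }
    }
    return \%subjects;
}

# Intended use with the module loaded (assumes Zerg is installed):
#   use Zerg;
#   Zerg::zerg_ignore_all();
#   Zerg::zerg_unignore(QUERY_NAME);
#   Zerg::zerg_unignore(SUBJECT_NAME);
#   my $hits = subjects_by_query(\&Zerg::zerg_get_token,
#                                QUERY_NAME, SUBJECT_NAME);
```

With all other codes ignored, the loop body runs only for query and subject names, which is the saving the paragraph above describes.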

EXAMPLES

This program prints the code and the value of each token it finds.

  #!/usr/bin/perl -w
  use strict;
  use Zerg;
  my ($code, $value);
  while((($code, $value)= Zerg::zerg_get_token()) && $code)
  {
      print "$code\t$value\n";
  }

The program below is a ``syntax checker''. The presence of UNMATCHEDs is a strong indicator of problems in the BLAST report. (See section NOTES ON THE SCANNER)

  #!/usr/bin/perl -w
  use strict;
  use Zerg;
  my ($code, $value);
  Zerg::zerg_ignore_all();
  Zerg::zerg_unignore(UNMATCHED);
  while((($code, $value)= Zerg::zerg_get_token()) && $code)
  {
      print "UNMATCHED CHAR:\t$value\n";
  }

BUGS

The tokens DESCRIPTION_ANNOTATION, DESCRIPTION_SCORE and DESCRIPTION_EVALUE are scanned all at once and released one by one on user request. So, if the user wants to get any of these fields, they must be unignored BEFORE scanning DESCRIPTION_ANNOTATION.
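
A safe pattern, sketched below, is to select the description fields before the first call to zerg_get_token(), so that they are already unignored when a description line is scanned. The sketch assumes that DESCRIPTION_EVALUE is the last of the description tokens released for each line; that ordering is an assumption based on the order in which the tokens are listed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Zerg;

# Select the description fields before scanning starts.
Zerg::zerg_ignore_all();
Zerg::zerg_unignore(DESCRIPTION_HITNAME);
Zerg::zerg_unignore(DESCRIPTION_ANNOTATION);
Zerg::zerg_unignore(DESCRIPTION_SCORE);
Zerg::zerg_unignore(DESCRIPTION_EVALUE);

my @row;    # values collected for the current description line
while (1) {
    my ($code, $value) = Zerg::zerg_get_token();
    last unless $code;    # code zero means end of file
    push @row, $value;

    # Assumption: DESCRIPTION_EVALUE closes each description line.
    if ($code == DESCRIPTION_EVALUE) {
        print join("\t", @row), "\n";
        @row = ();
    }
}
```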

zerg_get_token_offset() may return incorrect values for these tokens and those that are modified by the parser, namely: QUERY_LENGTH, SUBJECT_LENGTH, EVALUE, GAPS.

TODO

Add more tokens to the scanner as the need arises.

AUTHOR

Apuã Paquola, IQ-USP Bioinformatics Lab, [email protected]

Laszlo Kajan <[email protected]>, Technical University of Munich, Germany