sim4db(1) batch spliced alignment of cDNA sequences to a target genome


A simple command line invocation:

sim4db -genomic g.fasta -cdna c.fasta -script script -output o.sim4db

   - 'c.fasta' and 'g.fasta' are the multi-fasta cDNA and genome sequence files
   - 'script' is a script file indicating individual alignments to be computed
   - output in sim4db format will be sent to the file 'o.sim4db' ('-' for standard output)

A more complex invocation:

sim4db -genomic g.fasta -cdna c.fasta -output o.sim4db [options]
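When sim4db is driven from a pipeline, the two invocations above can be assembled programmatically. The following is a minimal Python sketch, not part of sim4db itself: the helper name `build_sim4db_command` is hypothetical, the threshold values are purely illustrative, and only options listed in this document are used.

```python
import shlex


def build_sim4db_command(genomic, cdna, output, script=None, extra=()):
    """Assemble a sim4db argument list (hypothetical helper, for illustration)."""
    cmd = ["sim4db", "-genomic", genomic, "-cdna", cdna]
    if script is not None:
        # A script file restricts the run to the listed alignments.
        cmd += ["-script", script]
    cmd += ["-output", output]
    # Any further documented options are appended unchanged.
    cmd += list(extra)
    return cmd


# The simple invocation, driven by a script file.
simple = build_sim4db_command("g.fasta", "c.fasta", "o.sim4db", script="script")

# A more complex invocation: output to standard output ('-'), explicit
# filters (illustrative values only) and GFF3 output.
filtered = build_sim4db_command(
    "g.fasta", "c.fasta", "-",
    extra=["-mincoverage", "50", "-minidentity", "95",
           "-minlength", "100", "-gff3"],
)

print(shlex.join(simple))
print(shlex.join(filtered))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where sim4db is installed; keeping the arguments as a list avoids shell quoting problems with file names.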


sim4db performs fast batch alignment of large cDNA (EST, mRNA) sequence sets to a set of eukaryotic genomic regions. It uses the sim4 and sim4cc algorithms to compute the alignments, and adds a fast sequence indexing and retrieval mechanism, implemented in the sister package leaff(1), so that large volumes of sequences can be processed quickly.

Although sim4db produces alignments in the same way as sim4 or sim4cc, it has additional features that make it better suited to whole-genome annotation pipelines. A script file can group pairings between cDNAs and their corresponding genomic regions, so that they are aligned in one run with the same set of parameters. sim4db can also report more than one alignment for the same cDNA within a genomic region, as long as each alignment meets user-defined criteria such as minimum length, percentage sequence identity or coverage. This feature is instrumental in finding all alignments of a gene family at one locus. Lastly, the output is presented either as custom sim4db alignments or as GFF3 gene features.


Salient options:
       -cdna         use these cDNA sequences (multi-fasta file)
       -genomic      use these genomic sequences (multi-fasta file)
       -script       use this script file
       -pairwise     sequentially align pairs of sequences

                     If neither '-script' nor '-pairwise' is specified,
                     sim4db performs all-against-all alignments between
                     pairs of cDNA and genomic sequences.

       -output       write output to this file
       -gff3         report output in GFF3 format
       -interspecies use sim4cc for inter-species alignments (default sim4)
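To make the difference between the default all-against-all mode and '-pairwise' concrete, here is a small Python sketch that lists which (cDNA, genomic) index pairs would be aligned. It rests on an assumption not spelled out above: that '-pairwise' matches sequences by their position in the two files (first with first, second with second, and so on). The behaviour for files of unequal length is not documented here, so the sketch simply stops at the shorter file. The function name `planned_pairs` is hypothetical.

```python
from itertools import product


def planned_pairs(n_cdna, n_genomic, pairwise=False):
    """Return the (cdna_index, genomic_index) pairs a run would cover."""
    if pairwise:
        # Assumption: '-pairwise' pairs sequences by position in each file.
        return [(i, i) for i in range(min(n_cdna, n_genomic))]
    # Default (no '-script', no '-pairwise'): every cDNA against every
    # genomic sequence.
    return list(product(range(n_cdna), range(n_genomic)))


all_pairs = planned_pairs(2, 3)
lockstep = planned_pairs(2, 3, pairwise=True)
print(len(all_pairs), "alignments in all-against-all mode")
print(len(lockstep), "alignments in pairwise mode")
```

The all-against-all count grows as the product of the two file sizes, which is why a script file or '-pairwise' is usually preferable for large inputs.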

Filter options:
       -mincoverage  iteratively find all exon models with the specified
                     minimum PERCENT COVERAGE
       -minidentity  iteratively find all exon models with the specified
                     minimum PERCENT EXON IDENTITY
       -minlength    iteratively find all exon models with the specified
                     minimum ABSOLUTE COVERAGE (number of bp matched)
                     (default 0)
       -alwaysreport always report <number> exon models, even if they
                     are below the quality thresholds

         If none of mincoverage, minidentity or minlength is given, only
         the best exon model is returned. This is the DEFAULT operation.

         You will probably want to specify ALL THREE of mincoverage,
         minidentity and minlength!  Don't assume the default values
         are what you want!

         You will DEFINITELY want to specify at least one of mincoverage,
         minidentity and minlength with alwaysreport!  If you don't,
         mincoverage will be set to 90 and minidentity to 95 -- to reduce
         the number of spurious matches when a good match is found.
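The interaction between the three thresholds and alwaysreport can be summarised in a few lines of Python. This is only a sketch of the rules stated above, not sim4db's actual code: the function name `effective_filters` and the "mode" labels are hypothetical, and a value of None means "left at the program default, which this document does not state".

```python
def effective_filters(mincoverage=None, minidentity=None,
                      minlength=None, alwaysreport=None):
    """Resolve the filter settings as described in the text above."""
    explicit = any(v is not None
                   for v in (mincoverage, minidentity, minlength))

    if alwaysreport is not None and not explicit:
        # Documented fallback: alwaysreport without any threshold sets
        # mincoverage to 90 and minidentity to 95.
        mincoverage, minidentity = 90, 95

    if not explicit and alwaysreport is None:
        mode = "best-only"   # DEFAULT operation: only the best exon model
    else:
        mode = "iterative"   # find all exon models meeting the thresholds

    return {
        "mode": mode,
        "mincoverage": mincoverage,
        "minidentity": minidentity,
        # minlength is documented with a default of 0.
        "minlength": 0 if minlength is None else minlength,
        "alwaysreport": alwaysreport,
    }


print(effective_filters())
print(effective_filters(alwaysreport=2))
print(effective_filters(minidentity=98))
```

The last call shows why specifying all three thresholds is advisable: with only minidentity given, the coverage threshold stays at whatever the program default is.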

Auxiliary options:
       -nodeflines   don't include the defline in the sim4db output
       -alignments   print alignments

       -polytails    DON'T mask poly-A and poly-T tails
       -cut          trim marginal exons if A/T % > x (poly-AT tails)

       -noncanonical don't force canonical splice sites
       -splicemodel  use the following splice model: 0 - original sim4;
                     1 - GeneSplicer; 2 - Glimmer;  options 1 and 2 are
                     only available with '-interspecies'.
                     Default for sim4 is 0, and for sim4cc is 1.
       -forcestrand  force the strand prediction to always be
                     one of 'forward' or 'reverse'

Execution options:
       -threads      use <n> threads
       -touch        create this file when the program finishes execution

Debugging options:
       -v            print status to stderr while running
       -V            print script lines (stderr) as they are being processed

Developer options:
       -Z            set the spaced seed pattern
       -H            set the relink weight factor (H=1000 recommended for mRNAs)
       -K            set the first MSP threshold
       -C            set the second MSP threshold
       -Ma           set the limit of the number of MSPs allowed
       -Mp           same limit, as a percentage of the bases in the cDNA
                     NOTE:  If used, both -Ma and -Mp must be specified!
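Two of the constraints listed above are easy to violate when command lines are generated automatically: '-Ma' and '-Mp' must be given together, and splice models 1 and 2 are only available with '-interspecies'. The Python sketch below checks an argument list for both before a run is launched. The function name `check_options` is hypothetical, and the check mirrors only what this document states; sim4db itself may report such problems differently.

```python
def check_options(args):
    """Return a list of problems found in a sim4db argument list."""
    problems = []

    # -Ma and -Mp must be specified together.
    if ("-Ma" in args) != ("-Mp" in args):
        problems.append("-Ma and -Mp must be specified together")

    # Splice models 1 (GeneSplicer) and 2 (Glimmer) need -interspecies.
    if "-splicemodel" in args:
        idx = args.index("-splicemodel")
        model = args[idx + 1] if idx + 1 < len(args) else None
        if model in ("1", "2") and "-interspecies" not in args:
            problems.append("-splicemodel 1 and 2 require -interspecies")

    return problems


# The numeric values below are placeholders, not recommended settings.
print(check_options(["-Ma", "500"]))
print(check_options(["-Ma", "500", "-Mp", "10"]))
print(check_options(["-splicemodel", "2"]))
print(check_options(["-interspecies", "-splicemodel", "2"]))
```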