SYNOPSIS
A simple command line invocation:
sim4db -genomic g.fasta -cdna c.fasta -scr script -output o.sim4db
where:
- 'c.fasta' and 'g.fasta' are the multi-fasta cDNA and genome sequence files
- 'script' is a script file indicating individual alignments to be computed
- output in sim4db format will be sent to the file 'o.sim4db' ('-' for standard output)
A more complex invocation:
sim4db -genomic g.fasta -cdna c.fasta -output o.sim4db [options]
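When sim4db is driven from an annotation pipeline, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of sim4db: the helper name build_sim4db_command and its parameters are hypothetical, and it uses only the flags shown in the invocations above.

```python
import shlex


def build_sim4db_command(genomic, cdna, output, script=None, extra=()):
    """Assemble a sim4db argument list (hypothetical helper, not part of sim4db).

    Only flags that appear in the SYNOPSIS are emitted here; anything else
    must be passed verbatim through `extra`.
    """
    cmd = ["sim4db", "-genomic", genomic, "-cdna", cdna]
    if script is not None:
        # Same spelling as the simple invocation in the SYNOPSIS.
        cmd += ["-scr", script]
    cmd += ["-output", output]
    cmd += list(extra)
    return cmd


simple = build_sim4db_command("g.fasta", "c.fasta", "o.sim4db", script="script")
# Print the command as a single shell-quoted string for inspection.
print(shlex.join(simple))
```

The returned list can be handed to subprocess.run(cmd, check=True) on a machine where sim4db is installed; printing it first gives a dry run.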
DESCRIPTION
sim4db performs fast batch alignment of large cDNA (EST, mRNA) sequence sets to a set of eukaryotic genomic regions. It uses the sim4 and sim4cc algorithms to determine the alignments, but incorporates a fast sequence indexing and retrieval mechanism, implemented in the sister package leaff(1), to speedily process large volumes of sequences.
While sim4db produces alignments in the same way as sim4 or sim4cc, it has additional features to make it more amenable for use with whole-genome annotation pipelines. A script file can be used to group pairings between cDNAs and their corresponding genomic regions, to be aligned as one run and using the same set of parameters. sim4db also optionally reports more than one alignment for the same cDNA within a genomic region, as long as they meet user-defined criteria such as minimum length, percentage sequence identity or coverage. This feature is instrumental in finding all alignments of a gene family at one locus. Lastly, the output is presented either as custom sim4db alignments or as GFF3 gene features.
OPTIONS
Salient options:
-cdna use these cDNA sequences (multi-fasta file)
-genomic use these genomic sequences (multi-fasta file)
-script use this script file
-pairwise sequentially align pairs of sequences
If neither '-script' nor '-pairwise' is specified,
sim4db performs all-against-all alignments between
pairs of cDNA and genomic sequences.
-output write output to this file
-gff3 report output in GFF3 format
-interspecies use sim4cc for inter-species alignments (default sim4)
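The difference between the default all-against-all mode and '-pairwise' can be illustrated by listing the (cDNA, genomic) index pairs each mode would align. This is only a sketch: the function planned_pairs is hypothetical, and the reading of '-pairwise' as "i-th cDNA against i-th genomic sequence" is an assumption drawn from the word "sequentially", not something this text spells out.

```python
from itertools import product


def planned_pairs(n_cdna, n_genomic, pairwise=False):
    """Return (cdna_index, genomic_index) pairs to align (illustrative only)."""
    if pairwise:
        # ASSUMPTION: '-pairwise' matches sequences by position, first with
        # first, second with second, and so on.
        return list(zip(range(n_cdna), range(n_genomic)))
    # Default documented behaviour: every cDNA against every genomic sequence.
    return list(product(range(n_cdna), range(n_genomic)))


print(planned_pairs(2, 3))                 # all-against-all: 6 pairs
print(planned_pairs(2, 2, pairwise=True))  # positional pairing: 2 pairs
```

The all-against-all count grows as the product of the two file sizes, which is why a script file or '-pairwise' is usually preferable for large inputs.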
Filter options:
-mincoverage iteratively find all exon models with the specified
minimum PERCENT COVERAGE
-minidentity iteratively find all exon models with the specified
minimum PERCENT EXON IDENTITY
-minlength iteratively find all exon models with the specified
minimum ABSOLUTE COVERAGE (number of bp matched)
(default 0)
-alwaysreport always report <number> exon models, even if they
are below the quality thresholds
If none of mincoverage, minidentity or minlength is given, only
the best exon model is returned. This is the DEFAULT operation.
You will probably want to specify ALL THREE of mincoverage,
minidentity and minlength! Don't assume the default values
are what you want!
You will DEFINITELY want to specify at least one of mincoverage,
minidentity and minlength with alwaysreport! If you don't,
mincoverage will be set to 90 and minidentity to 95 -- to reduce
the number of spurious matches when a good match is found.
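The interaction between the filter options and alwaysreport described above can be summarised as a small decision function. This is a sketch of the documented rules only, not sim4db's actual code: the name resolve_filters, the returned dictionary and the "best-only" / "iterative" labels are all hypothetical.

```python
from typing import Dict, Optional


def resolve_filters(mincoverage: Optional[int] = None,
                    minidentity: Optional[int] = None,
                    minlength: Optional[int] = None,
                    alwaysreport: Optional[int] = None) -> Dict[str, object]:
    """Mirror the documented filter rules (illustrative, not the real parser)."""
    no_thresholds = (mincoverage is None and minidentity is None
                     and minlength is None)

    if no_thresholds and alwaysreport is None:
        # Documented default: only the best exon model is returned.
        mode = "best-only"
    else:
        mode = "iterative"
        if no_thresholds:
            # Documented fallback when alwaysreport is used on its own.
            mincoverage, minidentity = 90, 95

    return {
        "mode": mode,
        "mincoverage": mincoverage,
        "minidentity": minidentity,
        "minlength": minlength,  # None here means "not given on the command line"
        "alwaysreport": alwaysreport,
    }


print(resolve_filters())
print(resolve_filters(alwaysreport=2))
print(resolve_filters(mincoverage=50, minidentity=90, minlength=100))
```

Note that the 90/95 fallback applies only when alwaysreport is given without any threshold; supplying even one threshold leaves the others at whatever sim4db uses by default, which is why specifying all three is recommended.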
Auxiliary options:
-nodeflines don't include the defline in the sim4db output
-alignments print alignments
-polytails DON'T mask poly-A and poly-T tails
-cut trim marginal exons if A/T % > x (poly-AT tails)
-noncanonical don't force canonical splice sites
-splicemodel use the following splice model: 0 - original sim4;
1 - GeneSplicer; 2 - Glimmer; options 1 and 2 are
only available with '-interspecies'.
Default for sim4 is 0, and for sim4cc is 1.
-forcestrand force the strand prediction to always be
one of 'forward' or 'reverse'
Execution options:
-threads use n threads
-touch create this file when the program finishes execution
Debugging options:
-v print status to stderr while running
-V print script lines (stderr) as they are being processed
Developer options:
-Z set the spaced seed pattern
-H set the relink weight factor (H=1000 recommended for mRNAs)
-K set the first MSP threshold
-C set the second MSP threshold
-Ma set the limit of the number of MSPs allowed
-Mp same, as percentage of bases in cDNA
NOTE: If used, both -Ma and -Mp must be specified!
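A wrapper script can enforce the rule that -Ma and -Mp travel together before sim4db is ever launched. The sketch below is hypothetical (the function msp_limit_args is not part of sim4db), and the numeric values in the usage line are arbitrary placeholders, not recommended settings.

```python
def msp_limit_args(ma=None, mp=None):
    """Return the -Ma/-Mp arguments, rejecting a half-specified pair."""
    if (ma is None) != (mp is None):
        # Exactly one of the two was supplied, which the NOTE above forbids.
        raise ValueError("-Ma and -Mp must be specified together")
    if ma is None:
        # Neither option requested: add nothing to the command line.
        return []
    return ["-Ma", str(ma), "-Mp", str(mp)]


# Placeholder values for illustration only.
print(msp_limit_args(500, 10))
```

Failing early in the wrapper gives a clearer error than letting a batch run start with only one of the two limits set.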