cmscore(1) align and score one or more sequences to a CM

SYNOPSIS

cmscore [options] cmfile seqfile

DESCRIPTION

cmscore uses the covariance model (CM) in cmfile to align and score the sequences in seqfile, and output summary statistics on timings and scores. cmscore is a testbed for new CM alignment algorithms, and it is also used by the testsuite. It is not intended to be particularly useful in the real world. Documentation is provided for completeness, and to aid our own memories.

CMs are profiles of RNA consensus sequence and secondary structure. A CM file is produced by the cmbuild program, from a given RNA sequence alignment of known consensus structure.

cmscore aligns the sequence(s) in seqfile using two alignment algorithms and compares the scores and timings of each algorithm. By default the two algorithms compared are the divide and conquer (D&C) CYK algorithm (SR Eddy, BMC Bioinformatics 3:18, 2002), and the HMM banded standard CYK algorithm. When the --nonbanded option is enabled D&C CYK is compared with the standard CYK alignment algorithm. In this case, because both algorithms should find the optimal alignment the parsetree scores should be nearly identical (within 0.01 bits), if this is not the case for any sequence cmscore exits and prints an error message. When the --viterbi option is enabled D&C CYK is compared with Viterbi alignment to a CM Plan 9 HMM constructed to be maximally similar to the CM.

While non-banded CYK variants are guaranteed to find the optimal alignment and score of each sequence, HMM banded CYK sacrifices this guarantee for acceleration. The level of acceleration can be controlled by the tau parameter, which is set with the --tau <x> option. This is described in more detail in the man page for cmalign, but in short, <x> is a rough estimate at the probability that the optimal alignment will be missed. The greater <x> is, the greater the acceleration, but the greater the chance of missing the optimal alignment. By default tau is set as 1E-7. cmscore is useful for testing for values of tau that give the best trade-off between acceleration versus accuracy. To make this testing easier, multiple tau values can be tested within a single cmscore call. The --taus <x> and --taue <x> option combination allow the user to specify a beginning tau value and an ending tau value. For example, --taus 3 and and --taue 5 would first align the sequences in seqfile with non-banded D&C CYK, and then perform 3 additional HMM banded alignments, first with tau=1E-3, next with tau=1E-4 and finally with tau=1E-5. Currently, only values of 1E-<x> can be used. Summary statistics on timings and how often the optimal alignment is missed for each value or tau are printed to stdout.

When comparing the non-banded standard CYK and D&C CYK algorithms with the --nonbanded option, the two parse trees should usually be identical for any sequence, because the optimal alignment score is guaranteed. However, there can be cases of ties, where two or more different parse trees have identical scores. In such cases, it is possible for the two parse trees to differ. The parse tree selected as "optimal" from amongst the ties is arbitrary, dependent on order of evaluation in the DP traceback, and the order of evaluation for D&C vs. standard CYK is different. Thus, in its testsuite role, cmscore checks that the scores are within 0.01 bits of each other, but does not check that the parse trees are absolutely identical.

The alignment algorithms can be run in "search" mode within cmscore by using the --search option. When --search is enabled, --inside specifies that the Inside algorithm be used instead of CYK and --forward specifies that the HMM Forward algorithm be used instead of CYK.

The sequences are treated as single stranded RNAs; that is, only the given strand of each sequence is aligned and scored, and no reverse complementing is done.

OPTIONS

-h
Print brief help; includes version number and summary of all options, including expert options.

-n <n>
Set the number of sequences to generate and align to <n>. This option is incompatible with the --infile option.

-l
Turn on the local alignment algorithm, which allows the alignment to span two or more subsequences if necessary (e.g. if the structures of the query model and target sequence are only partially shared), allowing certain large insertions and deletions in the structure to be penalized differently than normal indels. The default is to globally align the query model to the target sequences.

-s <n>
Set the random seed to <n>, where <n> is a positive integer. The default is to use time() to generate a different seed for each run, which means that two different runs of cmscore on the same CM will give different results. You can use this option to generate reproducible results. The random number generator is used to generate sequences to score, so -s is incompatible with the --infile option which supplies the sequences to score in an input file.

-a
Print individual timings and score comparisons for each sequence in seqfile. By default only summary statistics are printed.

--sub
Turn on the sub model construction and alignment procedure. For each sequence, an HMM is first used to predict the model start and end consensus columns, and a new sub CM is constructed that only models consensus columns from start to end. The sequence is then aligned to this sub CM. This option is useful for aligning sequences that are known to truncated, non-full length sequences. This "sub CM" procedure is not the same as the "sub CMs" described by Weinberg and Ruzzo. When used in combination with --tfile the parsetree printed is not the sub CM parsetree, but rather the sub CM parstree mapped onto the topology of the original CM. This mapped parsetree will likely have a different score (sometimes much worse) than the sub CM parsetree, both of those scores are printed to the parsetree file for each sequence.

--mxsize <x>
Set the maximum allowable DP matrix size to <x> megabytes. By default this size is 2048 Mb. This should be large enough for the most alignments, however if it is not cmscore will exit prematurely and report an error message that the matrix exceeded it's maximum allowable size. In this case, the --mxsize can be used to raise the limit, or if --nonbanded is enabled, the --scoreonly option will solve the memory issue. This memory error is most likely to occur when the --nonbanded option is used without the --scoreonly option, but can still occur when --nonbanded is not used.

--devhelp
Print help, as with -h, but also include undocumented developer options. These options are not listed below, are under development or experimental, and are not guaranteed to even work correctly. Use developer options at your own risk. The only resources for understanding what they actually do are the brief one-line description printed when --devhelp is enabled, and the source code.

--mpi
Run as an MPI parallel program. This option will only be available if Infernal has been configured and built with the --enable-mpi flag (see User's Guide for details).

EXPERT OPTIONS

--emit
Generate sequences to score by sampling from the CM. This option is on by default. The number of sequences generated is 10 by default but can be changed with the -n option. The sequences generated from the CM can be saved to an output file in FASTA format with the --outfile option.

--random
Generate sequences to score by sampling from the CMs null distribution. This option turns the --emit option off. By default the CM distribution will be 0.25 for each of the four RNA nucleotides, but it may be different if the --null option was used when cmbuild created the cmfile. By default, the length of the sequences generated is sampled from the length distribution of the CM. The average length of the random sequences will be the consensus length of the RNA family modelled by the CM (or very close to it). Alternatively, the --Lmin <n> and --Lmax <n> options can be used to specify a length distribution. The number of sequences generated is 10 by default but can be changed with the -n option. The random sequences generated can be saved to an output file in FASTA format with the --outfile option.

--infile <f>
Sequences to score are read from the file <f>. All the sequences from <f> are read and scored, the -n and -s options are incompatible with --infile.

--outfile <f>
Save generated sequences that are scored to the file <f> in FASTA format. This option is incompatible with the --infile option.

--Lmin <n1>
Must be used in combination with --random and --Lmax <n2>. The lengths of the random sequences generated and scored will be uniform between the range of <n1>..<n2>.

--Lmax <n2>
Must be used in combination with --random and --Lmin <n1>. The lengths of the random sequences generated and scored will be uniform between the range of <n1>..<n2>.

--pad
Must be used in combination with --emit and --search. Add <n> cm->W (max hit length) minus L (sequence <x> length) residues to the 5' and 3' end of each emitted sequence <x>.

--hbanded
Specify that the second stage alignment algorithm be HMM banded CYK. This option is on by default. For more information on this option, see the description of the --hbanded option in the man page for cmalign.

--tau <x>
For stage 2 alignment, set the tail loss probability used during HMM band calculation to <x>. This is the amount of probability mass within the HMM posterior probabilities that is considered negligible. The default value is 1E-7. In general, higher values will result in greater acceleration, but increase the chance of missing the optimal alignment due to the HMM bands.

--aln2bands
With --search, when calculating HMM bands, use an HMM alignment algorithm instead of an HMM search algorithm. In general, using this option will result in greater acceleration, but will increase the chance of missing the optimal alignment.

--hsafe
For stage 2 HMM banded alignment, realign any sequences with a negative alignment score using non-banded CYK to guarantee finding the optimal alignment.

--nonbanded
Specify that the second stage alignment algorithm be standard, non-banded, non-D&C CYK. When --nonbanded is enabled, the program fails with a non-zero exit code and prints an error message if the parsetree score for any sequence from stage 1 D&C alignment and stage 2 alignment differs by more than 0.01 bits. In theory, this should never happen as both algorithms are guaranteed to determine the optimal parsetree. For larger RNAs (more than 300 residues) if memory is limiting, --nonbanded should be used in combination with --scoreonly.

--scoreonly
With --nonbanded during the second stage standard non-banded CYK alignment, use the "score only" variant of the algorithm to save memory, and don't recover a parse tree.

--viterbi
Specify that the second stage alignment algorithm be Viterbi to a CM Plan 9 HMM.
--search
Run all algorithms in scanning mode, not alignment mode. This means the highest scoring subsequence within each sequence is returned as the score, not necessarily the score of an alignment of the full sequence.

--inside
With --search Compare the non-banded scanning Inside algorithm to the HMM banded scanning Inside algorith, instead of using CYK versions.

--forward
With --search Compare the scanning Forward scoring algorithm against CYK.

--taus <n>
Specify the first alignment algorithm as non-banded D&C CYK, and multiple stages of HMM banded CYK alignment. The first HMM banded alignment will use tau=1E-<x>, which will be the highest value of tau used. Must be used in combination with --taue.

--taue <n>
Specify the first alignment algorithm as non-banded D&C CYK, and multiple stages of HMM banded CYK alignment. The final HMM banded alignment will use tau=1E-<x>, which will be the lowest value of tau used. Must be used in combination with --taus.

--tfile <f>
Print the parsetrees for each alignment of each sequence to file <f>.

COPYRIGHT

Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).
See the file COPYING that came with the source for details on redistribution conditions.

AUTHOR

Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/