daligner(1) long read aligner

SYNOPSIS

daligner [-vbAI][-kint(14)] [-wint(6)] [-hint(35)] [-tint] [-Mint] [-edouble(.70)] [-lint(1000)] [-sint(100)] [-Hint] [-mtrack]+ subject:db|dam target:db|dam ...

DESCRIPTION

Compare sequences in the trimmed subject block against those in the list of target blocks searching for local alignments involving at least -l base pairs (default 1000) or more, that have an average correlation rate of -e (default 70%). The local alignments found will be output in a sparse encoding where a trace point on the alignment is recorded every -s base pairs of the a-read (default 100bp). Reads are compared in both orientations and local alignments meeting the criteria are output to one of several created files described below. The -v option turns on a verbose reporting mode that gives statistics on each major step of the computation.

The options -k, -h, and -w control the initial filtration search for possible matches between reads. Specifically, our search code looks for a pair of diagonal bands of width 2^w (default 2^6 = 64) that contain a collection of exact matching k-mers (default 14) between the two reads, such that the total number of bases covered by the k-mer hits is h (default 35). k cannot be larger than 32 in the current implementation. If the -b option is set, then the daligner assumes the data has a strong compositional bias (e.g. >65% AT rich), and at the cost of a bit more time, dynamically adjusts k-mer sizes depending on compositional bias, so that the mers used have an effective specificity of 4^k.

If there are one or more interval tracks specified with the -m option, then the reads of the DB or DB's to which the mask applies are soft masked with the union of the intervals of all the interval tracks that apply, that is any k-mers that contain any bases in any of the masked intervals are ignored for the purposes of seeding a match. An interval track is a track, such as the "dust" track created by DBdust, that encodes a set of intervals over either the untrimmed or trimmed DB.

Invariably, some k-mers are significantly over-represented (e.g. homopolymer runs). These k-mers create an excessive number of matching k-mer pairs and left unaddressed would cause daligner to overflow the available physical memory. One way to deal with this is to explicitly set the -t parameter which suppresses the use of any k-mer that occurs more than t times in either the subject or target block. However, a better way to handle the situation is to let the program automatically select a value of t that meets a given memory usage limit specified (in Gb) by the -M parameter. By default daligner will use the amount of physical memory as the choice for -M. If you want to use less, say only 8Gb on a 24Gb HPC cluster node because you want to run 3 daligner jobs on the node, then specify -M8. Specifying -M0 basically indicates that you do not want daligner to self adjust k-mer suppression to fit within a given amount of memory.

For each subject, target pair of blocks, say X and Y, the program reports alignments where the a-read is in X and the b-read is in Y, and vice versa. However, if the -A option is set ("A" for "asymmetric") then just overlaps where the a-read is in X and the b-read is in Y are reported, and if X = Y, then it further reports only those overlaps where the a-read index is less than the b-read index. In either case, if the -I option is set ("I" for "identity") then when X = Y, overlaps between different portions of the same read will also be found and reported.

Each found alignment is recorded as -- a[ab,ae] x bo[bb,be] -- where a and b are the indices (in the trimmed DB) of the reads that overlap, o indicates whether the b-read is from the same or opposite strand, and [ab,ae] and [bb,be] are the intervals of a and bo, respectively, that align. The program places these alignment records in files whose name is of the form X.Y.[C|N]#.las where C indicates that the b-reads are complemented and N indicates they are not (both comparisons are performed) and # is the thread that detected and wrote out the collection of alignments contained in the file. That is the file X.Y.O#.las contains the alignments produced by thread # for which the a-read is from X and the b-read is from Y and in orientation O. The command daligner -A X Y produces 2*NTHREAD thread files X.Y.?.las and daligner X Y produces 4*NTHREAD files X.Y.?.las and Y.X.?.las (unless X=Y in which case only NTHREAD files, X.X.?.las, are produced).

By default, daligner compares all overlaps between reads in the database that are greater than the minimum cutoff set when the DB or DBs were split, typically 1 or 2 Kbp. However, the HGAP assembly pipeline only wants to correct large reads, say 8Kbp or over, and so needs only the overlaps where the a-read is one of the large reads. By setting the -H parameter to say N, one alters daligner so that it only reports overlaps where the a-read is over N base-pairs long.

While the default parameter settings are good for raw Pacbio data, daligner can be used for efficiently finding alignments in corrected reads or other less noisy reads. For example, for mapping applications against .dams, we run

daligner -k20 -h60 -e.85

and on corrected reads, we typically run

daligner -k25 -w5 -h60 -e.95 -s500

and at these settings it is very fast.