SYNOPSIS
anfo-tool [ option | pattern ... ]
DESCRIPTION
anfo-tool is used to filter, process and convert the files created by anfo. Every pattern on the command line is wildcard-expanded. For every input file (or the standard input, if no pattern is given), anfo-tool builds a chain of input filters. It then merges these input streams in one of several ways and splits the result into multiple output streams, each of which can have a chain of output filters applied.
OPTIONS
General Options
These options apply globally and modify the behavior of the whole program. They can be placed anywhere in the command line.
- -V, --version
-
Print version number and exit.
- -q, --quiet
-
Suppress all output except fatal error messages.
- -v, --verbose
-
Produce more output, including progress indicators for most operations.
- --debug
-
Produce debugging output in addition to progress information.
- -n, --dry-run
-
Parse command line, optionally print a description of the intended
operations, then exit.
- --vmem X
-
Limit virtual memory to
X
megabytes. If memory runs out,
anfo-tool
tries to free up memory by forgetting about big files, e.g. genomes.
Use this option to avoid swapping or out-of-memory conditions when
operations involve big or multiple genomes.
Setting Parameters
A parameter can be set multiple times on the command line and will overwrite previous settings. Any filter option that needs a parameter picks up the last definition that appeared before the filter option.
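The "last definition wins" rule can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not anfo-tool's actual option parser; the function name resolve_parameters is hypothetical, and only --set-slope, --set-intercept and --filter-score are modelled, with the documented defaults of 7.5 and 20.

```python
def resolve_parameters(args):
    """Pair each --filter-score with the parameter values in effect at that point.

    Hypothetical helper: it models only the ordering rule, not real parsing.
    """
    params = {"slope": 7.5, "intercept": 20.0}  # documented defaults
    resolved = []
    it = iter(args)
    for arg in it:
        if arg == "--set-slope":
            params["slope"] = float(next(it))      # overwrites earlier setting
        elif arg == "--set-intercept":
            params["intercept"] = float(next(it))  # overwrites earlier setting
        elif arg == "--filter-score":
            # The filter sees a snapshot of the most recent definitions.
            resolved.append((arg, dict(params)))
    return resolved


args = ["--set-slope", "5", "--filter-score",
        "--set-slope", "9", "--filter-score"]
for option, params in resolve_parameters(args):
    print(option, params)
```

In this sketch the first --filter-score sees a slope of 5 and the second a slope of 9, because each filter takes a snapshot of the settings that precede it.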
- --set-slope S
-
Set the slope parameter to
S.
The
slope
is used together with the
intercept
where filters apply to alignment scores; alignments scoring no worse
than
slope * (length - intercept)
are considered good. The default is 7.5.
- --set-intercept L
-
Set the intercept parameter to
L.
The
intercept
is used together with the
slope
where filters apply to alignment scores; alignments scoring no worse
than
slope * (length - intercept)
are considered good. The default is 20.
- --set-context C
-
Set the context parameter to
C.
The context is the number of surrounding bases of the reference included
when printing alignments in text form. The default is 0.
- --set-genome G
-
Set the genome parameter to
G.
Many filters will only consider the best alignments to this specific
genome if it is set. If no genome is set, the globally best alignment
is used.
- --clear-genome
-
Clear the genome parameter. Filters apply to the globally best
alignment afterwards.
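As a worked illustration of the slope and intercept parameters, the following Python sketch evaluates the threshold slope * (length - intercept). It assumes that scores are penalty-like, so that a lower score is better and "no worse than" means "less than or equal to"; that direction is an assumption, and the function name is_good is hypothetical.

```python
def is_good(score, length, slope=7.5, intercept=20):
    """Return True if an alignment passes the score threshold.

    Assumption: lower scores are better, so "no worse than the threshold"
    is modelled as score <= slope * (length - intercept).
    """
    return score <= slope * (length - intercept)


# With the defaults, a 60-base read has a threshold of 7.5 * (60 - 20) = 300.
print(is_good(250, 60))
print(is_good(350, 60))
```

Under this reading, the threshold grows linearly with read length, and reads no longer than the intercept can pass only with a score of zero or below.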
Filter Options
Filters can be applied before merging the inputs or after splitting the result back up.
- -s, --sort-pos=n
-
Sort by alignment position while buffering no more than
n
MiB in memory. If a genome is set, alignments to that genome are used.
- -S, --sort-name=n
-
Sort by read name while buffering no more than
n
MiB in memory.
- -l, --filter-length=L
-
Retain alignments only for reads of at least
L
bases length. The reads themselves are kept.
- -f, --filter-score
-
Retain alignments only if their score is good enough.
Uses slope and intercept.
- --filter-mapq=Q
-
Remove alignments with mapping quality below
Q.
- -h, --filter-hit=SEQ
-
Keep only reads that have a hit to a sequence named
SEQ.
If
SEQ
is empty, reads are kept if they have any hit. If the
genome
parameter is set, only hits to that genome count.
- --delete-hit=SEQ
-
Delete alignments to
SEQ.
If
SEQ
is empty, all alignments are deleted. If the
genome
parameter is set, only alignments to that genome are deleted.
- --filter-qual=Q
-
Mask out bases with quality below
Q.
Such a base is replaced by the
N
ambiguity code.
- --multiplicity=N
-
Keep only reads of molecules that have been sequenced at least
N
times. Reads are considered to come from the same original molecule if
their aligned coordinates are identical.
- --subsample=F
-
Subsample a fraction
F
of the results. Every read is independently and randomly chosen to be
kept or not.
- --inside-regions=FILE
-
Read a list of regions from
FILE,
then keep only alignments that overlap an annotated region.
- --outside-regions=FILE
-
Read a list of regions from
FILE,
then keep only alignments that do not overlap an annotated region.
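The two region filters are complements of each other, which a short Python sketch can make concrete. The format of the region file is not described here, so the sketch assumes regions and alignments are already available as (sequence name, start, end) tuples with half-open coordinates; these tuples and all function names are hypothetical.

```python
def overlaps(a_start, a_end, r_start, r_end):
    """Half-open intervals [start, end) overlap iff each starts before the other ends."""
    return a_start < r_end and r_start < a_end


def hits_region(alignment, regions):
    """True if the alignment overlaps any region on the same sequence."""
    name, start, end = alignment
    return any(name == r_name and overlaps(start, end, r_start, r_end)
               for r_name, r_start, r_end in regions)


def inside_regions(alignments, regions):
    """Keep alignments that overlap a region (models --inside-regions)."""
    return [a for a in alignments if hits_region(a, regions)]


def outside_regions(alignments, regions):
    """Keep alignments that overlap no region (models --outside-regions)."""
    return [a for a in alignments if not hits_region(a, regions)]


regions = [("chr1", 100, 200)]
alignments = [("chr1", 50, 100), ("chr1", 150, 250), ("chr2", 150, 250)]
print(inside_regions(alignments, regions))
print(outside_regions(alignments, regions))
```

Every alignment ends up in exactly one of the two results, so applying both filters to the same stream partitions it.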
Special Filters
- -d, --rmdup=Q
-
Remove PCR duplicates, clamp quality scores to
Q.
Two reads are considered to be duplicates, if their aligned coordinates
are identical. If a
genome
is set, the best alignment to that genome is
used, else the globally best alignment. Both alignments must be good,
as determined by
slope and intercept.
For a set of duplicates, a consensus is called, generally increasing the
quality scores. If a resulting quality score exceeds
Q,
it is set to
Q.
This filter requires the input to be sorted by alignment coordinate on
the selected
genome.
- --duct-tape=NAME
-
Duct-tape overlapping alignments into contigs and call a consensus for
them. If a genome is set, alignments to that genome are used, else the
globally best alignments. This filter requires input to be sorted by
alignment coordinate on the genome. Output is a set of contigs, every
position gets assigned a consensus base, a quality score and likelihoods
for every possible diallele. (It is called duct-taping because it kind
of looks like an assembly, but is not nearly as solid.)
- --edit-header=ED
-
Invoke the editor
ED
on the text representation of the stream's header. This can be used to
clean up headers that have accumulated too much cruft.
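To illustrate why --rmdup needs sorted input and what clamping to Q means, here is a rough Python sketch. It is not ANFO's consensus caller: the per-column rule (sum the qualities supporting each base, subtract the disagreeing ones) is a deliberately naive stand-in, the check that alignments are good is omitted, and the record layout and function names are hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def call_consensus(reads, max_qual):
    """Naive consensus over reads given as (sequence, qualities) of equal length."""
    bases, quals = [], []
    # Transpose the reads into columns of (base, quality) pairs.
    for column in zip(*[zip(seq, qual) for seq, qual in reads]):
        support = defaultdict(int)
        for base, q in column:
            support[base] += q
        best = max(support, key=support.get)
        against = sum(q for b, q in support.items() if b != best)
        bases.append(best)
        # Agreement raises the score; the result is clamped to [0, max_qual].
        quals.append(min(max(support[best] - against, 0), max_qual))
    return "".join(bases), quals


def rmdup(records, max_qual):
    """Collapse runs of records with identical coordinates.

    records: iterable of (coords, sequence, qualities), sorted by coords.
    groupby only groups adjacent items, hence the sorting requirement.
    """
    out = []
    for coords, group in groupby(records, key=lambda r: r[0]):
        reads = [(seq, qual) for _, seq, qual in group]
        seq, qual = call_consensus(reads, max_qual)
        out.append((coords, seq, qual))
    return out


records = [
    (("chr1", 10, 14), "ACGT", [30, 30, 30, 30]),
    (("chr1", 10, 14), "ACGA", [30, 30, 30, 10]),
    (("chr1", 20, 24), "TTTT", [50, 50, 50, 50]),
]
for record in rmdup(records, 40):
    print(record)
```

Because duplicates are detected by comparing neighbouring records, unsorted input would leave duplicates that are not adjacent undetected in this sketch.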
Merging Filters
Exactly one merging filter should be given on the command line. All filter options occurring before it are part of the input filter chains; all further filters become output chains. If no merging filter is given, --concat is assumed, and all filters are input filters.
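The way a merging filter divides the command line can be sketched as follows. This is a simplified Python model, assuming the filter options have already been reduced to a list of option names in command-line order; the function name split_chains is hypothetical and the real parser also handles option arguments and output options.

```python
# Merging filter options as listed in this section.
MERGING = {"-c", "--concat", "-m", "--merge", "-j", "--join", "--mega-merge"}


def split_chains(filters):
    """Split option names into (input chain, merging filter, output chain)."""
    for i, opt in enumerate(filters):
        if opt in MERGING:
            return filters[:i], opt, filters[i + 1:]
    # No merging filter given: --concat is assumed, everything is input.
    return list(filters), "--concat", []


print(split_chains(["--sort-pos", "--merge", "--rmdup"]))
print(split_chains(["--filter-score"]))
```

In the first call, --sort-pos runs on every input stream before merging and --rmdup runs on the merged result; in the second, the single filter is an input filter and the inputs are concatenated.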
- -c, --concat
-
Concatenate all input streams in the order they appear on the command
line.
- -m, --merge
-
Merge sorted input streams, producing a sorted result. All inputs must
be sorted in the same way.
- -j, --join
-
Join input streams and retain the single best hits to each genome.
Every input stream must contain a record for every read, reads are
buffered in memory until all of their hits are collected. This way,
joining works well if all inputs are nearly in the same order. If reads
are missing from some streams, joining them will waste memory.
- --mega-merge
-
Merge many streams such as those produced by running
anfo-sge.
Streams that operated on the same reads are joined, then everything is
merged.
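The requirement that all inputs to --merge be sorted in the same way follows from how a streaming merge works: it only ever compares the current head of each stream. A minimal Python sketch using the standard library's heapq.merge shows the idea; the record layout (sequence name, position, read name) and the function name merge_sorted are hypothetical, and this is not how anfo-tool itself is implemented.

```python
import heapq


def merge_sorted(streams, key):
    """Lazily merge already-sorted streams into one sorted stream."""
    return heapq.merge(*streams, key=key)


def by_position(record):
    """Sort key: sequence name first, then position."""
    return (record[0], record[1])


stream_a = [("chr1", 10, "read1"), ("chr1", 50, "read4"), ("chr2", 5, "read5")]
stream_b = [("chr1", 20, "read2"), ("chr1", 30, "read3")]
for record in merge_sorted([stream_a, stream_b], by_position):
    print(record)
```

If one input were sorted by read name instead, the merge would still run but its output would no longer be sorted, which is why mixing sort orders is not allowed.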
Output Options
If an output option is given on the command line, the current output filter chain is ended and a new one is started. If no output option is given, a textual representation of the final stream is written to stdout. All output options accept - to write to stdout.
- -o, --output FILE
-
Write native binary stream (a compressed protobuf message) to
FILE.
Writing a binary stream and reading it back in is lossless.
- --output-text FILE
-
Write protobuf text stream to
FILE.
If the necessary genomes are available, a textual representation of the
alignments is included. If the
context
parameter is set, that many additional bases of the reference upstream
and downstream from the alignment are included.
- --output-sam=FILE
-
Write alignments in SAM format to
FILE.
- --output-glz FILE
-
Write contigs in GLZ 0.9 format to
FILE.
Generating GLZ only works after application of
--duct-tape,
every contig becomes a GLZ record.
- --output-3aln FILE
-
Write contigs in a table based format to
FILE.
The format is still subject to change, see the source code for detailed
documentation.
- --output-fasta FILE
-
Write alignments(!) in FastA format to
FILE.
Alignments are written as pairs of reference and query sequence; aligned
coordinates are indicated in the description of the query sequence. If
the
context
parameter is set, that many additional bases of the reference upstream
and downstream from the alignment are included. This format is not
recommended for any serious use; it exists to support legacy applications.
- --output-fastq FILE
-
Write sequences(!) in FastQ format to
FILE.
Writing FastQ effectively reconstitutes the input to
ANFO
if no filtering was done on the results.
- --output-table FILE
-
Write per-alignment statistics to
FILE.
The file has three columns: sequence length, alignment score, difference
to next best alignment. It is mainly useful to analyze/visualize the
distribution of alignment scores.
- --stats FILE
-
Write simple statistics to
FILE.
This results in some simple summary statistics of a whole stream: number
of aligned sequences, average length, GC content.
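The summary statistics named for --stats are simple to state precisely. The Python sketch below computes them for a list of sequences; it assumes GC content is the fraction of G and C among all bases (ambiguity codes included in the denominator), which may differ from what anfo-tool reports, and the function name summary_stats is hypothetical.

```python
def summary_stats(sequences):
    """Count, average length and GC fraction for a list of aligned sequences."""
    count = len(sequences)
    total_bases = sum(len(s) for s in sequences)
    gc_bases = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    return {
        "count": count,
        "average_length": total_bases / count if count else 0.0,
        # Assumption: GC content is taken over all bases, including N.
        "gc_content": gc_bases / total_bases if total_bases else 0.0,
    }


print(summary_stats(["GGCC", "AATT"]))
```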
ENVIRONMENT
- ANFO_PATH
-
Colon-separated list of directories searched for genome and index files.
- ANFO_TEMP
-
Temporary space used for sorting of large files.
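A lookup along ANFO_PATH can be pictured like a PATH search. The Python sketch below assumes that directories are tried in order, that the first match wins, and that empty entries are skipped; none of these details are stated here, and the function name find_in_anfo_path and the file name in the example are hypothetical.

```python
import os


def find_in_anfo_path(filename, environ=None):
    """Return the first existing file named filename in ANFO_PATH, else None."""
    environ = os.environ if environ is None else environ
    for directory in environ.get("ANFO_PATH", "").split(":"):
        if not directory:
            continue  # assumption: empty entries are ignored
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


# Example with an explicit environment mapping instead of the real one.
print(find_in_anfo_path("example.dna", {"ANFO_PATH": "/data/genomes:/tmp"}))
```

Passing the environment as a mapping keeps the sketch easy to try out without changing the real ANFO_PATH.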
FILES
- /etc/popt
-
The system wide configuration file for popt(3). anfo-tool identifies
itself as "anfo-tool" to popt.
- ~/.popt
-
Per user configuration file for popt(3).
BUGS
The command line of this tool is way too complicated and its semantics are counterintuitive. Using anfo-tool is probably best avoided in most cases; the guile bindings should provide a much more scalable and easier to understand interface.