mirabait(1) - a grep-like tool to select reads with kmers up to 256 bp


mirabait [options] {-b baitfile [-b ...] | -B file | -j joblibrary} {-p file_1 file_2 | -P file3}* [file4 ...]


mirabait selects reads from a read collection that are partly similar or identical to sequences defined as target baits. Similarity is defined by finding a user-adjustable number of common k-mers (sequences of k consecutive bases) shared by the bait sequences and the screened sequences, either in forward direction only or in both forward and reverse complement direction. A DUST-like repeat filter for repeats of up to 4 bases can optionally be added.

When used on paired files, mirabait selects sequences where at least one mate of the pair matches.
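
The selection principle described above can be sketched in a few lines of Python. This is a simplified illustration, not mirabait's actual implementation: the function names are hypothetical, and counting one hit per matching k-mer position in the read is an assumption. `min_hits` plays the role of a positive -n value, and `both_strands=False` corresponds to searching the forward direction only.

```python
# Illustrative sketch only; not mirabait's actual implementation.
# All function names here are hypothetical.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all k-mers (length-k substrings) of seq, upper-cased."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_bait_index(baits, k, both_strands=True):
    """Collect the k-mers of every bait, optionally with reverse complements."""
    index = set()
    for bait in baits:
        index |= kmers(bait, k)
        if both_strands:
            index |= kmers(reverse_complement(bait), k)
    return index


def count_hits(read, index, k):
    """Count read positions whose k-mer occurs in the bait index (assumption)."""
    read = read.upper()
    return sum(read[i:i + k] in index for i in range(len(read) - k + 1))


def read_matches(read, index, k, min_hits=1):
    """A read is selected when it shares at least min_hits k-mers with the baits."""
    return count_hits(read, index, k) >= min_hits


def pair_matches(mate1, mate2, index, k, min_hits=1):
    """A pair is selected when at least one of its mates matches."""
    return (read_matches(mate1, index, k, min_hits)
            or read_matches(mate2, index, k, min_hits))
```

For example, with the bait `ACGTTGCAAGGCT` and k=5, the read `AGCCTT` matches only when reverse complements are indexed, because it is the reverse complement of the bait's last six bases.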


Main options:

-b file
Load bait sequences from file (multiple -b allowed)
-B file
Load baits from kmer statistics file, not from sequence files. Only one -B allowed, cannot be combined with -b. (see -K for creating such a file)
-j job
Set options for a predefined job from the supplied MIRA library. Currently available jobs:
rrna Bait rRNA sequences
-p file1 file2
Load paired sequences to search from file1 and file2. Files must contain the same number of sequences, and sequence names must be in the same order. Multiple -p allowed, but must come before non-paired files.
-P file
Load paired sequences from file. File must be interleaved: pairs must follow each other, non-pairs are not allowed. Multiple -P allowed, but must come before non-paired files.
-k int
kmer length of bait in bases (<=256, default=31)
-n int
If >0: minimum number of k-mer baits needed (default=1).
If <=0: allowed number of missed kmers over sequence length.
-d
Do not use kmers with microrepeats (DUST-like, see also -D)
-D int
Set length of microrepeats in kmers to discard from bait.
- int > 0 microrepeat len in percentage of kmer length. E.g.: -k 17 -D 67 --> 11.39 bases --> 12 bases.
- int < 0 microrepeat len in bases.
- int != 0 implies -d, int=0 turns DUST filter off.
-i
Selects sequences that do not hit bait
-I
Selects sequences that hit and do not hit bait (to different files)
-r
No checking of reverse complement direction
-t int
Number of threads to use (default=0 -> up to 4 CPU cores)
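
The arithmetic behind -D can be sketched as follows. This is a hedged illustration: the function name is hypothetical, and rounding the percentage up to the next whole base is an assumption inferred from the single example given for -D (-k 17 -D 67 --> 11.39 bases --> 12 bases).

```python
# Illustrative sketch only; the function name is hypothetical.

def microrepeat_length(k, d):
    """Translate a -D value into a microrepeat length in bases.

    Returns None when the DUST-like filter is turned off (d == 0).
    """
    if d == 0:
        return None          # filter off
    if d < 0:
        return -d            # negative values give the length in bases directly
    # Positive values are a percentage of the k-mer length; rounding up is
    # an assumption based on the documented example (11.39 -> 12).
    return -(-k * d // 100)  # integer ceiling of k * d / 100
```

With this sketch, `microrepeat_length(17, 67)` gives 12 and `microrepeat_length(17, -4)` gives 4.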

Options for output definition:

Normally mirabait writes separate result files (named 'bait_match_*' and 'bait_miss_*') for each input to the current directory. To change this behaviour and other output-related settings, use these options:
-c
No case change of sequence to denote bait hits
-l int
length of a line (FASTA only, default 0=unlimited)
-K file
Save kmer statistics to 'file' (see also -B)
-N name
Change the prefix 'bait' to <name>. Has no effect if -o/-O is used and targets are not directories.
-o <path>
Save sequences matching bait to path. If path is a directory, write separate files into this directory. If not, combine all matching sequences from the input file(s) into a single file specified by the path.
-O <path>
Like -o, but for sequences not matching

Other options:

-T dir
Use 'dir' as directory for temporary files instead of current working directory.
-m integer
Memory to use for computing kmer statistics
0..100 = use percentage of free system memory
>100 = amount of MiB to use (e.g. 16384 for 16 GiB)
Default 75 (75% of free system memory).
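
How a -m value maps to a memory budget can be illustrated with a short sketch. The function name and the `free_mib` parameter are hypothetical; how mirabait itself determines free system memory is not shown here.

```python
# Illustrative sketch only; names are hypothetical.

def kmer_memory_mib(m, free_mib):
    """Interpret a -m value as a memory budget in MiB.

    m in 0..100 is a percentage of free system memory (free_mib);
    m > 100 is an absolute amount in MiB.
    """
    if m < 0:
        raise ValueError("-m must not be negative")
    if m <= 100:
        return free_mib * m // 100
    return m
```

For instance, with 32000 MiB of free memory the default of 75 yields a budget of 24000 MiB, while `-m 16384` requests 16384 MiB (16 GiB) regardless of what is free.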

Defining files types to load/save:

Normally mirabait recognises file types by their file extension (even when packed). In cases where you need to force a certain file type because the file extension is non-standard, use the EMBOSS notation: <filetype>::<name_of_file>. E.g., to tell mirabait that "somefile.dat" is FASTQ, use: fastq::somefile.dat. Recognised types are: caf, fasta, fastq, gbf, gbk, gbff, maf and phd.
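
The way such a file specification could be resolved is sketched below. This is an illustration under stated assumptions, not mirabait's own parser: the function name is hypothetical, the list of packed-file suffixes (.gz, .bz2) is an assumption, and only extensions that literally equal one of the recognised type names are detected.

```python
# Illustrative sketch only; not mirabait's own parser.

RECOGNISED_TYPES = ("caf", "fasta", "fastq", "gbf", "gbk", "gbff", "maf", "phd")
PACKED_SUFFIXES = (".gz", ".bz2")  # assumption: which packers count as "packed"


def resolve_file_type(spec):
    """Return (filetype, filename) for a file argument.

    An explicit '<filetype>::<name_of_file>' prefix wins; otherwise the
    type is guessed from the extension, ignoring a packing suffix.
    The type is None when it cannot be determined.
    """
    prefix, sep, rest = spec.partition("::")
    if sep and prefix.lower() in RECOGNISED_TYPES:
        return prefix.lower(), rest
    name = spec.lower()
    for suffix in PACKED_SUFFIXES:
        if name.endswith(suffix):
            name = name[:-len(suffix)]
            break
    ext = name.rpartition(".")[2] if "." in name else ""
    return (ext if ext in RECOGNISED_TYPES else None), spec
```

Here `resolve_file_type("fastq::somefile.dat")` yields the type `fastq` for the file `somefile.dat`, whereas a bare `somefile.dat` has no recognisable type and would need the prefix.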

mirabait writes files in the same file type as the corresponding input files.

Examples:

mirabait -b b.fasta file.fastq
mirabait -I -j rrna -p file_1.fastq file_2.fastq
mirabait -b b1.fasta -b b2.gbk file.fastq
mirabait -b fasta::baits.dat -p fastq::file_1.dat fastq::file_2.dat
mirabait -b b.fasta -p file_1.fastq file_2.fastq -P file3.fasta file4.caf
mirabait -I -b b.fasta -p file_1.fastq file_2.fastq -P file3.fasta file4.caf
mirabait -k 27 -n 10 -b b.fasta file.fastq
mirabait -b fasta::b.dat fastq::file.dat
mirabait -o /dev/shm/ -b b.fasta -p file_1.fastq file_2.fastq
mirabait -o /dev/shm/match/ -b b.fasta -p file_1.fastq file_2.fastq
mirabait -b human_genome.fasta -K HG_kmerstats.mhs.gz -p file1.fastq file2.fastq
mirabait -B HG_kmerstats.mhs.gz -p file1.fastq file2.fastq
mirabait -d -B HG_kmerstats.mhs.gz -p file1.fastq file2.fastq


To report bugs or ask for features, please use the ticketing system at:


Bastien Chevreux <[email protected]>

This manual page was written by Bastien Chevreux <[email protected]> but can be freely used for any documentation purpose.