Fast, sustainable and secure bioinformatics with Rust-Bio and Rust-HTSlib
Bioinformatics software should be:
- efficient
- robust
- maintainable
Johannes Köster
University of Duisburg-Essen https://koesterlab.github.io
if __name__ == "__main__":
    print("Hello world! I am easy to use, "
          "but I can be slow and memory hungry "
          "and you might spend many hours with "
          "debugging runtime errors.")
#include <iostream>

int main()
{
    std::cout << "Hello World! I am fast "
                 "but you might spend the "
                 "rest of your life with debugging."
              << std::endl;
    return 0;
}
The problem
Do mainstream languages (so far) support this goal?
A solution:
the Rust programming language
Ownership concept:
- Each value is owned by one variable.
- Only one owner at a time.
- When the owner goes out of scope, the value is dropped.
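The ownership rules above can be illustrated with a minimal sketch. The function name `consume` and the example string are hypothetical, chosen only for illustration; the commented-out line marks where the compiler would reject a use after move.

```rust
// Hypothetical helper: takes ownership of the String.
// The value is dropped when `consume` returns.
fn consume(s: String) -> usize {
    s.len()
}

fn main() {
    let a = String::from("ACGT");
    let b = a; // ownership moves from `a` to `b`
    // println!("{}", a); // would not compile: `a` no longer owns the value
    let n = consume(b); // ownership moves into the function
    // `b` cannot be used here; the String was dropped inside `consume`
    println!("length: {}", n);
}
```

Because a move is only a transfer of ownership, no data is copied and no garbage collector is involved.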
Borrowing concept:
- Values can be borrowed via references.
- There can be any number of immutable references at a time.
- There can be only one mutable reference at a time.
- When there is a mutable reference, no immutable references are allowed.
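let mut s = String::from("hello");
let r1 = &mut s;
let r2 = &mut s; // compiler error: second mutable borrow while r1 is still in use
println!("{}, {}", r1, r2);

let mut s = String::from("hello");
let r1 = &s;
let r2 = &s;
let r3 = &mut s; // compiler error: mutable borrow while r1 and r2 are still in use
println!("{}, {}, {}", r1, r2, r3);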
fn main() {
    let reference_to_nothing = dangle();
}

fn dangle() -> &String {
    let s = String::from("hello");
    &s // compiler error
}
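One way to make the dangling example compile, sketched here under the assumption that the caller simply needs the string, is to return the value itself so that ownership moves to the caller. The name `no_dangle` is hypothetical.

```rust
// Returning the String by value moves ownership to the caller,
// so nothing is dropped while a reference to it still exists.
fn no_dangle() -> String {
    let s = String::from("hello");
    s
}

fn main() {
    let owned = no_dangle();
    println!("{}", owned);
}
```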
fn main() {
    println!("Hello World! I am fast and robust. \
              I will shift your dev time from \
              debugging to compiling.");
}
Compile-time guarantees (in safe Rust):
- memory safety: no dangling pointers, no segmentation faults from invalid memory access
- thread safety: no data races, because the compiler forces you to explicitly consider synchronization
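The thread-safety guarantee can be sketched with the standard library alone. In this minimal example the function name `parallel_count` and its parameters are hypothetical; the point is that the shared counter must be wrapped in `Arc<Mutex<...>>`, since sharing a plain mutable integer across threads would be rejected at compile time.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical helper: several threads increment one shared counter.
fn parallel_count(n_threads: usize, increments: u64) -> u64 {
    // Arc provides shared ownership, Mutex provides synchronized access.
    let counter = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..increments {
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    // Wait for all threads to finish.
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}

fn main() {
    println!("total: {}", parallel_count(4, 1000));
}
```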
Rust-Bio and Rust-HTSlib
File format support:
- SAM/BAM/CRAM
- VCF/BCF/Tabix
- FASTA
- FASTQ
- BED
- GFF
- GTF
Data structures:
- suffix array
- BWT
- FM(D) index
- q-gram index
- rank/select
- Fenwick tree
- pair-HMM
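To give an idea of what the first two data structures in this list compute, here is a deliberately naive sketch using only the standard library. It is not the Rust-Bio implementation; the function names `naive_suffix_array` and `naive_bwt` are hypothetical, and the text is assumed to end with a sentinel `$` that is lexicographically smaller than all other symbols.

```rust
// Naive suffix array: sort all suffix start positions by comparing the
// suffixes directly. Simple, but far slower than a real construction
// algorithm on large texts.
fn naive_suffix_array(text: &[u8]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}

// BWT derived from the suffix array: for each sorted suffix, take the
// symbol that precedes it in the text (wrapping around for position 0).
fn naive_bwt(text: &[u8], sa: &[usize]) -> Vec<u8> {
    sa.iter()
        .map(|&p| if p == 0 { text[text.len() - 1] } else { text[p - 1] })
        .collect()
}

fn main() {
    let text = b"banana$";
    let sa = naive_suffix_array(text);
    let bwt = naive_bwt(text, &sa);
    println!("{:?}", sa);
    println!("{}", String::from_utf8_lossy(&bwt));
}
```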
Algorithms:
- local/semiglobal/global alignment
- pairwise distances
- sparse k-mer based alignment
- Myers bit-parallel alignment
- Ukkonen
- Knuth-Morris-Pratt
- Shift-And
- Horspool
- BNDM
- BOM
- PSSM-based motif search
Utilities:
- numerically stable log-probabilities
- PHRED-scale conversion
- cumulative distribution functions
- combinatorics
- read/variant ringbuffers
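The first two utilities can be sketched in a few lines of plain Rust. The helper names below (`phred_to_prob`, `prob_to_phred`, `ln_add_exp`) are hypothetical and are not the Rust-Bio API; they only illustrate the PHRED definition Q = -10 * log10(p) and the usual log-sum-exp trick for adding probabilities in log space without underflow.

```rust
// PHRED quality score to error probability: p = 10^(-Q / 10).
fn phred_to_prob(q: f64) -> f64 {
    10f64.powf(-q / 10.0)
}

// Error probability to PHRED quality score: Q = -10 * log10(p).
fn prob_to_phred(p: f64) -> f64 {
    -10.0 * p.log10()
}

// Numerically stable ln(exp(a) + exp(b)) for natural-log probabilities.
// Factoring out the larger term keeps the exponent non-positive, so it
// cannot overflow, and the result stays finite for very small inputs.
fn ln_add_exp(a: f64, b: f64) -> f64 {
    if a == f64::NEG_INFINITY {
        return b;
    }
    if b == f64::NEG_INFINITY {
        return a;
    }
    let (hi, lo) = if a > b { (a, b) } else { (b, a) };
    hi + (lo - hi).exp().ln_1p()
}

fn main() {
    println!("Q30 -> p = {}", phred_to_prob(30.0));
    println!("p = 0.01 -> Q = {}", prob_to_phred(0.01));
    // exp(-1000) underflows to 0 in f64, but the log-space sum is still finite.
    println!("ln(e^-1000 + e^-1000) = {}", ln_add_exp(-1000.0, -1000.0));
}
```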
// Import some modules
use bio::alphabets;
use bio::data_structures::suffix_array::suffix_array;
use bio::data_structures::bwt::{bwt, less, Occ};
use bio::data_structures::fmindex::{FMIndex, FMIndexable};
use bio::io::fastq;

// a given text, terminated by the sentinel `$` that suffix_array expects
let text = b"ACAGCTCGATCGGTA$";

// Create an FM-Index for the given text.

// instantiate an alphabet
let alphabet = alphabets::dna::iupac_alphabet();
// calculate a suffix array
let pos = suffix_array(text);
// calculate BWT
let bwt = bwt(text, &pos);
// calculate less and Occ
let less = less(&bwt, &alphabet);
let occ = Occ::new(&bwt, 3, &alphabet);
// setup FMIndex
let fmindex = FMIndex::new(&bwt, &less, &occ);

// Iterate over a FASTQ file, use the alphabet to validate read
// sequences and search for exact matches in the FM-Index.

// obtain reader or fail with error (via the unwrap method)
let reader = fastq::Reader::from_file("reads.fastq").unwrap();
for result in reader.records() {
    // obtain record or fail with error
    let record = result.unwrap();
    // obtain sequence
    let seq = record.seq();
    // search only reads that consist of valid IUPAC DNA symbols
    if alphabet.is_word(seq) {
        // suffix array interval of all exact matches of the read
        let interval = fmindex.backward_search(seq.iter());
        // text positions of those matches
        let positions = interval.occ(&pos);
    }
}
https://rust-bio.github.io
Poster Rust-Bio 2018
By Johannes Köster