Fast, sustainable and secure bioinformatics with Rust-Bio and Rust-HTSlib

Bioinformatics software should be:

  • efficient
  • robust
  • maintainable

Johannes Köster

University of Duisburg-Essen
https://koesterlab.github.io

if __name__ == "__main__":
    print("Hello world! I am easy to use, "
          "but I can be slow and memory hungry "
          "and you might spend many hours "
          "debugging runtime errors.")
#include <iostream>

int main()
{
    // adjacent string literals are concatenated by the compiler
    std::cout << "Hello World! I am fast "
                 "but you might spend the "
                 "rest of your life debugging."
              << std::endl;
    return 0;
}

The problem

Do mainstream languages (so far) support this goal?

A solution:

the Rust programming language

Ownership concept:

  1. Each value is owned by exactly one variable, its owner.
  2. A value has only one owner at a time.
  3. When the owner goes out of scope, the value is dropped.
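
The three ownership rules can be sketched in a few lines. This is a minimal illustration, not taken from the slides; the function names `consume` and `ownership_demo` are hypothetical.

```rust
// Takes ownership of `s`; the string is dropped when this function returns.
fn consume(s: String) -> usize {
    s.len()
}

// Hypothetical demo function walking through the ownership rules.
fn ownership_demo() -> (usize, String) {
    let a = String::from("ACGT"); // `a` owns the string
    let b = a;                    // ownership moves to `b`; `a` is no longer usable
    // println!("{}", a);         // would not compile: `a` was moved

    let c = b.clone();            // explicit copy: `b` and `c` own separate values
    let n = consume(b);           // `b` is moved into `consume` and dropped there
    (n, c)                        // ownership of `c` moves to the caller
}
```

Calling `ownership_demo()` returns the length computed by `consume` together with the cloned string, which is the only value still alive at that point.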

Borrowing concept:

  • Values can be borrowed via references.
  • There can be any number of immutable references at a time.
  • There can be only one mutable reference at a time.
  • When there is a mutable reference, no immutable references are allowed.
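let mut s = String::from("hello");

let r1 = &mut s;
let r2 = &mut s;  // compiler error: second mutable borrow while r1 is still in use
println!("{}, {}", r1, r2);

let mut s = String::from("hello");

let r1 = &s;
let r2 = &s;
let r3 = &mut s; // compiler error: mutable borrow while r1 and r2 are still in use
println!("{}, {}, and {}", r1, r2, r3);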
fn main() {
    let reference_to_nothing = dangle();
}

fn dangle() -> &String {
    let s = String::from("hello");

    &s  // compiler error
}
fn main() {
    println!("Hello World! I am fast and robust. \
             I will shift your dev time from \
             debugging to compiling.");
}

Compile-time guarantees:

  • memory safety: no dangling pointers, no segmentation faults
  • thread safety: no data races; the compiler forces you to consider synchronization explicitly
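
As a rough illustration of the thread-safety point, the sketch below shares a counter between threads. The helper name `parallel_gc_count` is hypothetical and uses only the standard library. Without the `Arc` and `Mutex` wrappers, the compiler would reject the shared mutable access.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical helper: count G/C bases across sequence chunks, one thread per chunk.
fn parallel_gc_count(chunks: Vec<Vec<u8>>) -> usize {
    // Arc gives shared ownership across threads, Mutex guards the mutation.
    let total = Arc::new(Mutex::new(0usize));

    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            let total = Arc::clone(&total);
            // `move` transfers ownership of `chunk` and the Arc clone into the thread.
            thread::spawn(move || {
                let gc = chunk.iter().filter(|&&b| b == b'G' || b == b'C').count();
                *total.lock().unwrap() += gc;
            })
        })
        .collect();

    // Wait for all worker threads to finish.
    for handle in handles {
        handle.join().unwrap();
    }

    let result = *total.lock().unwrap();
    result
}
```

The synchronization is visible in the types: the closure passed to `thread::spawn` can only capture values that are safe to send to another thread.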

Rust-Bio and Rust-HTSlib

File format support:

  • SAM/BAM/CRAM
  • VCF/BCF/Tabix
  • FASTA
  • FASTQ
  • BED
  • GFF
  • GTF

Data structures:

  • suffix array
  • BWT
  • FM(D) index
  • q-gram index
  • rank/select
  • Fenwick tree
  • pair-HMM

Algorithms:

  • local/semiglobal/global alignment
  • pairwise distances
  • sparse k-mer based alignment
  • Myers bit-parallel alignment
  • Ukkonen
  • Knuth-Morris-Pratt
  • Shift-And
  • Horspool
  • BNDM
  • BOM
  • PSSM-based motif search
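
To give a feel for one of the listed exact pattern matching algorithms, here is a plain standard-library sketch of Knuth-Morris-Pratt. It is not Rust-Bio's implementation or API; the function name `kmp_find_all` is an assumption made for this example.

```rust
// Return the start positions of all (possibly overlapping) occurrences
// of `pattern` in `text` using Knuth-Morris-Pratt.
fn kmp_find_all(pattern: &[u8], text: &[u8]) -> Vec<usize> {
    if pattern.is_empty() {
        return Vec::new();
    }

    // lps[i] = length of the longest proper prefix of pattern[..=i]
    // that is also a suffix of it.
    let mut lps = vec![0usize; pattern.len()];
    let mut k = 0;
    for i in 1..pattern.len() {
        while k > 0 && pattern[i] != pattern[k] {
            k = lps[k - 1];
        }
        if pattern[i] == pattern[k] {
            k += 1;
        }
        lps[i] = k;
    }

    // Scan the text once; on a mismatch, fall back via lps instead of rescanning.
    let mut matches = Vec::new();
    let mut k = 0;
    for (i, &c) in text.iter().enumerate() {
        while k > 0 && c != pattern[k] {
            k = lps[k - 1];
        }
        if c == pattern[k] {
            k += 1;
        }
        if k == pattern.len() {
            matches.push(i + 1 - k);
            k = lps[k - 1];
        }
    }
    matches
}
```

The text is read exactly once, so the running time is linear in the combined length of pattern and text.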

Utilities:

  • numerically stable log-probabilities
  • PHRED-scale conversion
  • cumulative distribution functions
  • combinatorics
  • read/variant ringbuffers
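
The first two utilities can be illustrated with a standard-library sketch. The function names below are hypothetical and do not reflect Rust-Bio's types; the PHRED relation used is Q = -10 * log10(p).

```rust
// Add two probabilities given as natural logarithms without leaving log space.
// Factoring out the larger value keeps the exponential from underflowing to zero.
fn ln_add_exp(a: f64, b: f64) -> f64 {
    let (hi, lo) = if a > b { (a, b) } else { (b, a) };
    if hi == f64::NEG_INFINITY {
        return f64::NEG_INFINITY; // both probabilities are zero
    }
    hi + (lo - hi).exp().ln_1p()
}

// Convert a PHRED-scaled quality to an error probability.
fn phred_to_prob(q: f64) -> f64 {
    10f64.powf(-q / 10.0)
}

// Convert an error probability to a PHRED-scaled quality.
fn prob_to_phred(p: f64) -> f64 {
    -10.0 * p.log10()
}
```

For example, `ln_add_exp(-1000.0, -1000.0)` stays finite, whereas exponentiating either argument directly would underflow to zero in `f64`.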
// Import some modules
use bio::alphabets;
use bio::data_structures::suffix_array::suffix_array;
use bio::data_structures::bwt::{bwt, less, Occ};
use bio::data_structures::fmindex::{FMIndex, FMIndexable};
use bio::io::fastq;


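// a given text, terminated by the sentinel symbol '$'
// (suffix array construction expects the sentinel at the end)
let text = b"ACAGCTCGATCGGTA$";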

// Create an FM-Index for the given text.

// instantiate an alphabet
let alphabet = alphabets::dna::iupac_alphabet();
// calculate a suffix array
let pos = suffix_array(text);
// calculate BWT
let bwt = bwt(text, &pos);
// calculate less and Occ
let less = less(&bwt, &alphabet);
let occ = Occ::new(&bwt, 3, &alphabet);
// setup FMIndex
let fmindex = FMIndex::new(&bwt, &less, &occ);


// Iterate over a FASTQ file, use the alphabet to validate read
// sequences and search for exact matches in the FM-Index.

// obtain reader or fail with error (via the unwrap method)
let reader = fastq::Reader::from_file("reads.fastq").unwrap();
for result in reader.records() {
    // obtain record or fail with error
    let record = result.unwrap();
    // obtain sequence
    let seq = record.seq();
    if alphabet.is_word(seq) {
        let interval = fmindex.backward_search(seq.iter());
        let positions = interval.occ(&pos);
    }
}
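
To make the first two index construction steps above more concrete, the following sketch computes a suffix array and the BWT naively with the standard library only. It is far slower than a real implementation and is not how Rust-Bio computes them; the function names are assumptions made for this example.

```rust
// Naive suffix array: sort all suffix start positions lexicographically.
// The text is assumed to end with a sentinel such as '$' that is smaller
// than every other symbol.
fn naive_suffix_array(text: &[u8]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}

// BWT derived from the suffix array: for each suffix, take the symbol
// that precedes it in the text (wrapping around for the suffix at position 0).
fn bwt_from_sa(text: &[u8], sa: &[usize]) -> Vec<u8> {
    sa.iter()
        .map(|&p| if p == 0 { text[text.len() - 1] } else { text[p - 1] })
        .collect()
}
```

For the text `banana$`, this yields the suffix array `[6, 5, 3, 1, 0, 4, 2]` and the BWT `annb$aa`.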

https://rust-bio.github.io