Rust-Bio: Bioinformatics with Rust
Johannes Köster
July 2016
Bioinformatics
Bioinformatics
combines computer science, statistics, mathematics to interpret biological data
Involves:
Mathematical or statistical modelling, machine learning, ...
Almost always:
deal with biological sequences
DNA
encodes the genome, the "blueprint" of any organism
- DNA: chain of nucleotides
- Nucleotide: molecule made of phosphate, sugar and base (ACGT)
- sequence of bases is blueprint of proteins
Text over the alphabet ACGT
Proteins
building blocks of cells
- Protein: chain of amino acids
- Amino acid: molecule made of carbon, hydrogen, oxygen, nitrogen (about 500 known, 20 used in organisms)
Text over the alphabet ARNDCEQGHILKMFPSTWYV
Algorithms on sequences
Examples:
- find mutations
- investigate homologies
- identify species
Techniques:
- pattern/string matching
- succinct data structures
- indexes
Big data:
- size of human genome: >3 billion bases
- single sample: 5 to 300GB of data
Rust-Bio
Bioinformatics Software
Wanted:
- fast implementations (big data)
- bug-free* implementations (personalized medicine)
Reality:
- implementors are people with fixed-term contracts (PhD students, postdocs)
- limited time: scripting languages
- bugs and unmaintained projects all around
Rust for Bioinformatics
Idea:
- Let the Rust compiler do the work.
- Shift development time from debugging to compiling.
- Rely on lints and ownership model to solve code-quality and bug-issues.
Rust-Bio
common Bioinformatics algorithms and data structures as a library
Features:
- pattern matching algorithms
- alphabets
- pairwise alignments
- suffix arrays and fulltext indices
- a q-gram index
- a rank/select data structure,
- readers and writers for common file formats
- log probabilities
Rust-Bio by Example
Sequence alignment
ACGGATGCGATCGGAGCTGAGCTGCGATCGATGCGATCG ||| ||| |||||||| CGA---GAGTTGAGCTGC
Given:
sequence A and B
Find:
combination of substitutions, deletions and insertions with maximal score / minimal cost that transforms B into A
Sequence alignment
- alignment graph
- edges represent operations
- find path from top-left to bottom-right with minimal cost
Sequence alignment
Variants:
- edge costs: unit costs, affine gap costs, biologically motivated substitution costs
- semi-global alignment (top to bottom path)
- local alignment (no constraint)
Sequence alignment
Implementation:
- dynamic programming
- store only two columns
- traceback with bit-encoding
Usage
// two texts
let x = b"ACCGTGGAT";
let y = b"AAAAACCGTTGAT";
// scoring
let score = |a: u8, b: u8| if a == b {1i32} else {-1i32};
let gap_open = -5;
let gap_extend = -1;
// init aligner
let mut aligner = Aligner::with_capacity(x.len(), y.len(), gap_open, gap_extend, &score);
// perform alignment
let alignment = aligner.semiglobal(x, y);
// results
alignment.pretty()
// output:
// AAAAACCGTTGAT
// ||||| |||
// ----ACCGTGGAT
alignment.operations
// output:
// [Match, Match, Match, Match, Match, Subst, Match, Match, Match]
alignment.cigar()
// output:
// 5=1X3=
Memory management
pub struct Aligner<'a, F>
where F: 'a + Fn(u8, u8) -> i32
{
// dynamic programming matrix columns
S: [Vec<i32>; 2],
I: [Vec<i32>; 2],
D: [Vec<i32>; 2],
// traceback matrix (as bitvectors)
S_traceback: Traceback,
I_traceback: Traceback,
D_traceback: Traceback,
// scoring
gap_open: i32,
gap_extend: i32,
// function for custom substitution costs
score: &'a F,
}
Avoid reallocation:
aligner can be reused multiple times
Alignment macro
macro_rules! align {
(
$aligner:ident, $x:ident, $y:ident, $state:ident,
$init:block, $inner:block, $outer:block, $ret:block
) => (
{
let mut $state = AlignmentState { ... };
// iterate over columns
while $state.i <= $state.n {
// init block
$init
// read next y symbol
let b = $y[$state.i - 1];
$state.j = 1;
// iterate over rows
while $state.j <= $state.m {
// read next x symbol
let a = $x[$state.j - 1];
// calculate scores for substitution, insertion and deletion
...
// inner block
$inner
// update dynamic programming matrix with best score
$aligner.S[$state.col][$state.j] = $state.score;
$state.j += 1;
}
// outer block
$outer
$state.i += 1;
}
// return code
$ret
}
);
}
Global alignment
align!(
self,
x,
y,
state,
{
// init block (before each column)
self.S[state.col][0] = i32::MIN - self.gap_open;
self.I[state.col][0] = i32::MIN - self.gap_open;
self.D[state.col][0] = self.gap_open + (state.i as i32 - 1) * self.gap_extend;
self.S_traceback.del(state.i, 0);
self.D_traceback.del(state.i, 0);
self.I_traceback.del(state.i, 0);
},
{
// inner block (after each row)
},
{
// outer block (after each column)
},
{
self.alignment(state.n, state.m, x, y, state.score)
}
)
Semi-global alignment
align!(
self,
x,
y,
state,
{
// init block (before each column)
self.S[state.col][0] = 0;
},
{
// inner block (after each row)
},
{
// outer block (after each column)
if state.score > state.best || state.best_j != state.m {
state.best = state.score;
state.best_i = state.i;
state.best_j = state.m;
}
},
{
self.alignment(state.best_i, state.best_j, x, y, state.best)
}
)
Local alignment
align!(
self,
x,
y,
state,
{
// init block (before each column)
self.S[state.col][0] = 0;
},
{
// inner block (after each row)
if state.score < 0 {
self.S_traceback.start(state.i, state.j);
state.score = 0;
} else if state.score > state.best {
state.best = state.score;
state.best_i = state.i;
state.best_j = state.j;
}
},
{
// outer block (after each column)
},
{
self.alignment(state.best_i, state.best_j, x, y, state.best)
}
)
Example general purpose data structures
Bit encoding
use bio::data_structures::bitenc::BitEnc;
// two bit encoding
let mut bitenc = BitEnc::new(2);
// push values
bitenc.push(0);
bitenc.push(2);
bitenc.push(1);
// random access
bitenc.get(1)
// output:
// Some(2u8)
// iterate
bitenc.iter().collect();
// output:
// [0u8, 2u8, 1u8]
Mostly small integers
use bio::data_structures::smallints::SmallInts;
// init
let mut smallints: SmallInts<u8, u64> = SmallInts::new();
// add values
smallints.push(3);
smallints.push(4);
smallints.push(255);
smallints.push(305093);
// random access
smallints.get(0)
// output:
// Some(3u64)
// iteration
smallints.iter().collect()
// output:
// [3u64, 4u64, 255u64, 305093u64]
Mostly small integers
pub struct SmallInts<S: Integer + Bounded + NumCast + Copy, B: Integer + NumCast + Copy> {
smallints: Vec<S>,
bigints: BTreeMap<usize, B>,
}
impl<...> SmallInts<S, B> {
fn real_value(&self, i: usize, v: S) -> Option<B> {
if v < S::max_value() {
cast(v)
} else {
self.bigints.get(&i).cloned()
}
}
}
Rank/select
extern crate bit_vec;
use bio::data_structures::rank_select::RankSelect;
use bit_vec::BitVec;
// init
let mut bits = BitVec::from_elem(64, false);
bits.set(5, true);
bits.set(32, true);
let rs = RankSelect::new(bits, 1);
// query rank of bit
rs.rank(5)
// output:
// Some(1)
// select first bit with given rank
rs.select(2)
// output:
// Some(32)
0000100000000000000000000000000100 0000111111111111111111111111111222
bits:
ranks:
Outlook
Types:
- now: use type aliases to std where possible (e.g. texts are &[u8] or corresponding iterators)
- future: use newtypes where distinction makes sense (e.g. suffix array, logprob)
Parallelization:
- now: no internal parallelization but implementing Send and Sync
- future: maybe use rayon where possible, let the user decide by modifying a global switch
Acknowledgements
Contributors:
Christopher Schröder
Peer Aramillo Irizar
Fedor Gusev
Vadim Nazarov
Brad Chapman
Florian Gilcher
Erik Clarke
Rizky Luthfianto
Adam Perry
Taylor Cramer
Andre Bogus
Martin Larralde
Philipp Angerer
Pierre Marijon
Franklin Delehelle
Crates:
itertools
csv
nalgebra
approx
serde
Rust-Bio
By Johannes Köster
Rust-Bio
Invited talk about Rust-Bio at the Mozilla Rust Bay Area Meeting.
- 3,223