Most slides from Ben Langmead
DNA made up of A, G, C, T
In replication A=T, C=G
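A minimal sketch of the base-pairing rule in Python. The names COMPLEMENT, complement and reverse_complement are illustrative, not from the slides. The reverse complement is included because the two strands run in opposite directions.

```python
# Base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the base-by-base complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand):
    """Return the complement read in the opposite direction."""
    return complement(strand)[::-1]
```

For example, complement("AGCT") gives "TCGA".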
Sequencing: 90-300 base pairs
Genes: 100s-1000s base pairs
Human genome (all 46 chromosomes): 3 billion base pairs
(Figure: DNA strands with 5' and 3' ends labeled)
Given billions of snippets, figure out where they go in the reference genome
Number of possible alignments of pattern P against text T: |T| - |P| + 1
Example with |T| = 4, |P| = 3: 4 - 3 + 1 = 2
DOG DOOR DOG DOOR
Worst case: |P|(|T|-|P|+1) character comparisons
O(|T||P|)
Polynomial
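The counts above can be checked with a short sketch of naive exact matching in Python. The function name naive_match and the comparison counter are assumptions added for illustration. It tries every alignment and stops each one at the first mismatch.

```python
def naive_match(p, t):
    """Try pattern p at every offset of text t.

    Returns (occurrences, comparisons): the offsets where p matches,
    and the total number of character comparisons performed.
    """
    occurrences = []
    comparisons = 0
    for i in range(len(t) - len(p) + 1):  # |T| - |P| + 1 alignments
        match = True
        for j in range(len(p)):           # at most |P| comparisons each
            comparisons += 1
            if t[i + j] != p[j]:
                match = False
                break
        if match:
            occurrences.append(i)
    return occurrences, comparisons
```

With P = "AAA" and T = "AAAAA" every comparison succeeds, so the worst case is reached: 3 alignments, 3 comparisons each, 9 in total.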
What heuristics can we use to optimize this?
Each iteration is O(|P|)
Run preprocessing on the pattern so you know what to do when a match attempt fails!
T: ...WORLD...
P: WORD
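One standard way to preprocess the pattern is the Knuth-Morris-Pratt failure table. These notes do not say which algorithm the slides have in mind, so this is an illustrative choice, and the names failure_table and kmp_search are assumptions. The table records, for each prefix of P, how much of the match can be kept after a mismatch, so the text is never re-read.

```python
def failure_table(p):
    """fail[j] = length of the longest proper prefix of p[:j+1]
    that is also a suffix of p[:j+1]."""
    fail = [0] * len(p)
    k = 0
    for j in range(1, len(p)):
        while k > 0 and p[j] != p[k]:
            k = fail[k - 1]      # fall back to a shorter border
        if p[j] == p[k]:
            k += 1
        fail[j] = k
    return fail

def kmp_search(p, t):
    """Return all offsets where p occurs in t, scanning t once."""
    if not p:
        return []
    fail = failure_table(p)
    occurrences = []
    k = 0  # number of pattern characters currently matched
    for i, c in enumerate(t):
        while k > 0 and c != p[k]:
            k = fail[k - 1]      # mismatch: reuse what the table knows
        if c == p[k]:
            k += 1
        if k == len(p):
            occurrences.append(i - len(p) + 1)
            k = fail[k - 1]      # keep going to find overlapping matches
    return occurrences
```

Preprocessing takes O(|P|) and the scan O(|T|), compared with O(|T||P|) for trying every alignment from scratch.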
Shortest Superstring Problem
Input: Strings s_1, ..., s_n
Output: A string s that contains each of s_1, ..., s_n as a substring, such that the length of s is as small as possible
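A brute-force sketch of the problem statement in Python, only practical for a handful of strings since it tries every ordering. The names overlap and shortest_superstring are assumptions. Each ordering is merged left to right using the longest suffix/prefix overlap, and the shortest result is kept.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def shortest_superstring(strings):
    """Return one shortest string containing every input as a substring."""
    unique = list(dict.fromkeys(strings))  # drop duplicates, keep order
    # Strings contained in another input are covered automatically.
    kept = [s for s in unique if not any(s != t and s in t for t in unique)]
    if not kept:
        return ""
    best = None
    for order in permutations(kept):
        merged = order[0]
        for nxt in order[1:]:
            merged += nxt[overlap(merged, nxt):]  # append the non-overlapping tail
        if best is None or len(merged) < len(best):
            best = merged
    return best
```

For the inputs "ABC", "BCD", "CDE" this returns "ABCDE". The running time grows factorially with the number of strings, so this is a way to check small cases, not a practical assembler.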