Big O of the Genome
Most slides from Ben Langmead
- Sequence Alignment (Pattern Matching)
- Sequence Assembly (Superstring Problem)
Crash Course in Genetics
DNA made up of A, G, C, T
In replication A=T, C=G
Sequencing reads: 90-300 base pairs
Genes: 100s-1000s base pairs
Human genome: about 3 billion base pairs per haploid set (23 chromosomes; a diploid cell has 46)
[Figure: two DNA strands with their 5' and 3' ends labeled]
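The pairing rule (A with T, C with G) and the opposite 5'-to-3' direction of the two strands can be sketched in a few lines. This is only an illustration: the names `COMPLEMENT` and `reverse_complement` are made up for this example, and the input is assumed to contain only uppercase A, C, G, T.

```python
# Translation table mapping each base to its pairing partner (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    # Complement every base, then reverse, because the strands run antiparallel.
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGC"))  # GCAT
```

A read can come from either strand, so an aligner typically has to consider both the read and its reverse complement.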
DNA Alignment
Given billions of snippets, figure out where they go in the reference genome
Try it out
Find the pattern abcd in the text abcasdfabcd...
Preprocessing
- Naive
- Knuth-Morris-Pratt
- Boyer-Moore
Naive exact matching
Number of alignments to try: |T| - |P| + 1
Example: |T| = 4, |P| = 3 gives 4 - 3 + 1 = 2
DOG DOOR DOG DOOR
Worst case: |P|(|T| - |P| + 1) character comparisons
O(|T||P|)
Polynomial
What heuristics can we use to optimize this?
Each iteration is O(|P|)
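A minimal sketch of naive exact matching, assuming plain Python strings for the pattern P and the text T. The function name `naive_match` and the comparison counter are illustrative additions so the |P|(|T| - |P| + 1) bound can be checked on small inputs.

```python
def naive_match(pattern, text):
    """Try every alignment of pattern against text.

    Returns (offsets where pattern occurs, number of character comparisons).
    """
    occurrences = []
    comparisons = 0
    # There are |T| - |P| + 1 possible alignments.
    for i in range(len(text) - len(pattern) + 1):
        matched = True
        # Each alignment costs at most |P| comparisons.
        for j in range(len(pattern)):
            comparisons += 1
            if text[i + j] != pattern[j]:
                matched = False
                break  # stop at the first mismatch
        if matched:
            occurrences.append(i)
    return occurrences, comparisons


if __name__ == "__main__":
    print(naive_match("abcd", "abcasdfabcd")[0])  # [7]
    print(naive_match("AAB", "AAAA"))             # ([], 6), the worst case 3 * 2
```

With |T| around 3 billion and |P| around 100, the worst-case bound is on the order of 3 x 10^11 comparisons for a single read, which is why the naive approach does not scale to billions of reads.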
Preprocessing
- Naive
- Knuth-Morris-Pratt
- Boyer-Moore
Run preprocessing on pattern so you know what to do when it fails!
Text: ...WORLD...  Pattern: WORD
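One way to "know what to do when it fails" is the Knuth-Morris-Pratt failure table, sketched below. This is a textbook-style version rather than the code behind the slides, and the names `build_failure_table` and `kmp_search` are chosen here for illustration.

```python
def build_failure_table(pattern):
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the prefix currently matched
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text, pattern):
    """Return all offsets where pattern occurs in text, in O(|T| + |P|) time."""
    if not pattern:
        return []
    failure = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # reuse what the preprocessing learned
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return matches


if __name__ == "__main__":
    print(build_failure_table("ABABC"))        # [0, 0, 1, 2, 0]
    print(kmp_search("abcasdfabcd", "abcd"))   # [7]
```

The text pointer never moves backward, so each text character is examined a bounded number of times on average, in contrast to the naive method's O(|T||P|) worst case.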
Pattern Matching
Genome as a haystack
- Very long
- Limited alphabet
- Reference stays the same
DNA Indexing: Suffix Trees!
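Because the reference stays the same, it pays to index it once and query it many times. A real suffix tree is too long to show here, so the sketch below swaps in a suffix array, a related and simpler index, built naively. The names `build_suffix_array` and `find_occurrences` are hypothetical, and the construction shown is far too slow for a real genome.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of text, in sorted order.

    Naive construction by sorting the suffixes directly; for illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Binary-search the suffix array for suffixes that start with pattern."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    reference = "GATTACAGATTACA"
    index = build_suffix_array(reference)           # built once
    print(find_occurrences(reference, index, "ATTA"))  # [1, 8]
```

Each query costs O(|P| log |T|) string comparisons instead of a scan of the whole reference, which is the point of indexing the haystack rather than preprocessing each needle.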
DNA Fragment Assembly
Shortest Superstring Problem
Input: Strings s_1, ..., s_n
Output: A string s that contains every s_i as a substring, such that the length of s is as small as possible
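The problem statement above can be explored with a greedy heuristic: repeatedly merge the two fragments with the largest suffix-prefix overlap. This is a sketch of one common approximation, not an exact solver, and it is not guaranteed to return the shortest superstring. The names `overlap` and `greedy_superstring` are illustrative.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Greedy heuristic for the shortest superstring (may be suboptimal)."""
    # Drop duplicates, then drop fragments already contained in another one.
    frags = list(dict.fromkeys(strings))
    frags = [s for s in frags if not any(s != t and s in t for t in frags)]

    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Find the ordered pair (i, j) with the largest overlap.
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the pair, writing the overlapping part only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]

    return frags[0] if frags else ""


if __name__ == "__main__":
    reads = ["ACGT", "GTAC", "ACTT"]
    print(greedy_superstring(reads))  # ACGTACTT
```

Real assemblers must also cope with sequencing errors and repeated regions, which this toy version ignores.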
Some resources...
- Dr. Srinivas Aluru, Dr. Mark Borodovsky
- rosalind.info (similar to Project Euler)
Big O Genomics
By Jessica Rosenfield