Big O of the Genome

Most slides from Ben Langmead

  • Sequence Alignment (Pattern Matching)
  • Sequence Assembly (Superstring Problem)

Crash Course in Genetics

DNA made up of A, G, C, T

In replication, bases pair up: A=T, C=G
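
As a rough illustration of the pairing rule (A with T, C with G), here is a minimal Python sketch that builds the opposite strand of a DNA snippet. The names `COMPLEMENT` and `reverse_complement` are made up for this example, and it assumes the input contains only the four uppercase letters.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    # The two strands run in opposite directions, so the input is walked
    # backwards while each base is swapped for its partner.
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(reverse_complement("AACG"))  # CGTT
```

Applying the function twice gives back the original strand, which is a quick sanity check on the pairing table.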

Sequencing: 90-300 base pairs

Genes: 100s-1000s base pairs

Human genome (23 chromosomes per copy, 46 in total): about 3 billion base pairs per copy

[Figure: double-stranded DNA with the 5' and 3' ends of each strand labeled]

DNA Alignment

Given billions of snippets, figure out where they go in the reference genome

Try it out

Match abcd with abcasdfabcd...

Preprocessing

  • Naive
  • Knuth-Morris-Pratt
  • Boyer-Moore

Naive exact matching

Number of alignments to check: |T| - |P| + 1

Example with T = DOOR (|T| = 4) and P = DOG (|P| = 3): 4 - 3 + 1 = 2

DOG
DOOR

 DOG
DOOR

Worst-case character comparisons: |P|(|T| - |P| + 1)

O(|T||P|)

Polynomial time

What heuristics can we use to optimize this?

Each iteration is O(|P|)
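
The naive approach can be sketched in Python as below. This is only an illustrative version; the function name `naive_match` and the argument order are assumptions made for this example, not something taken from the slides.

```python
def naive_match(pattern: str, text: str) -> list[int]:
    """Return every offset in text where pattern occurs exactly."""
    occurrences = []
    # There are |T| - |P| + 1 alignments to try.
    for offset in range(len(text) - len(pattern) + 1):
        # Each alignment costs up to |P| character comparisons.
        for i, ch in enumerate(pattern):
            if text[offset + i] != ch:
                break  # mismatch: give up on this alignment
        else:
            occurrences.append(offset)  # no mismatch: exact match
    return occurrences


print(naive_match("abcd", "abcasdfabcd"))  # [7]
print(naive_match("DOG", "DOOR"))          # []
```

The early `break` helps on typical input, but in the worst case (for example a pattern and text made of one repeated letter) every alignment still costs |P| comparisons.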

Run preprocessing on pattern so you know what to do when it fails!

...WORLD...
   WORD
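
One way to preprocess the pattern is the failure table used by Knuth-Morris-Pratt. The sketch below is a minimal Python version under that assumption; the names `failure_table` and `kmp_match` are chosen for this example. After a mismatch, the table says how much of the pattern is still known to match, so the text pointer never moves backwards.

```python
def failure_table(pattern: str) -> list[int]:
    """fail[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]  # fall back to a shorter matching prefix
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_match(pattern: str, text: str) -> list[int]:
    """Return every offset in text where pattern occurs exactly."""
    if not pattern:
        return list(range(len(text) + 1))
    fail = failure_table(pattern)
    occurrences = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # reuse what the preprocessing learned
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            occurrences.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return occurrences


print(failure_table("abab"))                  # [0, 0, 1, 2]
print(kmp_match("WORD", "HELLOWORLDWORD"))    # [10]
```

With the table in hand, matching takes O(|T| + |P|) time instead of O(|T||P|).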

Pattern Matching

Genome as a haystack

  • Very long
  • Limited alphabet
  • Reference stays the same

DNA Indexing: Suffix Trees!
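
Because the reference stays the same, it pays to index it once and reuse the index for every query. A full suffix tree is too long for a slide, so the sketch below swaps in a suffix array, a simpler relative of the suffix tree, built the naive way by sorting suffixes. The function names are hypothetical, and the construction here is a toy: real tools use much faster and more compact methods.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in sorted suffix order."""
    # Naive construction for illustration only; it copies every suffix.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list[int], pattern: str) -> list[int]:
    """Binary-search the index for every offset where pattern occurs."""
    m = len(pattern)

    def prefix(rank: int) -> str:
        start = suffix_array[rank]
        return text[start:start + m]

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


reference = "GATTACATTA"
index = build_suffix_array(reference)          # built once
print(find_occurrences(reference, index, "TTA"))  # [2, 7]
```

Each lookup costs roughly O(|P| log |T|) comparisons, so the time per snippet depends only weakly on the length of the reference.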

DNA Fragment Assembly

Shortest Superstring Problem

Input: Strings s_1, ..., s_n

Output: A string s that contains all of s_1, ..., s_n as substrings, such that the length of s is as small as possible
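
Finding the true shortest superstring is NP-hard, so practical approaches rely on heuristics. The sketch below is one common heuristic, greedy merging of the pair with the largest overlap; it is an assumption for illustration, not the method from the slides, and it does not always return the shortest answer. The names `overlap` and `greedy_superstring` are made up for this example.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings: list[str]) -> str:
    """Repeatedly merge the two fragments with the largest overlap."""
    unique = sorted(set(strings))
    # Fragments contained in another fragment add nothing to the answer.
    frags = [s for s in unique if not any(s != t and s in t for t in unique)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 0
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Glue the pair together, writing the shared part only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags[0] if frags else ""


reads = ["ATTAC", "TACAG", "CAGGA"]
print(greedy_superstring(reads))  # ATTACAGGA
```

Every input string ends up as a substring of the result, which is what matters for assembly; only the minimal length is not guaranteed.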

Some resources...

  • Dr. Srinivas Aluru, Dr. Mark Borodovsky
  • rosalind.info (similar to Project Euler)

Big O Genomics

By Jessica Rosenfield