BIOSC 1540: L08 (Read mapping)

Computational Biology

(BIOSC 1540)

Sep 19, 2024

Lecture 08:
Read mapping

Announcements

A03 is due tonight by 11:59 pm

After today, you should be able to

1. Describe the challenges of aligning short reads to a large reference genome.
2. Compare read alignment algorithms, including hash-based and suffix tree-based approaches.
3. Explain the basic principles of the Burrows-Wheeler Transform (BWT) for sequence alignment.

We are dealing with enormous datasets

Reference genome sizes

Homo sapiens: 3,200,000,000 bp
Mus musculus: 2,700,000,000 bp
Drosophila melanogaster: 140,000,000 bp
Saccharomyces cerevisiae: 12,000,000 bp

(~3.2 GB if using u8)

RNA-seq data

Illumina RNA-seq is around 120 GB

Contextualization

The best movie ever is only 1.2 GB

Most computers have 8 - 12 GB of RAM

The Trade-off: Fast vs. Precise

Performance considerations

Balancing speed and accuracy
Efficient alignment for downstream analyses
Resource management (CPU, memory)

After today, you should be able to

1. Describe the challenges of aligning short reads to a large reference genome.
2. Compare read alignment algorithms, including hash-based and suffix tree-based approaches.
3. Explain the basic principles of the Burrows-Wheeler Transform (BWT) for sequence alignment.

A spectrum of alignment strategies

Hash tables

Mid 2000s

Late 2000s

Suffix arrays/trees

Burrows-Wheeler transforms

E.g., SOAP and MAQ

E.g., Bowtie2, BWA, STAR

Hash tables link a key to a value

Keys represent a "label" we can use to get information

Example: Contacts

Name

A "hash function" determines where to find their number

Number

Hash functions convert labels to table indices

h(k) = \text{len} (k)

Index: We take the key, count how many characters are in it

Note: This is a bad hashing function since "Alex" and "John" would result in the same index

Example

len("James") = 5

James

883-234-1236

Hashing our reference genome seeds our hash table with k-mer locations

5-mers
TACGT, ACGTA, CGTAC, GTACG, . . .

Reference genome

0

10

TACGTACGATAGTCGACTAGCATGCATGCTACGTGCTAGCACGTATGCATGCATGCATGCC

20

30

40

50

60

We hash our k-mer, and add the starting index where that k-mer occurs in our reference genome

TACGT

[1, 40]

h(k)

ACGTA

[0, 29]

k-mer location in genome

k-mer

Hashing our RNA-seq data provides quick lookups of our reference genome

Query a k-mer read to get indices that of possible reference genome locations

Reference genome

0

10

TACGTACGATAGTCGACTAGCATGCATGCTACGTGCTAGCACGTATGCATGCATGCATGCC

20

30

40

50

60

Hash table

TACGT

[0, 29]

ACGTA

[1, 40]

k-mer location in genome

k-mer

h(k)

Seed-and-extend in hash-based alignment

Seed

Extend

Read: ATCGATTGCA

k-mers (k=3)
ATC, TCG, CGA, GAT, ATT, TTG, TGC, GCA

Use hash table for rapid lookup of potential matches quickly

Start from seed match and grow in both directions with reference genome

CCGTATCGATTGCAGATG

GAT

[7, 14]

h(k)

Check to see if we can align the read to reference

Hash-Based Alignment:
Divide and Conquer

A "DNA dictionary" with quick lookup and direct access to potential matches

Pros

Easily parallelizable
Flexible for allowing mismatches
Conceptually simple

Cons

Large memory footprint for index
Can be slower for very large genomes

Suffix trees represent all suffixes of a given string

NA

NA$

$

4

2

0

1

3

5

A

$

NA

$

NA$

Root node
(Start here)

BANANA

$

A suffix tree is used to find starting index of suffix

Nowhere.

Edge

Node

Leaf node

Split point

Suffix start index

Next part of suffix

Where does start?

AANA

Index 2.

Example: Where does
start?

NANA$

Note: We use $ to represent the end of a string

Suffix arrays are memory-efficient alternatives to trees

BANANA

$

Requires less memory, but is also less powerful

1. Create all suffixes

2. Sort lexicographically

BANANA

$

ANANA

$

NANA

$

NA

$

ANA

$

A

$

3. Store starting indices in original string

6

5

3

1

0

4

2

Symbols come before letters for sorting

BANANA

$

ANANA

$

NANA

$

NA

$

ANA

$

A

$

After today, you should be able to

1. Describe the challenges of aligning short reads to a large reference genome.
2. Compare read alignment algorithms, including hash-based and suffix tree-based approaches.
3. Explain the basic principles of the Burrows-Wheeler Transform (BWT) for sequence alignment.

Compression reduces the amount of data we have to store

"Alex keeps talking and talking and talking and talking and eventually stops."

Suppose we need to store this string:

How could we store this string and save space?

"talking and talking and talking and talking and"

"talking and" 4

=

"Alex keeps talking and talking and talking and talking and eventually stops."

"Alex keeps" + "talking and" 4 +"eventually stops."

Run-length encoding

Not all strings have repeats

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec iaculis risus vulputate dui condimentum congue. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.

Can you find any repeats?

How can we force repeats?

Sorting the letters does!

.aaaaaaaaaaaabbcccccccccddddddeeeeeeeeeeeeeeeeeeeeeeefggghiiiiiiiiiiiiiiiillllllllmmmmmmmmnnnnnnnnnnoooooooopppppqqrrrrrrrssssssssssssssssstttttttttttttttttttuuuuuuuuuuuuuuuv

a12b2c9d6e23f1g3h1i16l8m8n10o8p5q2r7s17t19u15v1

Run-length encoding

Sorting lexicographically forces repeats, but loses original data

The Burrows-Wheeler Transform (BWT) is a way to sort our strings without losing the original data

(And also search through it!)

Developed by Michael Burrows and David Wheeler in 1994

Basic concept of BWT

Append a unique end-of-string (EOS) marker to the input string.
Generate all rotations of the string.
Sort these rotations lexicographically.
Extract the last column of the sorted matrix as the BWT output.

BANANA

$

A

$

BANAN

NA

$

BANA

$

BANANA

ANA

$

BAN

NANA

$

BA

ANANA

$

B

BANANA

$

A

$

BANAN

NA

$

BANA

$

BANANA

ANA

$

BAN

NANA

$

BA

ANANA

$

B

ANNB$AA

BANANA

First column is more compressible, but we lose context and reversibility

(We can also get first column by sorting the output)

N

A

N

B

$

A

$

B

N

A

$

B

N

A

N

B

$

A

AN

A$

$B

BA

NA

AN

A$

$B

BA

NA

N

A

N

B

$

A

ANA

A$B

$BA

BAN

NAN

NA$

ANA

A$B

$BA

BAN

NAN

NA$

N

A

N

B

$

A

BWT output is reversible!

ANAN

ANA$

A$BA

$BAN

BANA

NANA

NA$B

BANANA$

NANA$BA

NA$BANA

A$BANAN

$BANANA

ANANA$B

ANA$BAN

Write BWT output "vertically"
Sort each row starting from the left-most character
Append the same BWT output
Repeat until finished (length of rows equal BWT output length)

2

3

2

3

2

3

2

...

Enhancing BWT for Rapid Searching

ABRACADABRA$

ABRACADABR

$ABRACADAB

BRA$ABRACA

BRACADABRA

CADABRA$AB

DABRA$ABRA

RA$ABRACAD

RACADABRA$

ADABRA$ABR

ABRA$ABRAC

A$ABRACADA

ACADABRA$A

$

A₀

A₁

A₂

A₃

A₄

B₀

B₁

C₀

D₀

R₁

R₀

A₀

R₀

D₀

$

R₁

C₀

A₁

A₂

A₃

A₄

B₁

B₀

ABRACADABR

$ABRACADAB

BRA$ABRACA

BRACADABRA

CADABRA$AB

DABRA$ABRA

RA$ABRACAD

RACADABRA$

ADABRA$ABR

ABRA$ABRAC

A$ABRACADA

ACADABRA$A

$

A

B

C

D

R

A

R

D

$

R

C

A

B

The backward search algorithm efficiently finds occurrences of a pattern in a text using the LF-mapping

BWT matrix

Number

Number characters with the number of times they have appeard

F-column

L-column

ABRACADABRA$

ABRA

Suppose I want to find where

is located

1. Find rows with last character in search string (e.g., A) in F-column

2. Note which rows has the next letter (e.g., R) in L-column

3. Work backwards in search string until the first letter

A₀

R₀

D₀

$

R₁

C₀

A₁

A₂

A₃

A₄

B₁

B₀

$

A₀

A₁

A₂

A₃

A₄

B₀

B₁

C₀

D₀

R₁

R₀

A

R

ABRACADABR

$ABRACADAB

BRA$ABRACA

BRACADABRA

CADABRA$AB

DABRA$ABRA

RA$ABRACAD

RACADABRA$

ADABRA$ABR

ABRA$ABRAC

A$ABRACADA

ACADABRA$A

A₀

R₀

D₀

$

R₁

C₀

A₁

A₂

A₃

A₄

B₁

B₀

$

A₀

A₁

A₂

A₃

A₄

B₀

B₁

C₀

D₀

R₁

R₀

R

B

ABRACADABR

$ABRACADAB

BRA$ABRACA

BRACADABRA

CADABRA$AB

DABRA$ABRA

RA$ABRACAD

RACADABRA$

ADABRA$ABR

ABRA$ABRAC

A$ABRACADA

ACADABRA$A

A₀

R₀

D₀

$

R₁

C₀

A₁

A₂

A₃

A₄

B₁

B₀

$

A₀

A₁

A₂

A₃

A₄

B₀

B₁

C₀

D₀

R₁

R₀

B

A

ABRACADABR

$ABRACADAB

BRA$ABRACA

BRACADABRA

CADABRA$AB

DABRA$ABRA

RA$ABRACAD

RACADABRA$

ADABRA$ABR

ABRA$ABRAC

A$ABRACADA

ACADABRA$A

Backward search enables efficient searching using only first and last columns of BWT

BWT practice

Given the string "mississippi$", complete the following tasks:

Construct the Burrows-Wheeler Transform (BWT) of the string.
Use the LF-mapping to find the number and positions of occurrences of the following patterns in the original string:
- a) "si"
- b) "iss"
- c) "pp"

Announcements

After today, you should be able to

We are dealing with enormous datasets

The Trade-off: Fast vs. Precise

After today, you should be able to

A spectrum of alignment strategies

Hash tables

Suffix arrays/trees

Burrows-Wheeler transforms

Hash tables link a key to a value

Hash functions convert labels to table indices

Hashing our reference genome seeds our hash table with k-mer locations

Hashing our RNA-seq data provides quick lookups of our reference genome

Seed-and-extend in hash-based alignment

Hash-Based Alignment: Divide and Conquer

Suffix trees represent all suffixes of a given string

Suffix arrays are memory-efficient alternatives to trees

After today, you should be able to

Compression reduces the amount of data we have to store

Not all strings have repeats

Sorting lexicographically forces repeats, but loses original data

Basic concept of BWT

BWT output is reversible!

Enhancing BWT for Rapid Searching

Backward search enables efficient searching using only first and last columns of BWT

BWT practice

Before the next class, you should

Hash-Based Alignment:
Divide and Conquer