Loading
aalexmmaldonado
This is a live streamed presentation. You will automatically follow the presenter and see the slide they're currently on.
Computational Biology
(BIOSC 1540)
Feb 13, 2025
Lecture 06B
Read mapping
Methodology
Assignments
Quizzes
CBytes
ATP until the next reward: 653
Cells will transcribe these genes into mRNA (i.e., transcripts)
We collect, convert to complementary DNA, and then amplify
Suppose we have the following three coding (i.e., genes) and non-coding regions with introns and exons
DNA
Fragmentation and sequencing
FASTQ reads
Each RNA-seq read represents a small fragment of a transcript
Our purified sample
By mapping reads to a reference genome or transcriptome, we can:
DNA
Reads
Read mapping determines where in the genome did these reads originate from.
DNA
Reads
Unlike DNA sequencing, RNA sequencing includes spliced transcripts.
Key problem: Reads from mRNA span exon-exon junctions, but the genome contains introns.
Solution: Transcriptomic aligners must allow for gapped alignments that bridge exon-exon junctions.
Activity
The reads below were built by taking random words, slicing three letters, and then concatenating without spaces (all lowercase)
In groups, please work together to determine which words were used for your read
Be prepared to explain how you approached the problem
Danny loved spotting shapes in the clouds and had an entire journal filled with sketches of dragons, castles, and sailing ships. One day, he noticed a small cloud following him, shifting to match whatever he imagined. He tested it by thinking of a giant ice cream cone; sure enough, it transformed before his eyes. Delighted, he ran home, wondering how much fun he could have with a personal cloud. His only concern was making sure it didn’t rain inside his room.
Let's consider this short story as our genome containing coding and non-coding regions
0. danclodrashi
"Danny clouds dragons ships"
1. entmagcretra
2. spomatwoncon
3. notgiasaibed
4. lovdrathirai
?
?
?
?
What about doing 30?
We need to map millions of reads to our genome, so how could we approach this computationally?
danclodrashi
lovshaentjou
spofilskecas
saionenotsma
clofolshimat
whaimatestthi
giaicecrecon
surenotratrig
befeyedelran
homwonmucfun
cloonlconmak
surdidraiins
danlovsposke
shajoufildra
cassaishiday
notclofolshi
matwhaimates
thigiaicecre
consurenotra
couhavwitper
rigbefeyedel
ranhomwonmuc
funcouhavwit
percloonlcon
maksurdidrai
insbeddanlov
sposhacloent
joufilskedra
casnotsmaclo
folshimathwa
K-mer indexing
Danny loved spotting shapes in the clouds and had an entire journal filled with sketches of dragons, castles, and sailing ships [...]
We can chunk our story into k-mers and store where in the story they occur
"dan": [0]
"ann": [1]
"nny": [2]"clo": [35, 153, 382]
"lou": [36, 154, 383]
"oud": [37, 155, 384]
danny
clouds, cloud, cloud
"ing": [17, 116, 165, 178, 233, 335, 412]spotting, sailing, following, shifting, thinking, wondering, making
We can store these k-mers and indices and then use these to find potential sources
danclodrashi
Danny loved spotting shapes in the clouds and had an entire journal filled with sketches of dragons, castles, and sailing ships [...]
dan, clo, dra, shi
Read
Genome
Query k-mers
Check if k-mers are in our genome and the starting index of that k-mer
"dan": [0]
"clo": [35, 153, 382]
"dra": [92]
"shi": [120, 173]clouds, cloud, cloud
danny
dragon
ships, shifting
This data structure is called a hash table (i.e., dictionary in Python)
Keys represent a "label" we can use to get information
Example: Phone book / Contacts list
Name
A "hash function" determines where to find their number in our computer's memory
Number
5-mers
TACGT, ACGTA, CGTAC, GTACG, . . .
Reference genome
0
10
TACGTACGATAGTCGACTAGCATGCATGCTACGTGCTAGCACGTATGCATGCATGCATGCC
20
30
40
50
60
We hash our k-mer, and add the starting index where that k-mer occurs in our reference genome
TACGT
[1, 40]
h(k)
ACGTA
[0, 29]
k-mer location in genome
k-mer
Query a k-mer read to get indices that of possible reference genome locations
Reference genome
Hash table
TACGT
[0, 29]
ACGTA
[1, 40]
k-mer location in genome
k-mer
h(k)
0
10
TACGTACGATAGTCGACTAGCATGCATGCTACGTGCTAGCACGTATGCATGCATGCA
20
30
40
50
A "DNA dictionary" with quick lookup and direct access to potential matches
Pros
Cons
"dan": [0]
"clo": [35, 153, 382]
"dra": [92]
"shi": [120, 173]NA
NA$
$
4
2
0
1
3
5
A
$
NA
$
NA$
Root node
(Start here)
BANANA
$
A suffix tree is used to find starting index of suffix
Nowhere.
Edge
Node
Leaf node
Split point
Suffix start index
Next part of suffix
Where does start?
AANA
Index 2.
Example: Where does
start?
NANA$
Note: We use $ to represent the end of a string
BANANA
$
Requires less memory, but is also less powerful
1. Create all suffixes
2. Sort lexicographically
BANANA
$
ANANA
$
NANA
$
NA
$
ANA
$
A
$
$
3. Store starting indices in original string
6
5
3
1
0
4
2
$ comes before letters for sorting
String index
BANANA
$
ANANA
$
NANA
$
NA
$
ANA
$
A
$
$
Suffix
"Alex keeps talking and talking and talking and talking and eventually stops."
Suppose we need to store this string:
How could we store this string and save space?
"talking and talking and talking and talking and"
"talking and" 4
=
"Alex keeps talking and talking and talking and talking and eventually stops."
"Alex keeps" + "talking and" 4 +"eventually stops."
Run-length encoding
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec iaculis risus vulputate dui condimentum congue. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.
Can you find any repeats?
How can we force repeats?
Sorting the letters does!
.aaaaaaaaaaaabbcccccccccddddddeeeeeeeeeeeeeeeeeeeeeeefggghiiiiiiiiiiiiiiiillllllllmmmmmmmmnnnnnnnnnnoooooooopppppqqrrrrrrrssssssssssssssssstttttttttttttttttttuuuuuuuuuuuuuuuv
a12b2c9d6e23f1g3h1i16l8m8n10o8p5q2r7s17t19u15v1
Run-length encoding
The Burrows-Wheeler Transform (BWT) is a way to sort our strings without losing the original data
(And also search through it!)
Developed by Michael Burrows and David Wheeler in 1994
BANANA
$
A
$
BANAN
NA
$
BANA
$
BANANA
ANA
$
BAN
NANA
$
BA
ANANA
$
B
BANANA
$
A
$
BANAN
NA
$
BANA
$
BANANA
ANA
$
BAN
NANA
$
BA
ANANA
$
B
ANNB$AA
BANANA
First column is more compressible, but we lose context and reversibility
(We can also get first column by sorting the output)
ABRACADABRA$
ABRACADABR
$ABRACADAB
BRA$ABRACA
BRACADABRA
CADABRA$AB
DABRA$ABRA
RA$ABRACAD
RACADABRA$
ADABRA$ABR
ABRA$ABRAC
A$ABRACADA
ACADABRA$A
$
A0
A1
A2
A3
A4
B0
B1
C0
D0
R1
R0
A0
R0
D0
$
R1
C0
A1
A2
A3
A4
B1
B0
ABRACADABR
$ABRACADAB
BRA$ABRACA
BRACADABRA
CADABRA$AB
DABRA$ABRA
RA$ABRACAD
RACADABRA$
ADABRA$ABR
ABRA$ABRAC
A$ABRACADA
ACADABRA$A
$
A
A
A
A
A
B
B
C
D
R
R
A
R
D
$
R
C
A
A
A
A
B
B
The backward search algorithm efficiently finds occurrences of a pattern in a text using the LF-mapping
BWT matrix
Number
Number characters with the number of times they have appeard
F-column
L-column
Reference genome sizes
(~3.2 GB if using u8)
RNA-seq data
Contextualization
The best movie ever is only 1.2 GB
Most computers have 8 - 12 GB of RAM
ABRACADABRA$
ABRA
Suppose I want to find where
is located
1. Find rows with last character in search string (e.g., A) in F-column
2. Note which rows has the next letter (e.g., R) in L-column
3. Work backwards in search string until the first letter
A0
R0
D0
$
R1
C0
A1
A2
A3
A4
B1
B0
$
A0
A1
A2
A3
A4
B0
B1
C0
D0
R1
R0
A
R
ABRACADABR
$ABRACADAB
BRA$ABRACA
BRACADABRA
CADABRA$AB
DABRA$ABRA
RA$ABRACAD
RACADABRA$
ADABRA$ABR
ABRA$ABRAC
A$ABRACADA
ACADABRA$A
A0
R0
D0
$
R1
C0
A1
A2
A3
A4
B1
B0
$
A0
A1
A2
A3
A4
B0
B1
C0
D0
R1
R0
R
B
ABRACADABR
$ABRACADAB
BRA$ABRACA
BRACADABRA
CADABRA$AB
DABRA$ABRA
RA$ABRACAD
RACADABRA$
ADABRA$ABR
ABRA$ABRAC
A$ABRACADA
ACADABRA$A
A0
R0
D0
$
R1
C0
A1
A2
A3
A4
B1
B0
$
A0
A1
A2
A3
A4
B0
B1
C0
D0
R1
R0
B
A
ABRACADABR
$ABRACADAB
BRA$ABRACA
BRACADABRA
CADABRA$AB
DABRA$ABRA
RA$ABRACAD
RACADABRA$
ADABRA$ABR
ABRA$ABRAC
A$ABRACADA
ACADABRA$A
Given the string "mississippi$", complete the following tasks:
Seed
Extend
Read: ATCGATTGCA
k-mers (k=3)
ATC, TCG, CGA, GAT, ATT, TTG, TGC, GCA
Use hash table for rapid lookup of potential matches quickly
Start from seed match and grow in both directions with reference genome
CCGTATCGATTGCAGATG
GAT
[7, 14]
h(k)
Check to see if we can align the read to reference
Lecture 07A:
Quantification -
Foundations
Lecture 06B:
Sequence alignment -
Methodology
Today
Tuesday
Quiz 02
N
A
N
B
$
A
A
A
A
A
$
B
N
N
A
A
A
$
B
N
N
N
A
N
B
$
A
A
AN
AN
A$
$B
BA
NA
NA
AN
AN
A$
$B
BA
NA
NA
N
A
N
B
$
A
A
ANA
ANA
A$B
$BA
BAN
NAN
NA$
ANA
ANA
A$B
$BA
BAN
NAN
NA$
N
A
N
B
$
A
A
ANAN
ANA$
A$BA
$BAN
BANA
NANA
NA$B
BANANA$
NANA$BA
NA$BANA
A$BANAN
$BANANA
ANANA$B
ANA$BAN
2
3
2
3
2
3
2
...
Index: We take the key, count how many characters are in it
Note: This is a bad hashing function since "Alex" and "John" would result in the same index
Example
len("James") = 5
James
883-234-1236