Computational Biology

(BIOSC 1540)

Feb 18, 2025

Lecture 07A

RNA Quantification

Foundations

Announcements

César will provide optional Python recitations on Fridays from 2 - 3 pm (Location TBD).
Please fill out the Canvas discussion for CBit 06

Quiz 02 is today
Quiz 03 is on Mar 18 and will cover L06B to L08B

Assignments

P02A will be released today or tomorrow

Quizzes

~~CBytes~~

CBits

After today, you should have a better understanding of

Quiz 02

Please put away all materials as we distribute the quiz

Fill out the cover page, and do not start yet

Sit with an empty seat between you and your neighbors for the quiz

Quiz ends around 9:51 am

When you are finished, please hold on to your quiz and feel free to doodle, write anything, tell me a joke, etc. on the last page

After today, you should have a better understanding of

The importance of quantification in RNA-seq data analysis

Let's remember the big picture: We want to quantify gene expression differences

Suppose we have isolated a normal and cancerous cell

Cancerous

Normal

We want to identify possible drug targets based on overexpressed genes

We have to quantify the amount of transcripts in our cell(s)

Our transcriptome is the set of mRNA transcripts our cell could have

Let's consider only three transcripts

l_3 > l_2> l_1

They have short, medium, and long lengths

t_2

t_3

t_1

We first have to define what possible transcripts our cells have

RNA quantification aims to determine the number of transcripts in my original sample

Normal

Cancerous

Example distribution of mRNA under different conditions

The RNA quantification problem statement

Transcriptome

Reads/Fragments

Given the sequencing reads that were sampled from these transcripts

How many copies of each transcript were in my original sample?

Unknown quantity

Experimental biases and errors

After today, you should have a better understanding of

The relevance of pseudoalignment

One way we could quantify transcripts is to use read mapping (i.e., read alignment)

Transcriptome

Reads/Fragments

We align each read to single transcript using our read mapping algorithms

1

2

4

Number of reads mapped to each transcript

Example: Bowtie 2 uses a FM-index for read mapping

DOI: 10.1038/nmeth.1923

1. Extract k-mer seeds from all reads in our sample

2. Use FM-index of transcriptome to search for k-mer matches

3. Extend alignments starting from k-mer seeds

These approaches are extremely slow (we have 30 to 60 million reads)

Suppose someone took library books (transcripts) and then shredded them (reads)

We would need to find which book each piece came from, but also which page

This would take a long time

Relating it back: Alignment-based methods need to determine the read's exact position in the transcript

Alignment-based methods are computationally expensive

Why?

We want to tape these books back together by using another library's copy as a reference

Strategy: Identify which transcripts are compatible with the read, skipping the precise location (i.e., alignment) step

Pseudoalignment finds which transcript reads came from but not the exact position

Instead of having to find the exact page each piece is from, suppose we just need to figure out which book

This is all we need to know for RNA quantification: Where did this read come from?

Alignment

Pseudoalignment

Specifies where exactly in the transcript this read came from
(e.g., at position 478)

Specifies that it came somewhere from this transcript (i.e., compatible)

Reads

Transcripts

Bypassing alignment accelerates quantification, but how can we do this?

Comparison

After today, you should have a better understanding of

Understand the use of generative models

Let's understand our problem

Initial sample

We can only use reads to quantity our initial mRNA sample

Has some number of transcripts

Fragments

Reads

After PCR amplification and fragmentation

Sequencing with errors

We can use a generative model to back-calculate transcript quantities from our reads

Generative model: A statistical model that explains how the observed data are generated from the underlying system

Defines a computational framework that produces sequencing reads from a transcriptome

Let's walk through a conceptual example

Exploring generative modeling with a bag of marbles

n_1

n_2

n_3

Suppose we take some number of green, blue, and red marbles (i.e., transcripts)

We put the marbles into a bag and crush them (i.e., PCR amplification and fragmentation)

Exploring generative modeling with a bag of marbles

We then take a handful of marble fragments (i.e., reads)

How would you estimate the distribution of marbles in the bag?

and then determine their color (i.e., map)

Suppose our marbles have different characteristics

Red marbles are twice the size

These differences are called biases

Blue marbles break easily

How would you adjust your approach?

Generative models estimate parameters that explain our observed experimental results

n_1

n_2

n_3

1. Guess some number of marbles

0. Determine/learn bias probabilities of sampling that specify our framework

2. Randomly sample n fragments

3. Compare to our original distribution

The probability of grabbing a blue fragment is 1.4x the green probability

The probability of grabbing a red fragment is 2.0x the green probability

If our simulated distribution closely matches our observed distribution, we have a likely estimate of the original (hidden) distribution

Does not match

We need to maximize the probability that our generative model and parameters explain our observations

n_1

n_2

n_3

1. Guess some number of marbles

2. Randomly sample n fragments

We keep trying new marble combinations until our generated fragments look very similar to our observed fragments

This is conceptually similar to how RNA quantification methods work (e.g., Salmon)

Wait, what about sequencing depth?

RPK = \frac{\text{Read counts for gene}}{\text{Gene length in kilobases}}

t_2

t_3

t_1

Longer transcripts will have more reads

Read per kilobase (RPK) corrects this experimental bias through normalization by gene length

(Length is usually just the exons)

RPK example

RPK_1 = \frac{\text{Read counts for gene}}{\text{Gene length in kilobases}}

t_2

t_3

t_1

Reads per kilobase of transcript per million reads mapped

\text{RPKM} = 10^9 \frac{\text{Reads mapped to transcript}}{\text{Total reads} \cdot \text{Transcript length}}

Transcripts per million

\text{TPM} = 10^6 \frac{\text{RPKM}}{\sum_i \text{RPKM}_i}

Learning bias parameters

Pr \left\{ f_j \vert t_i \right\} = Pr \left\{ l \vert t_i \right\} \cdot Pr \left\{ p \vert t_i, l \right\} \cdot Pr \left\{ o \vert t_i \right\} \cdot Pr \left\{ a \vert f_j, t_i, p, o, l \right\}

For the first 5,000,000 observations we learn these defined bias parameters

Probability of generating fragment j from transcript i

Probability of drawing a fragment of the inferred length given t

Probability of fragment starting at position p on t

Probability of obtaining a fragment with the given orientation

Probability of alignment of fragment j given these mapping and transcript conditions

Spliced Transcripts Alignment to a Reference (STAR)

Maximum Mappable Prefix (MMP) approach for fast, accurate spliced alignments

Finds prefix that perfectly matches reference then repeats for unmatched regions

This automatically detects junctions instead of relying on databases

Probability of alignment fragment to transcript

Pr \left\{ a \vert f_j, t_i, p, o, l \right\} = Pr \left\{ s_o \right\} \prod_{k=1}^{\vert s \vert} Pr_{\left( \mathcal{M}_k \right\}} \left\{ s_{k-1} \rightarrow s_k \vert f_j, t_i, p, o, l \right\}

Represents the probability of the alignment starting at a particular position or state $s_o$

The probability of the fragment "transitioning to another state"

Essentially, how likely does this fragment align somewhere else

For fun: Hidden Markov Models

We have to normalize our transcriptome before making comparisons

2

3

4

We can use transcripts fraction

\frac{2}{9} \approx 0.22

\frac{3}{9} \approx 0.33

\frac{4}{9} \approx 0.44

Ratios are sensitive to total amount

Normal

Cancerous

\frac{2}{9} \approx 0.22

\frac{3}{9} \approx 0.33

\frac{4}{9} \approx 0.44

\frac{4}{15} \approx 0.27

\frac{6}{15} \approx 0.4

\frac{5}{15} \approx 0.33

Because the cancer cell is transcribing more overall, we still get changes across the board

Scaling data to "parts per million"

Real data has more than three transcripts and ratios are substantially smaller (e.g., 0.000001)

This can make computations and communications challenging, so we often scale everything to a million to use unsigned integers

Transcript	Normal	Cancerous
1	222,222	266,666
2	333,333	400,000
3	444,444	333,333
Total	1,000,000	1,000,000

\frac{t_i}{\sum t_i}\cdot 10^6

Small floats require high precision (i.e., float64) and thus memory

Announcements

After today, you should have a better understanding of

Quiz 02

Please put away all materials as we distribute the quiz

Fill out the cover page, and do not start yet

Sit with an empty seat between you and your neighbors for the quiz

Quiz ends around 9:51 am

After today, you should have a better understanding of

The importance of quantification in RNA-seq data analysis

Let's remember the big picture: We want to quantify gene expression differences

Our transcriptome is the set of mRNA transcripts our cell could have

RNA quantification aims to determine the number of transcripts in my original sample

The RNA quantification problem statement

After today, you should have a better understanding of

The relevance of pseudoalignment

One way we could quantify transcripts is to use read mapping (i.e., read alignment)

Example: Bowtie 2 uses a FM-index for read mapping

Alignment-based methods are computationally expensive

Pseudoalignment finds which transcript reads came from but not the exact position

Comparison

After today, you should have a better understanding of

Understand the use of generative models

Let's understand our problem

We can use a generative model to back-calculate transcript quantities from our reads

Exploring generative modeling with a bag of marbles

Exploring generative modeling with a bag of marbles

Suppose our marbles have different characteristics

Generative models estimate parameters that explain our observed experimental results

We need to maximize the probability that our generative model and parameters explain our observations

After today, you should have a better understanding of

Python practice

Before the next class, you should

Wait, what about sequencing depth?

RPK example

Reads per kilobase of transcript per million reads mapped

Transcripts per million

Learning bias parameters

Spliced Transcripts Alignment to a Reference (STAR)

Probability of alignment fragment to transcript

We have to normalize our transcriptome before making comparisons

Ratios are sensitive to total amount

Scaling data to "parts per million"