BIOSC 1540: L09 (Quantification)

Computational Biology

(BIOSC 1540)

Sep 24, 2024

Lecture 09:
Quantification

Announcements

A04 is due Friday by 11:59 pm
Exam is next Thursday (Oct 3rd)

After today, you should be able to

1. Discuss the importance of normalization and quantification in RNA-seq data analysis.
2. Explain the relevance of pseudoalignment instead of read mapping.
3. Understand the purpose of Salmon's generative model.
4. Describe how salmon handles experimental biases in transcriptomics data.
5. Communicate the principles of inference in Salmon.

Let's pause and look at the big picture

Suppose we have isolated a normal and cancerous cell

Cancerous

Normal

We want to identify possible drug targets based on overexpressed genes

We will use transcriptomics!

Defining our transcriptome

Let's simplify our problem to only three transcripts

These represent the only mRNA transcripts we will find in our cells (i.e., the transcriptome)

l_3 > l_2> l_1

They have short, medium, and long lengths

t_2

t_3

t_1

Defining our gene expression

We have the following transcript distribution

Normal

Cancerous

More transcripts?

Relative to others?

We have to normalize our transcriptome before making comparisons

2

3

4

We can use transcripts fraction

\frac{2}{9} \approx 0.22

\frac{3}{9} \approx 0.33

\frac{4}{9} \approx 0.44

Ratios are sensitive to total amount

Normal

Cancerous

\frac{2}{9} \approx 0.22

\frac{3}{9} \approx 0.33

\frac{4}{9} \approx 0.44

\frac{4}{15} \approx 0.27

\frac{6}{15} \approx 0.4

\frac{5}{15} \approx 0.33

Because the cancer cell is transcribing more overall, we still get changes across the board

Scaling data to "parts per million"

Real data has more than three transcripts and ratios are substantially smaller (e.g., 0.000001)

This can make computations and communications challenging, so we often scale everything to a million to use unsigned integers

Transcript	Normal	Cancerous
1	222,222	266,666
2	333,333	400,000
3	444,444	333,333
Total	1,000,000	1,000,000

\frac{t_i}{\sum t_i}\cdot 10^6

Small floats require high precision (i.e., float64) and thus memory

Wait, what about sequencing depth?

RPK = \frac{\text{Read counts for gene}}{\text{Gene length in kilobases}}

t_2

t_3

t_1

Longer transcripts will have more reads

Read per kilobase (RPK) corrects this experimental bias through normalization by gene length

(Length is usually just the exons)

RPK example

RPK_1 = \frac{\text{Read counts for gene}}{\text{Gene length in kilobases}}

t_2

t_3

t_1

Reads per kilobase of transcript per million reads mapped

\text{RPKM} = 10^9 \frac{\text{Reads mapped to transcript}}{\text{Total reads} \cdot \text{Transcript length}}

Transcripts per million

\text{TPM} = 10^6 \frac{\text{RPKM}}{\sum_i \text{RPKM}_i}

After today, you should be able to

1. Discuss the importance of normalization and quantification in RNA-seq data analysis.
2. Explain the relevance of pseudoalignment instead of read mapping.
3. Understand the purpose of Salmon's generative model.
4. Describe how salmon handles experimental biases in transcriptomics data.
5. Communicate the principles of inference in Salmon.

Traditional quantification uses read mapping

Transcriptome

Reads/Fragments

We assign each read to single transcript using our read mapping algorithms

Once aligned, we can count the number of mapped reads to each transcript

1

2

4

Bowtie 2 uses Burrows-Wheeler Transform to map and quantify reads

Spliced Transcripts Alignment to a Reference (STAR)

Maximum Mappable Prefix (MMP) approach for fast, accurate spliced alignments

Finds prefix that perfectly matches reference then repeats for unmatched regions

This automatically detects junctions instead of relying on databases

Suppose someone took library books (transcripts) and then shredded them (reads)

In the context of our analogy, we not only need to find the book but which page it was from

This takes a long time

Alignment-based methods need to determine the read's exact position in the transcript

Alignment-based methods are computationally expensive

Identifies which transcripts are compatible with the read, skipping the precise location step

Pseudoalignment finds which transcript, but not where

Analogy: Just find books that are compatible and don't worry about which page

It does not worry about where within that transcript it originated

Alignment

Pseudoalignment

Specifies where exactly in the transcript this read came from
(e.g., at position 478)

Specifies that it came somewhere from this transcript (i.e., compatible)

Pseudoalignment: This method, used by tools like Kallisto, skips the full alignment process. Instead of mapping each read to a specific position, pseudoalignment identifies which transcripts are compatible with a given read

Bypassing alignment accelerates quantification

Pros: Faster and less resource-intensive than alignment-based methods
Cons: It may lack certain details, such as the position and orientation of reads, which are useful for correcting technical biases

After today, you should be able to

1. Discuss the importance of normalization and quantification in RNA-seq data analysis.
2. Explain the relevance of pseudoalignment instead of read mapping.
3. Understand the purpose of Salmon's generative model.
4. Describe how salmon handles experimental biases in transcriptomics data.
5. Communicate the principles of inference in Salmon.

Let's understand our problem

Initial sample

We have to use reads to quantity our initial sample

Has some number of transcripts

Fragments

Reads

After PCR amplification and fragmentation

Sequencing with imperfections

What is a generative model?

Generative model: A statistical model that explains how the observed data are generated from the underlying system

Defines a computational framework that produces sequencing reads from a population of transcripts

First, we have to define our model

Salmon's mathematical definition of a transcriptome

t_2

t_3

t_1

T

c_1 = 2

M = 3

c_2 = 3

c_3 =4

T = \left\{ \left( t_1, \ldots, t_M \right), \left( c_1, \ldots, c_M\right) \right\}

Our whole transcriptome

Individual transcripts

Transcript counts

Salmon's formulation of transcript abundance

So far, we have been talking about transcript fractions

\eta_i = \frac{c_i \tilde{l}_i}{\sum_j^M c_j \tilde{l}_j}

f_i = \frac{c_i}{\sum_j^M c_j}

c_1 = 2

c_2 = 3

c_3 =4

We can also take nucleotide fractions by taking into account the effective length of each transcript

This tells us how much of the total RNA pool comes from each transcript

\tilde{l}_1 = 200

\tilde{l}_2 = 300

\tilde{l}_3 =400

I will explain the effective length later. For now, think of it as a "corrected" length

\eta = \begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \\ \end{bmatrix}

Converting to relative abundances

The transcript fraction tells us the proportion of total RNA molecules in the sample that come from transcript i

\tau_i = \frac{\frac{\eta_i}{\tilde{l}_i}}{\sum_{j = 1}^M \frac{\eta_j}{\tilde{l}_j}}

\text{TPM}_i = \tau_i \cdot 10^6

This gives the relative abundance of each transcript i

Adjusts for the fact that longer transcripts generate more reads

The transcript fraction normalizes nucleotide fraction by the effective length

\tau_i

TPM is "Transcripts per million"

Transcript-Fragment Assignment Matrix

Z

is a binary matrix (i.e., all values are 0 or 1)

of M transcripts (rows) and N fragments (columns)

if fragment j is assigned to transcript i

Z_{i, j} = 1

Z = \begin{bmatrix} Z_{11} & Z_{12} & \dots & Z_{1N} \\ Z_{21} & Z_{22} & \dots & Z_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{M1} & Z_{M2} & \dots & Z_{MN} \end{bmatrix}

Transcript 1

Transcript 2

Transcript M

...

Fragment 1

Fragment 2

Fragment N

...

Z example

t_1

t_2

t_3

f_1

f_5

f_7

f_8

f_{10}

f_2

f_9

f_{11}

f_3

f_{6}

f_4

f_{12}

Z = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 1 \\ \end{bmatrix}

t_1

t_2

t_3

f _{1}

f_{2}

f_{3}

f_{4}

f_{5}

f_{6}

f_{7}

f_{8}

f_{9}

f_{10}

f_{11}

f_{12}

Suppose we have 3 transcripts and 12 fragments

Z is just how we computationally assign fragments to transcripts

Generative model inference

Transcript-fragment assignment

Transcript abundance

Z = \begin{bmatrix} Z_{11} & Z_{12} & \dots & Z_{1N} \\ Z_{21} & Z_{22} & \dots & Z_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{M1} & Z_{M2} & \dots & Z_{MN} \end{bmatrix}

\eta = \begin{bmatrix} \eta_1 \\ \vdots \\ \eta_M \\ \end{bmatrix}

N and M are same as experiment

Given these inputs, generate a distribution of fragments

Run 1

Run 2

t_1

t_2

t_3

Known from organism and experiment

Probability of observing the sequence fragments

Which scenario is more likely, given our generative model?

t_1

t_2

t_3

f_1

f_5

f_7

f_8

f_{10}

f_2

f_9

f_{11}

f_3

f_{6}

f_4

f_{12}

We can use probabilistic methods to find parameters that explain our observed distirbution

f_9

f_7

f_{11}

f_1

f_8

f_3

f_5

f_{10}

f_{6}

f_2

f_4

f_{12}

t_1

t_2

t_3

Conditional probability notation

P \left( a \vert b \right)

This reads, "What is the probability of a occurring if b is true?"

P \left( \text{Rain} \vert \text{Radar} \right)

Given this Radar, what is the probability of Rain in Oakland?

=

Example

Probability of observing the sequenced fragments

P \left( F \vert T, \eta, Z \right)

Given these parameters, how probable is it that our experiment generated these observed reads?

Optimize these values until we get the highest probability

Available transcripts

Transcript-fragment assignment

Z = \begin{bmatrix} Z_{11} & Z_{12} & \dots & Z_{1N} \\ Z_{21} & Z_{22} & \dots & Z_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{M1} & Z_{M2} & \dots & Z_{MN} \end{bmatrix}

\eta = \begin{bmatrix} \eta_1 \\ \vdots \\ \eta_M \\ \end{bmatrix}

Transcript abundance

After today, you should be able to

1. Discuss the importance of normalization and quantification in RNA-seq data analysis.
2. Explain the relevance of pseudoalignment instead of read mapping.
3. Understand the purpose of Salmon's generative model.
4. Describe how salmon handles experimental biases in transcriptomics data.
5. Communicate the principles of inference in Salmon.

Probability of observing the sequenced fragments

We can now compute the probability of observing:

Set of fragments

Given:

Transcript assignment

Transcript abundance

P \left( F \vert \eta, Z, T \right) = \prod_{j=1}^{N} \sum_{i=1}^{M} \eta_i P \left( f_j | t_i \right)

P \left( f_j | t_i \right)

Probability of observing fragment

given that it comes from transcript

f_j

t_i

This expression accounts for all possible transcripts a fragment might come from, weighted by how likely that fragment is to come from each transcript

F

Z

\eta

Transcriptome

T

Fragment probabilities

P \left( f_j | t_i \right)

is a conditional probability that depends on the position of the fragment within the transcript, the length of the fragment, and any technical biases

In Salmon’s quasi-mapping approach, this probability is approximated based on transcript compatibility rather than exact positions.

P \left( f_j | t_i \right) = P \left( \text{fragment length}, \text{position}, \text{GC content}, \ldots \right)

Positional bias

A transcript’s effective length adjusts for the fact that fragments near the ends of a transcript are less likely to be sampled

\tilde{l}_i = l_i - \mu_i

l_i

Fragments from central regions are more likely to be of optimal length for sequencing reads

Fragments that include transcript ends might be too short

\eta_i = \frac{c_i \tilde{l_i}}{\sum_i c_i \tilde{l_i}}

Mean of the truncated empirical fragment length distribution

\mu_i

\tilde{l}_i < l_i

After today, you should be able to

1. Discuss the importance of normalization and quantification in RNA-seq data analysis.
2. Explain the relevance of pseudoalignment instead of read mapping.
3. Understand the purpose of Salmon's generative model.
4. Describe how salmon handles experimental biases in transcriptomics data.
5. Communicate the principles of inference in Salmon.

Introduction to inference in Salmon

Inference refers to the process of estimating transcript abundances from observed RNA-seq reads using statistical models.
Salmon’s inference process involves estimating the most likely abundance of each transcript that could explain the observed set of fragments (reads).
It does this by solving a complex, high-dimensional problem where each fragment might map to multiple transcripts.

Two-phase inference in salmon

This two-phase approach balances speed (in the online phase) with accuracy (in the offline phase)

Salmon processes reads in two stages

Online phase

Makes fast, initial estimates of transcript abundances as the reads are processed

Offline phase

Refines these initial estimates using more complex optimization techniques

Online phase:
Stochastic variational inference

Initial estimates using quasi-mapping

Quasi-mapping is A fast, lightweight technique used to associate RNA-seq fragments with possible transcripts

Alignment is expensive, so quasi-mapping stops after identify seeds

Essentially early stopping of read mapping

Read mapping

GAT

[7, 14]

h(k)

CCGTATCGATTGCAGATG

Identify seeds, then extend and compute base-by-base alignment

This is what initializes compatible transcripts and abundance

\eta_t \approx \frac{\text{Number of fragments mappting to } t}{\text{Total number of fragments}}

Iteratively update parameters based on mini batches

Mini-batch 1

Mini-batch 2

Mini-batch 3

Take current parameters

Compute derivatives from batch

Update parameters (i.e., abundances)

Repeat for each batch

Offline Phase:
Expectation-Maximization (EM) algorithm

After the online phase, Salmon refines the estimates using a more complex optimization method, typically based on the Expectation-Maximization (EM) algorithm

Offline phase fine tunes transcript abundance

This phase ensures the accuracy of abundance estimates, incorporating the bias corrections learned during the online phase

Likelihood of the Data

This is the probability of observing the entire set of fragments $F$ , given the transcriptome $T$ and nucleotide fractions $η\eta$

The goal is to maximize this likelihood to infer the most likely values of $η\eta$ , which correspond to the relative abundances of the transcripts

The likelihood function is central to the inference process in Salmon:

\mathcal{L} \left\{ \alpha \vert F, Z, T \right\} = \prod_{j = i}^N \sum_{i = 1}^{M} \hat{\eta_i} Pr \left\{ f_j \vert t_i \right\}

Optimize the estimates of α, a vector of the estimated number of reads originating from each transcript

\hat{\eta_i} = \frac{\alpha_i}{\sum_j \alpha_j}

Maximum Likelihood Estimation (MLE)

The goal of maximum likelihood is to find the parameters (transcript abundances) that maximize the probability of the observed data (sequenced reads)

\mathcal{L} \left\{ \alpha \vert F, Z, T \right\} = \prod_{j = i}^N \sum_{i = 1}^{M} \hat{\eta_i} Pr \left\{ f_j \vert t_i \right\}

Optimize the estimates of α, a vector of the estimated number of reads originating from each transcript

Given α, η can be directly computed.

The likelihood function is central to the inference process in Salmon:

Why the EM Algorithm Maximizes the Likelihood

The EM algorithm works by breaking down a difficult problem into two simpler problems:

In the E-step, we estimate the missing information (the assignment of fragments to transcripts) using the current transcript abundance estimates.
In the M-step, we use the estimated assignments to update the transcript abundances, improving the likelihood.

At each iteration, the likelihood of the observed data increases, and the EM algorithm iteratively refines the transcript abundance estimates until it reaches a maximum

Before the next class, you should

A04 is due Friday
Study for exam

Lecture 10:
Differential gene expression

Lecture 09:
Quantification

Today

Thursday

Learning bias parameters

Pr \left\{ f_j \vert t_i \right\} = Pr \left\{ l \vert t_i \right\} \cdot Pr \left\{ p \vert t_i, l \right\} \cdot Pr \left\{ o \vert t_i \right\} \cdot Pr \left\{ a \vert f_j, t_i, p, o, l \right\}

For the first 5,000,000 observations we learn these defined bias parameters

Probability of generating fragment j from transcript i

Probability of drawing a fragment of the inferred length given t

Probability of fragment starting at position p on t

Probability of obtaining a fragment with the given orientation

Probability of alignment of fragment j given these mapping and transcript conditions

Probability of alignment fragment to transcript

Pr \left\{ a \vert f_j, t_i, p, o, l \right\} = Pr \left\{ s_o \right\} \prod_{k=1}^{\vert s \vert} Pr_{\left( \mathcal{M}_k \right\}} \left\{ s_{k-1} \rightarrow s_k \vert f_j, t_i, p, o, l \right\}

Represents the probability of the alignment starting at a particular position or state $s_o$

The probability of the fragment "transitioning to another state"

Essentially, how likely does this fragment align somewhere else

For fun: Hidden Markov Models