BIOSC 1540: L07B (Quantification)

aalexmmaldonado

Computational Biology

(BIOSC 1540)

Feb 20, 2025

Lecture 07B

Quantification

Methodology

Announcements

  • César will provide optional Python recitations on Fridays from 2-3 pm (located in Clapp Hall, room TBD).

Assignments

  • P02A is due March 14 

Quizzes

CBits

After today, you should have a better understanding of

RNA quantification problem formulation

The RNA quantification problem statement

Transcriptome

Reads/Fragments

Given the sequencing reads that were sampled from these transcripts 

How many copies of each transcript were in my original sample?

Unknown quantity

Experimental biases and errors

We need to maximize the probability that our generative model and parameters explain our observations

[Figure: numbers of fragments n_1, n_2, n_3 sampled from each of three transcripts]

1. Estimate transcript abundance

2. Randomly sample n fragments

We iteratively optimize our transcript abundances until our generated reads look very similar to our observed reads
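To make this loop concrete, here is a minimal Python sketch (toy abundances and counts, not Salmon's implementation): given current abundance estimates, sample n fragments and compare the generated counts to the observed ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: current (hypothetical) abundance estimates for three transcripts
eta = np.array([0.2, 0.3, 0.5])

# Step 2: randomly sample n fragments according to those abundances
n = 1000
origins = rng.choice(len(eta), size=n, p=eta)
generated_counts = np.bincount(origins, minlength=len(eta))  # n_1, n_2, n_3

# Compare to the observed counts; if they disagree, adjust eta and repeat
observed_counts = np.array([210, 290, 500])  # toy observations
print(generated_counts, "vs", observed_counts)
```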

After today, you should have a better understanding of

Generative models for RNA quantification

Salmon's mathematical definition of a transcriptome

Example: M = 3 transcripts t_1, t_2, t_3 with counts c_1 = 2, c_2 = 3, c_3 = 4.

T = \left\{ \left( t_1, \ldots, t_M \right), \left( c_1, \ldots, c_M \right) \right\}

Our whole transcriptome

Individual transcripts

Transcript counts
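As a data structure, this definition maps directly onto a pair of tuples; a sketch using the example values above (transcript IDs are placeholders):

```python
# T = {(t_1, ..., t_M), (c_1, ..., c_M)} with M = 3 transcripts
transcripts = ("t1", "t2", "t3")  # individual transcripts
counts = (2, 3, 4)                # transcript counts c_1, c_2, c_3
T = (transcripts, counts)         # our whole transcriptome
M = len(transcripts)
```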

Salmon's formulation of transcript abundance

So far, we have been talking about transcript fractions

f_i = \frac{c_i}{\sum_j^M c_j}, \qquad c_1 = 2, \quad c_2 = 3, \quad c_3 = 4

We can also take nucleotide fractions by taking into account the effective length of each transcript:

\eta_i = \frac{c_i \tilde{l}_i}{\sum_j^M c_j \tilde{l}_j}, \qquad \tilde{l}_1 = 200, \quad \tilde{l}_2 = 300, \quad \tilde{l}_3 = 400

This tells us how much of the total RNA pool comes from each transcript.

I will explain the effective length later. For now, think of it as a "corrected" length

\eta = \begin{bmatrix} \eta_1 \\ \eta_2 \\ \eta_3 \\ \end{bmatrix}

Converting to relative abundances

The transcript fraction tells us the proportion of total RNA molecules in the sample that come from transcript i

\tau_i = \frac{\frac{\eta_i}{\tilde{l}_i}}{\sum_{j = 1}^M \frac{\eta_j}{\tilde{l}_j}}
\text{TPM}_i = \tau_i \cdot 10^6

\tau_i gives the relative abundance of each transcript i

The transcript fraction \tau_i normalizes the nucleotide fraction by the effective length, adjusting for the fact that longer transcripts generate more reads

TPM is "Transcripts per million"
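Chaining the definitions together with the example values from these slides (counts 2, 3, 4 and effective lengths 200, 300, 400); a minimal sketch:

```python
import numpy as np

c = np.array([2.0, 3.0, 4.0])            # transcript counts c_i
l_eff = np.array([200.0, 300.0, 400.0])  # effective lengths

f = c / c.sum()                            # transcript fractions f_i
eta = c * l_eff / (c * l_eff).sum()        # nucleotide fractions eta_i
tau = (eta / l_eff) / (eta / l_eff).sum()  # relative abundances tau_i
tpm = tau * 1e6                            # transcripts per million

print(tpm)  # sums to 10^6; here tau recovers f, since tau_i is proportional to c_i
```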

Transcript-Fragment Assignment Matrix

Z is a binary matrix (i.e., all values are 0 or 1) of M transcripts (rows) and N fragments (columns)

Z_{i, j} = 1 if fragment j is assigned to transcript i
Z = \begin{bmatrix} Z_{11} & Z_{12} & \dots & Z_{1N} \\ Z_{21} & Z_{22} & \dots & Z_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{M1} & Z_{M2} & \dots & Z_{MN} \end{bmatrix}


Z example

t_1: f_1, f_5
t_2: f_2, f_7, f_8, f_{10}
t_3: f_3, f_4, f_6, f_9, f_{11}, f_{12}

Z = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 1 \\ \end{bmatrix}

(rows: t_1, t_2, t_3; columns: f_1 through f_{12})

Suppose we have 3 transcripts and 12 fragments

Z is just how we computationally assign fragments to transcripts
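In NumPy, the example Z is just an array whose rows encode these assignments; a sketch:

```python
import numpy as np

# Rows: t_1..t_3; columns: f_1..f_12. Z[i, j] = 1 if fragment j+1
# is assigned to transcript i+1 (0-based indices in code).
Z = np.array([
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # t_1: f_1, f_5
    [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0],  # t_2: f_2, f_7, f_8, f_10
    [0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1],  # t_3: f_3, f_4, f_6, f_9, f_11, f_12
])

assert Z.sum(axis=0).tolist() == [1] * 12  # each fragment is assigned exactly once
```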

Generative model inference

Transcript-fragment assignment

Transcript abundance

Z = \begin{bmatrix} Z_{11} & Z_{12} & \dots & Z_{1N} \\ Z_{21} & Z_{22} & \dots & Z_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{M1} & Z_{M2} & \dots & Z_{MN} \end{bmatrix}
\eta = \begin{bmatrix} \eta_1 \\ \vdots \\ \eta_M \\ \end{bmatrix}

N and M are the same as in the experiment

Given these inputs, generate a distribution of fragments

[Figure: two simulation runs (Run 1, Run 2), each generating a different distribution of fragments across t_1, t_2, t_3; the transcriptome itself is known from the organism and experiment]

Probability of observing the sequenced fragments

Which scenario is more likely, given our generative model?

[Figure, scenario 1: t_1 gets f_1, f_5; t_2 gets f_2, f_7, f_8, f_{10}; t_3 gets f_3, f_4, f_6, f_9, f_{11}, f_{12}]

We can use probabilistic methods to find parameters that explain our observed distribution

[Figure, scenario 2: the same twelve fragments assigned differently across t_1, t_2, t_3]

Probability of observing the sequenced fragments

P \left( F \vert T, \eta, Z \right)

Given these parameters, how probable is it that our experiment generated these observed reads?

Optimize these values until we get the highest probability

Available transcripts

Transcript-fragment assignment

Z = \begin{bmatrix} Z_{11} & Z_{12} & \dots & Z_{1N} \\ Z_{21} & Z_{22} & \dots & Z_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{M1} & Z_{M2} & \dots & Z_{MN} \end{bmatrix}
\eta = \begin{bmatrix} \eta_1 \\ \vdots \\ \eta_M \\ \end{bmatrix}

Transcript abundance

After today, you should have a better understanding of

Probability optimization instead of generation

Probability of observing the sequenced fragments

We can now compute the probability of observing:

Set of fragments

Given:

Transcript assignment

Transcript abundance

F
Z
\eta

Transcriptome

T
P \left( F \vert \eta, Z, T \right) = \prod_{j=1}^{N} \sum_{i=1}^{M} \eta_i P \left( f_j | t_i \right)
P \left( f_j | t_i \right) is the probability of observing fragment f_j given that it comes from transcript t_i

This expression accounts for all possible transcripts a fragment might come from, weighted by how likely that fragment is to come from each transcript
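This formula translates line-for-line into NumPy; a sketch with toy values for \eta and P(f_j | t_i) (in practice one works with the log-likelihood to avoid numerical underflow):

```python
import numpy as np

M, N = 3, 12
eta = np.array([0.2, 0.3, 0.5])  # nucleotide fractions (toy values)

rng = np.random.default_rng(0)
cond = rng.random((M, N))        # cond[i, j] stands in for P(f_j | t_i)

# P(F | eta, Z, T) = product over fragments of sum over transcripts
per_fragment = eta @ cond            # sum_i eta_i * P(f_j | t_i), for each j
likelihood = per_fragment.prod()
log_likelihood = np.log(per_fragment).sum()
```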

Fragment probabilities

P \left( f_j | t_i \right) is a conditional probability that depends on the position of the fragment within the transcript, the length of the fragment, and any technical biases

In Salmon’s quasi-mapping approach, this probability is approximated based on transcript compatibility rather than exact positions.

P \left( f_j | t_i \right) = P \left( \text{fragment length}, \text{position}, \text{GC content}, \ldots \right)

Positional bias

A transcript’s effective length adjusts for the fact that fragments near the ends of a transcript are less likely to be sampled

\tilde{l}_i = l_i - \mu_i, \qquad \tilde{l}_i < l_i

l_i is the full transcript length; \mu_i is the mean of the truncated empirical fragment length distribution

Fragments from central regions are more likely to be of optimal length for sequencing reads

Fragments that include transcript ends might be too short

\eta_i = \frac{c_i \tilde{l}_i}{\sum_j c_j \tilde{l}_j}
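A sketch of the correction, using a single assumed mean fragment length for simplicity (Salmon computes \mu_i per transcript from the truncated fragment length distribution):

```python
import numpy as np

l = np.array([250.0, 350.0, 450.0])  # full transcript lengths l_i (toy values)
mu = 50.0                             # mean fragment length (assumed here)
l_eff = l - mu                        # effective lengths, each l~_i < l_i

c = np.array([2.0, 3.0, 4.0])         # transcript counts
eta = c * l_eff / (c * l_eff).sum()   # length-corrected nucleotide fractions
```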

After today, you should have a better understanding of

Probability maximization with inference

Two-phase inference in Salmon

This two-phase approach balances speed (in the online phase) with accuracy (in the offline phase)

Salmon processes reads in two stages

Online phase

Makes fast, initial estimates of transcript abundances as the reads are processed

Offline phase

Refines these initial estimates using more complex optimization techniques

Inference refers to the process of estimating transcript abundances from observed RNA-seq reads using statistical models.

Online phase:
Stochastic variational inference

Initial estimates using quasi-mapping

Quasi-mapping is a fast, lightweight technique used to associate RNA-seq fragments with possible transcripts

Alignment is expensive, so quasi-mapping stops after identifying seeds

Essentially early stopping of read mapping

Read mapping

[Figure: k-mer hashing example: h(k) maps the k-mer GAT to positions [7, 14] in the reference CCGTATCGATTGCAGATG]

Identify seeds, then extend and compute base-by-base alignment

This is what initializes compatible transcripts and abundance

\eta_t \approx \frac{\text{Number of fragments mapping to } t}{\text{Total number of fragments}}
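A toy sketch of both ideas, reusing the k-mer figure above (a plain dict index stands in for the suffix-array machinery real quasi-mapping uses; the mapped-fragment list is hypothetical and mirrors the earlier Z example):

```python
from collections import Counter

# Seed lookup: hash every k-mer of the reference to its positions (k = 3)
ref = "CCGTATCGATTGCAGATG"
k = 3
index = {}
for pos in range(len(ref) - k + 1):
    index.setdefault(ref[pos:pos + k], []).append(pos)
print(index["GAT"])  # [7, 14]: seeds found without base-by-base alignment

# Initial abundances: fraction of fragments mapping to each transcript
mapped = ["t1", "t3", "t2", "t3", "t1", "t3", "t2", "t2", "t3", "t2", "t3", "t3"]
counts = Counter(mapped)
eta0 = {t: counts[t] / len(mapped) for t in ("t1", "t2", "t3")}
print(eta0)  # t1: 2/12, t2: 4/12, t3: 6/12, matching the Z example
```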

Iteratively update parameters based on mini-batches

Mini-batch 1

Mini-batch 2

Mini-batch 3

Take current parameters

Compute derivatives from batch

Update parameters (i.e., abundances)

Repeat for each batch
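A simplified sketch of these four steps (a stochastic EM-style update with a fixed learning rate; Salmon's online phase actually uses stochastic variational inference, so this mirrors the structure rather than the exact math):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 12_000
cond = rng.random((M, N))            # toy P(f_j | t_i) values
eta = np.full(M, 1.0 / M)            # current parameters: uniform start

batch_size, lr = 1_000, 0.5
for start in range(0, N, batch_size):           # repeat for each batch
    batch = cond[:, start:start + batch_size]   # take the next mini-batch
    # fragment-to-transcript responsibilities under the current parameters
    resp = eta[:, None] * batch
    resp /= resp.sum(axis=0, keepdims=True)
    batch_estimate = resp.mean(axis=1)           # batch-local abundance estimate
    eta = (1 - lr) * eta + lr * batch_estimate   # update parameters (abundances)
```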

Offline Phase:
Expectation-Maximization (EM) algorithm

After the online phase, Salmon refines the estimates using a more complex optimization method, typically based on the Expectation-Maximization (EM) algorithm

The offline phase fine-tunes transcript abundances

This phase ensures the accuracy of abundance estimates, incorporating the bias corrections learned during the online phase

Likelihood of the Data

This is the probability of observing the entire set of fragments F, given the transcriptome T and nucleotide fractions \eta

The goal is to maximize this likelihood to infer the most likely values of \eta, which correspond to the relative abundances of the transcripts

The likelihood function is central to the inference process in Salmon:

\mathcal{L} \left\{ \alpha \vert F, Z, T \right\} = \prod_{j = 1}^{N} \sum_{i = 1}^{M} \hat{\eta}_i Pr \left\{ f_j \vert t_i \right\}

Optimize the estimates of α, a vector of the estimated number of reads originating from each transcript

\hat{\eta}_i = \frac{\alpha_i}{\sum_j \alpha_j}

Maximum Likelihood Estimation (MLE)

The goal of maximum likelihood is to find the parameters (transcript abundances) that maximize the probability of the observed data (sequenced reads)

\mathcal{L} \left\{ \alpha \vert F, Z, T \right\} = \prod_{j = 1}^{N} \sum_{i = 1}^{M} \hat{\eta}_i Pr \left\{ f_j \vert t_i \right\}

Optimize the estimates of α, a vector of the estimated number of reads originating from each transcript

Given α, η can be directly computed.

The likelihood function is central to the inference process in Salmon:

Why the EM Algorithm Maximizes the Likelihood

The EM algorithm works by breaking down a difficult problem into two simpler problems:

  • In the E-step, we estimate the missing information (the assignment of fragments to transcripts) using the current transcript abundance estimates.
  • In the M-step, we use the estimated assignments to update the transcript abundances, improving the likelihood.

At each iteration, the likelihood of the observed data does not decrease, and the EM algorithm iteratively refines the transcript abundance estimates until it converges to a maximum, in general a local one
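A compact sketch of this E/M loop for the likelihood defined above (toy P(f_j | t_i) values; Salmon's offline phase additionally groups fragments into equivalence classes and folds in the learned bias terms):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 12
cond = rng.random((M, N))     # toy P(f_j | t_i)
eta = np.full(M, 1.0 / M)     # start from uniform abundances

for _ in range(100):
    # E-step: expected assignment of each fragment to each transcript
    resp = eta[:, None] * cond
    resp /= resp.sum(axis=0, keepdims=True)
    # M-step: update abundances from those expected assignments
    alpha = resp.sum(axis=1)   # estimated reads per transcript
    eta = alpha / alpha.sum()  # eta_hat_i = alpha_i / sum_j alpha_j

print(eta)  # the likelihood never decreases; eta converges to a (local) maximum
```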

After today, you should have a better understanding of

Methodology with a Python implementation

Before the next class, you should

  • Work on P02A (due Mar 14)

Today (Lecture 07B): Quantification - Methodology

Tuesday (Lecture 08A): Differential gene expression - Foundations

Learning bias parameters

Pr \left\{ f_j \vert t_i \right\} = Pr \left\{ l \vert t_i \right\} \cdot Pr \left\{ p \vert t_i, l \right\} \cdot Pr \left\{ o \vert t_i \right\} \cdot Pr \left\{ a \vert f_j, t_i, p, o, l \right\}

For the first 5,000,000 observations, we learn the bias parameters defined below

Pr \left\{ f_j \vert t_i \right\}: probability of generating fragment j from transcript i

Pr \left\{ l \vert t_i \right\}: probability of drawing a fragment of the inferred length given t_i

Pr \left\{ p \vert t_i, l \right\}: probability of the fragment starting at position p on t_i

Pr \left\{ o \vert t_i \right\}: probability of obtaining a fragment with the given orientation

Pr \left\{ a \vert f_j, t_i, p, o, l \right\}: probability of the alignment of fragment j given these mapping and transcript conditions
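Numerically, the factorization is just a product of the four conditional terms; a toy sketch with assumed values for one fragment-transcript pair:

```python
# All four factor values below are assumed, purely for illustration
p_length = 0.08       # Pr{l | t_i}
p_position = 0.002    # Pr{p | t_i, l}
p_orientation = 0.5   # Pr{o | t_i}
p_alignment = 0.9     # Pr{a | f_j, t_i, p, o, l}

p_fragment = p_length * p_position * p_orientation * p_alignment  # Pr{f_j | t_i}
print(p_fragment)
```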

Probability of aligning a fragment to a transcript

Pr \left\{ a \vert f_j, t_i, p, o, l \right\} = Pr \left\{ s_0 \right\} \prod_{k=1}^{\vert s \vert} Pr_{\mathcal{M}_k} \left\{ s_{k-1} \rightarrow s_k \vert f_j, t_i, p, o, l \right\}

Pr \left\{ s_0 \right\} represents the probability of the alignment starting at a particular position or state s_0

The probability of the fragment "transitioning to another state"

Essentially, how likely this fragment is to align somewhere else

For fun: Hidden Markov Models