aalexmmaldonado
Computational Biology
(BIOSC 1540)
Feb 20, 2025
Lecture 07B
Quantification
Methodology
Assignments
Quizzes
CBits
Transcriptome
Reads/Fragments
Given the sequencing reads that were sampled from these transcripts
How many copies of each transcript were in my original sample?
Unknown quantity
Experimental biases and errors
1. Estimate transcript abundance
2. Randomly sample n fragments
We iteratively optimize our transcript abundances until our generated reads look very similar to our observed reads
Our whole transcriptome
Individual transcripts
Transcript counts
So far, we have been talking about transcript fractions
We can also compute nucleotide fractions by weighting each transcript fraction by the transcript's effective length
This tells us how much of the total RNA pool comes from each transcript
I will explain the effective length later. For now, think of it as a "corrected" length
The transcript fraction tells us the proportion of total RNA molecules in the sample that come from transcript i
This gives the relative abundance of each transcript i
Adjusts for the fact that longer transcripts generate more reads
The transcript fraction normalizes nucleotide fraction by the effective length
TPM is "Transcripts per million"
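The relationship between nucleotide fractions, transcript fractions, and TPM can be sketched in a few lines. This is a minimal illustration, assuming the usual definitions (nucleotide fraction proportional to transcript fraction times effective length, and TPM equal to the transcript fraction scaled to one million). The function names, effective lengths, and fractions below are hypothetical and not taken from the lecture.

```python
def nucleotide_fractions(tau, eff_len):
    """eta_i = tau_i * len_i / sum_j(tau_j * len_j)."""
    weighted = [t * l for t, l in zip(tau, eff_len)]
    total = sum(weighted)
    return [w / total for w in weighted]


def transcript_fractions(eta, eff_len):
    """tau_i = (eta_i / len_i) / sum_j(eta_j / len_j)."""
    per_length = [e / l for e, l in zip(eta, eff_len)]
    total = sum(per_length)
    return [p / total for p in per_length]


def tpm(eta, eff_len):
    """Transcripts per million: transcript fractions scaled to sum to 1e6."""
    return [t * 1e6 for t in transcript_fractions(eta, eff_len)]


# Hypothetical example: three transcripts with made-up effective lengths.
eff_len = [1000.0, 2000.0, 500.0]
eta = [0.25, 0.50, 0.25]          # share of the RNA pool (nucleotides)
tau = transcript_fractions(eta, eff_len)
tpm_values = tpm(eta, eff_len)
```

In this made-up example the second transcript contributes half of the nucleotides but, because it is twice as long as the first, it ends up with the same transcript fraction as the first.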
Z is a binary matrix (i.e., all values are 0 or 1)
of M transcripts (rows) and N fragments (columns)
Z_ij = 1 if fragment j is assigned to transcript i
Transcript 1
Transcript 2
Transcript M
...
Fragment 1
Fragment 2
Fragment N
...
Suppose we have 3 transcripts and 12 fragments
Z is just how we computationally assign fragments to transcripts
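As a small sketch of the 3-transcript, 12-fragment case, the assignment matrix can be built from a list that records which transcript each fragment was assigned to. The specific assignments are hypothetical, and the helper name `assignment_matrix` is only for illustration.

```python
def assignment_matrix(assignments, n_transcripts):
    """Build Z with Z[i][j] = 1 if fragment j is assigned to transcript i."""
    z = [[0] * len(assignments) for _ in range(n_transcripts)]
    for j, i in enumerate(assignments):
        z[i][j] = 1
    return z


# Hypothetical assignment of 12 fragments to 3 transcripts (0-based indices).
assignments = [0, 0, 1, 2, 1, 1, 0, 2, 1, 1, 0, 1]
Z = assignment_matrix(assignments, 3)

# Row sums give the number of fragments assigned to each transcript.
counts = [sum(row) for row in Z]
```

Each column contains exactly one 1, because every fragment is assigned to exactly one transcript in this representation.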
Transcript-fragment assignment
Transcript abundance
N and M are the same as in the experiment
Given these inputs, generate a distribution of fragments
Run 1
Run 2
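One way to picture the generative step is to draw N fragment origins at random, with each transcript chosen in proportion to its abundance. The sketch below is a simplification that ignores fragment positions, lengths, and biases; the abundances and seeds are hypothetical.

```python
import random


def sample_fragments(eta, n_fragments, seed=None):
    """Draw the transcript of origin for each of n_fragments fragments."""
    rng = random.Random(seed)
    transcripts = list(range(len(eta)))
    return rng.choices(transcripts, weights=eta, k=n_fragments)


def counts_per_transcript(draws, n_transcripts):
    """Tally how many sampled fragments came from each transcript."""
    counts = [0] * n_transcripts
    for i in draws:
        counts[i] += 1
    return counts


eta = [0.5, 0.3, 0.2]  # hypothetical abundances for 3 transcripts
run1 = counts_per_transcript(sample_fragments(eta, 12, seed=1), 3)
run2 = counts_per_transcript(sample_fragments(eta, 12, seed=2), 3)
```

Two runs with the same abundances generally give different counts, which is why the comparison with observed reads has to be probabilistic.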
Known from organism and experiment
Which scenario is more likely, given our generative model?
We can use probabilistic methods to find parameters that explain our observed distribution
Given these parameters, how probable is it that our experiment generated these observed reads?
Optimize these values until we get the highest probability
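To make "which scenario is more likely" concrete, here is a minimal sketch that scores two candidate abundance vectors against observed counts. It assumes every fragment is unambiguously assigned and uses a multinomial log-likelihood without the constant term; the counts and scenarios are hypothetical.

```python
import math


def multinomial_log_likelihood(counts, eta):
    """Sum of count_i * log(eta_i); the multinomial constant is omitted."""
    return sum(c * math.log(e) for c, e in zip(counts, eta) if c > 0)


observed = [4, 6, 2]              # hypothetical fragments per transcript
scenario_a = [1 / 3, 1 / 2, 1 / 6]
scenario_b = [0.6, 0.2, 0.2]

ll_a = multinomial_log_likelihood(observed, scenario_a)
ll_b = multinomial_log_likelihood(observed, scenario_b)
more_likely = "A" if ll_a > ll_b else "B"
```

Scenario A matches the observed proportions exactly, so it scores higher here; optimization amounts to searching for the abundances with the highest score.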
Available transcripts
Transcript-fragment assignment
Transcript abundance
We can now compute the probability of observing:
Set of fragments
Given:
Transcript assignment
Transcript abundance
Transcriptome
Probability of observing fragment f_j
given that it comes from transcript t_i
This expression accounts for all possible transcripts a fragment might come from, weighted by how likely that fragment is to come from each transcript
Pr(f_j | t_i) is a conditional probability that depends on the position of the fragment within the transcript, the length of the fragment, and any technical biases
In Salmon’s quasi-mapping approach, this probability is approximated based on transcript compatibility rather than exact positions.
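The weighted sum over transcripts can be sketched as follows. This assumes the likelihood takes the form of a product over fragments of the sum over transcripts of abundance times conditional probability, which is how the description above reads. The matrix `cond_prob` and its values are hypothetical stand-ins for Pr(f_j | t_i).

```python
import math


def mixture_log_likelihood(eta, cond_prob):
    """Log-likelihood of all fragments under abundances eta.

    cond_prob[i][j] stands for Pr(f_j | t_i) and is 0 when fragment j
    is not compatible with transcript i.
    """
    n_fragments = len(cond_prob[0])
    total = 0.0
    for j in range(n_fragments):
        # Weight each transcript's conditional probability by its abundance.
        mix = sum(eta[i] * cond_prob[i][j] for i in range(len(eta)))
        total += math.log(mix)
    return total


# Hypothetical case: 2 transcripts, 3 fragments; fragment 2 is ambiguous.
cond_prob = [
    [0.01, 0.01, 0.00],
    [0.00, 0.01, 0.01],
]
eta = [0.5, 0.5]
loglik = mixture_log_likelihood(eta, cond_prob)
```

Working in log space avoids numerical underflow when the product runs over millions of fragments.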
A transcript’s effective length adjusts for the fact that fragments near the ends of a transcript are less likely to be sampled
Fragments from central regions are more likely to be of optimal length for sequencing reads
Fragments that include transcript ends might be too short
Mean of the truncated empirical fragment length distribution
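A minimal sketch of the effective-length idea, assuming the effective length is the transcript length minus the mean of the fragment length distribution truncated at the transcript length. That formula is an assumption drawn from the phrase above, not a statement of Salmon's exact implementation, and the fragment length counts are hypothetical.

```python
def truncated_mean(frag_len_counts, max_len):
    """Mean fragment length, ignoring fragments longer than max_len."""
    kept = {l: c for l, c in frag_len_counts.items() if l <= max_len}
    total = sum(kept.values())
    return sum(l * c for l, c in kept.items()) / total


def effective_length(transcript_len, frag_len_counts):
    """Transcript length minus the truncated mean fragment length."""
    return transcript_len - truncated_mean(frag_len_counts, transcript_len)


# Hypothetical empirical distribution: fragment length -> observed count.
frag_len_counts = {150: 1, 200: 2, 250: 1}

long_eff = effective_length(1000, frag_len_counts)
short_eff = effective_length(200, frag_len_counts)
```

For the short transcript, the 250-base fragments cannot occur, so they are dropped before the mean is taken. A transcript shorter than every observed fragment length would need separate handling, which this sketch omits.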
This two-phase approach balances speed (in the online phase) with accuracy (in the offline phase)
Salmon processes reads in two stages
Online phase
Makes fast, initial estimates of transcript abundances as the reads are processed
Offline phase
Refines these initial estimates using more complex optimization techniques
Inference refers to the process of estimating transcript abundances from observed RNA-seq reads using statistical models.
Quasi-mapping is a fast, lightweight technique used to associate RNA-seq fragments with possible transcripts
Alignment is expensive, so quasi-mapping stops after identifying seeds
Essentially early stopping of read mapping
Read mapping
[Figure: the hash h(k) of the k-mer GAT points to positions [7, 14] in the sequence CCGTATCGATTGCAGATG]
Identify seeds, then extend and compute base-by-base alignment
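The seed-finding step can be sketched with a plain dictionary that maps each k-mer to the positions where it occurs. This is a toy illustration of the idea using 0-based positions; it is not how Salmon's index is actually built, and the function names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_seeds(read, index, k):
    """Return (read_offset, reference_position) pairs for exact k-mer hits."""
    seeds = []
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            seeds.append((offset, pos))
    return seeds


reference = "CCGTATCGATTGCAGATG"
index = build_kmer_index(reference, 3)
gat_positions = index["GAT"]            # [7, 14]
seeds = find_seeds("GATT", index, 3)
```

Full read mapping would extend each seed and score a base-by-base alignment. Quasi-mapping skips that extension and works with the seed hits directly.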
This is what initializes compatible transcripts and abundance
Mini-batch 1
Mini-batch 2
Mini-batch 3
Take current parameters
Compute derivatives from batch
Update parameters (i.e., abundances)
Repeat for each batch
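The four steps above can be sketched as plain stochastic gradient ascent on the log-likelihood, with abundances parameterized through a softmax so they stay positive and sum to one. This is an illustrative stand-in; Salmon's own online update rule is more elaborate and is not reproduced here. The batches, learning rate, and function names are hypothetical.

```python
import math


def softmax(theta):
    """Turn unconstrained parameters into abundances that sum to 1."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    s = sum(exps)
    return [e / s for e in exps]


def batch_gradient(theta, batch):
    """Gradient of the batch log-likelihood with respect to theta.

    Each fragment is a list of compatible transcript indices.
    """
    eta = softmax(theta)
    grad = [0.0] * len(theta)
    for compatible in batch:
        norm = sum(eta[i] for i in compatible)
        for k in range(len(theta)):
            resp = eta[k] / norm if k in compatible else 0.0
            grad[k] += resp - eta[k]
    return grad


def mini_batch_ascent(batches, n_transcripts, learning_rate=0.1):
    theta = [0.0] * n_transcripts              # take current parameters
    for batch in batches:                      # repeat for each batch
        grad = batch_gradient(theta, batch)    # compute derivatives
        theta = [t + learning_rate * g for t, g in zip(theta, grad)]  # update
    return softmax(theta)


# Hypothetical mini-batches of fragments and their compatible transcripts.
batches = [
    [[0], [0], [0, 1]],
    [[0], [1], [0, 1]],
    [[0], [0], [2]],
]
eta = mini_batch_ascent(batches, 3)
```

Because each batch is small, every update is cheap, and the estimates improve while reads are still being processed.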
After the online phase, Salmon refines the estimates using a more complex optimization method, typically based on the Expectation-Maximization (EM) algorithm
This phase ensures the accuracy of abundance estimates, incorporating the bias corrections learned during the online phase
This is the probability of observing the entire set of fragments F, given the transcriptome T and nucleotide fractions η
The goal is to maximize this likelihood to infer the most likely values of η, which correspond to the relative abundances of the transcripts
The likelihood function is central to the inference process in Salmon:
Optimize the estimates of α, a vector of the estimated number of reads originating from each transcript
The goal of maximum likelihood is to find the parameters (transcript abundances) that maximize the probability of the observed data (sequenced reads)
Given α, η can be directly computed.
The EM algorithm works by breaking down a difficult problem into two simpler problems:
At each iteration, the likelihood of the observed data increases, and the EM algorithm iteratively refines the transcript abundance estimates until it reaches a maximum
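A minimal EM sketch for this setting is shown below. It assumes each fragment comes with a list of compatible transcripts and treats all compatible transcripts as equally probable sources apart from their abundance, so it ignores effective lengths and conditional fragment probabilities. The data and function name are hypothetical.

```python
import math


def em(compat, n_transcripts, n_iter=50):
    """Estimate reads per transcript (alpha) from compatibility lists.

    compat[j] lists the transcript indices fragment j is compatible with.
    Returns alpha and the log-likelihood recorded at each iteration.
    """
    alpha = [len(compat) / n_transcripts] * n_transcripts
    history = []
    for _ in range(n_iter):
        total = sum(alpha)
        eta = [a / total for a in alpha]   # given alpha, eta follows directly
        # E-step: split each fragment across its compatible transcripts.
        new_alpha = [0.0] * n_transcripts
        loglik = 0.0
        for frag in compat:
            norm = sum(eta[i] for i in frag)
            loglik += math.log(norm)
            for i in frag:
                new_alpha[i] += eta[i] / norm
        history.append(loglik)
        # M-step: the expected counts become the new estimates.
        alpha = new_alpha
    return alpha, history


# Hypothetical fragments: some map uniquely, some are ambiguous.
compat = [[0], [0], [0, 1], [1], [0, 1], [2], [1, 2], [0]]
alpha, history = em(compat, 3)
```

The two simpler problems are visible in the loop: the E-step assigns fragments fractionally under the current abundances, and the M-step re-estimates the abundances from those fractional assignments.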
Lecture 08A:
Differential gene expression -
Foundations
Lecture 07B:
Quantification -
Methodology
Today
Tuesday
For the first 5,000,000 observations we learn these defined bias parameters
Probability of generating fragment j from transcript i
Probability of drawing a fragment of the inferred length given t
Probability of fragment starting at position p on t
Probability of obtaining a fragment with the given orientation
Probability of alignment of fragment j given these mapping and transcript conditions
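Read together, the four terms above suggest that the probability of generating fragment j from transcript i is their product. The sketch below assumes that factorization and a uniform start-position probability in the absence of bias. All numeric values and function names are hypothetical.

```python
def uniform_start_probability(transcript_len, frag_len):
    """Probability of one start position if all valid starts are equally likely."""
    n_starts = transcript_len - frag_len + 1
    return 1.0 / n_starts if n_starts > 0 else 0.0


def fragment_probability(p_length, p_position, p_orientation, p_alignment):
    """Product of the length, position, orientation, and alignment terms."""
    return p_length * p_position * p_orientation * p_alignment


# Hypothetical values for a 200-base fragment on a 1000-base transcript.
p_pos = uniform_start_probability(1000, 200)
p_frag = fragment_probability(
    p_length=0.02,       # chance of drawing a fragment of this length
    p_position=p_pos,    # chance of starting at this position
    p_orientation=0.5,   # chance of this orientation
    p_alignment=0.9,     # chance of this alignment given the rest
)
```

A fragment longer than the transcript has no valid start position, so its probability under that transcript is zero.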
Represents the probability of the alignment starting at a particular position or state s_o
The probability of the fragment "transitioning to another state"
Essentially, how likely is this fragment to align somewhere else?
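The start-state and transition terms can be combined in a Markov-chain style product: the probability of the first state times the probability of each subsequent transition. The sketch below assumes that form. The two states ("M" for match, "X" for mismatch) and all probabilities are hypothetical.

```python
def alignment_probability(states, start_prob, trans_prob):
    """Probability of a state sequence: start term times each transition."""
    prob = start_prob[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= trans_prob[(prev, cur)]
    return prob


# Hypothetical two-state model.
start_prob = {"M": 0.9, "X": 0.1}
trans_prob = {
    ("M", "M"): 0.95,
    ("M", "X"): 0.05,
    ("X", "M"): 0.90,
    ("X", "X"): 0.10,
}

p_align = alignment_probability("MMXM", start_prob, trans_prob)
```

An alignment with few mismatches scores higher than one with many, because match-to-match transitions carry most of the probability in this toy model.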
For fun: Hidden Markov Models