Computational Biology
(BIOSC 1540)
Sep 24, 2024
Lecture 09:
Quantification
1. Discuss the importance of normalization and quantification in RNA-seq data analysis.
2. Explain the relevance of pseudoalignment instead of read mapping.
3. Understand the purpose of Salmon's generative model.
4. Describe how salmon handles experimental biases in transcriptomics data.
5. Communicate the principles of inference in Salmon.
Suppose we have isolated a normal and cancerous cell
Cancerous
Normal
We want to identify possible drug targets based on overexpressed genes
We will use transcriptomics!
Let's simplify our problem to only three transcripts
These represent the only mRNA transcripts we will find in our cells (i.e., the transcriptome)
They have short, medium, and long lengths
We have the following transcript distribution
Normal
Cancerous
More transcripts?
Relative to others?
2
3
4
We can use transcripts fraction
Normal
Cancerous
Because the cancer cell is transcribing more overall, we still get changes across the board
Real data has more than three transcripts and ratios are substantially smaller (e.g., 0.000001)
This can make computations and communications challenging, so we often scale everything to a million to use unsigned integers
Transcript | Normal | Cancerous |
---|---|---|
1 | 222,222 | 266,666 |
2 | 333,333 | 400,000 |
3 | 444,444 | 333,333 |
Total | 1,000,000 | 1,000,000 |
Small floats require high precision (i.e., float64) and thus memory
Longer transcripts will have more reads
Read per kilobase (RPK) corrects this experimental bias through normalization by gene length
(Length is usually just the exons)
1. Discuss the importance of normalization and quantification in RNA-seq data analysis.
2. Explain the relevance of pseudoalignment instead of read mapping.
3. Understand the purpose of Salmon's generative model.
4. Describe how salmon handles experimental biases in transcriptomics data.
5. Communicate the principles of inference in Salmon.
Transcriptome
Reads/Fragments
We assign each read to single transcript using our read mapping algorithms
Once aligned, we can count the number of mapped reads to each transcript
1
2
4
Maximum Mappable Prefix (MMP) approach for fast, accurate spliced alignments
Finds prefix that perfectly matches reference then repeats for unmatched regions
This automatically detects junctions instead of relying on databases
Suppose someone took library books (transcripts) and then shredded them (reads)
In the context of our analogy, we not only need to find the book but which page it was from
This takes a long time
Alignment-based methods need to determine the read's exact position in the transcript
Identifies which transcripts are compatible with the read, skipping the precise location step
Analogy: Just find books that are compatible and don't worry about which page
It does not worry about where within that transcript it originated
Alignment
Pseudoalignment
Specifies where exactly in the transcript this read came from
(e.g., at position 478)
Specifies that it came somewhere from this transcript (i.e., compatible)
Pseudoalignment: This method, used by tools like Kallisto, skips the full alignment process. Instead of mapping each read to a specific position, pseudoalignment identifies which transcripts are compatible with a given read
1. Discuss the importance of normalization and quantification in RNA-seq data analysis.
2. Explain the relevance of pseudoalignment instead of read mapping.
3. Understand the purpose of Salmon's generative model.
4. Describe how salmon handles experimental biases in transcriptomics data.
5. Communicate the principles of inference in Salmon.
Initial sample
We have to use reads to quantity our initial sample
Has some number of transcripts
Fragments
Reads
After PCR amplification and fragmentation
Sequencing with imperfections
Generative model: A statistical model that explains how the observed data are generated from the underlying system
Defines a computational framework that produces sequencing reads from a population of transcripts
First, we have to define our model
Our whole transcriptome
Individual transcripts
Transcript counts
So far, we have been talking about transcript fractions
We can also take nucleotide fractions by taking into account the effective length of each transcript
This tells us how much of the total RNA pool comes from each transcript
I will explain the effective length later. For now, think of it as a "corrected" length
The transcript fraction tells us the proportion of total RNA molecules in the sample that come from transcript i
This gives the relative abundance of each transcript i
Adjusts for the fact that longer transcripts generate more reads
The transcript fraction normalizes nucleotide fraction by the effective length
TPM is "Transcripts per million"
is a binary matrix (i.e., all values are 0 or 1)
of M transcripts (rows) and N fragments (columns)
if fragment j is assigned to transcript i
Transcript 1
Transcript 2
Transcript M
...
Fragment 1
Fragment 2
Fragment N
...
Suppose we have 3 transcripts and 12 fragments
Z is just how we computationally assign fragments to transcripts
Transcript-fragment assignment
Transcript abundance
N and M are same as experiment
Given these inputs, generate a distribution of fragments
Run 1
Run 2
Known from organism and experiment
Which scenario is more likely, given our generative model?
We can use probabilistic methods to find parameters that explain our observed distirbution
This reads, "What is the probability of a occurring if b is true?"
Given this Radar, what is the probability of Rain in Oakland?
=
Given these parameters, how probable is it that our experiment generated these observed reads?
Optimize these values until we get the highest probability
Available transcripts
Transcript-fragment assignment
Transcript abundance
1. Discuss the importance of normalization and quantification in RNA-seq data analysis.
2. Explain the relevance of pseudoalignment instead of read mapping.
3. Understand the purpose of Salmon's generative model.
4. Describe how salmon handles experimental biases in transcriptomics data.
5. Communicate the principles of inference in Salmon.
We can now compute the probability of observing:
Set of fragments
Given:
Transcript assignment
Transcript abundance
Probability of observing fragment
given that it comes from transcript
This expression accounts for all possible transcripts a fragment might come from, weighted by how likely that fragment is to come from each transcript
Transcriptome
is a conditional probability that depends on the position of the fragment within the transcript, the length of the fragment, and any technical biases
In Salmon’s quasi-mapping approach, this probability is approximated based on transcript compatibility rather than exact positions.
A transcript’s effective length adjusts for the fact that fragments near the ends of a transcript are less likely to be sampled
Fragments from central regions are more likely to be of optimal length for sequencing reads
Fragments that include transcript ends might be too short
Mean of the truncated empirical fragment length distribution
1. Discuss the importance of normalization and quantification in RNA-seq data analysis.
2. Explain the relevance of pseudoalignment instead of read mapping.
3. Understand the purpose of Salmon's generative model.
4. Describe how salmon handles experimental biases in transcriptomics data.
5. Communicate the principles of inference in Salmon.
Inference refers to the process of estimating transcript abundances from observed RNA-seq reads using statistical models.
This two-phase approach balances speed (in the online phase) with accuracy (in the offline phase)
Salmon processes reads in two stages
Online phase
Makes fast, initial estimates of transcript abundances as the reads are processed
Offline phase
Refines these initial estimates using more complex optimization techniques
Quasi-mapping is A fast, lightweight technique used to associate RNA-seq fragments with possible transcripts
Alignment is expensive, so quasi-mapping stops after identify seeds
Essentially early stopping of read mapping
Read mapping
GAT
[7, 14]
h(k)
CCGTATCGATTGCAGATG
Identify seeds, then extend and compute base-by-base alignment
This is what initializes compatible transcripts and abundance
Mini-batch 1
Mini-batch 2
Mini-batch 3
Take current parameters
Compute derivatives from batch
Update parameters (i.e., abundances)
Repeat for each batch
After the online phase, Salmon refines the estimates using a more complex optimization method, typically based on the Expectation-Maximization (EM) algorithm
This phase ensures the accuracy of abundance estimates, incorporating the bias corrections learned during the online phase
This is the probability of observing the entire set of fragments FFF, given the transcriptome TTT and nucleotide fractions η\etaη
The goal is to maximize this likelihood to infer the most likely values of η\etaη, which correspond to the relative abundances of the transcripts
The likelihood function is central to the inference process in Salmon:
Optimize the estimates of α, a vector of the estimated number of reads originating from each transcript
The goal of maximum likelihood is to find the parameters (transcript abundances) that maximize the probability of the observed data (sequenced reads)
Optimize the estimates of α, a vector of the estimated number of reads originating from each transcript
Given α, η can be directly computed.
The likelihood function is central to the inference process in Salmon:
The EM algorithm works by breaking down a difficult problem into two simpler problems:
At each iteration, the likelihood of the observed data increases, and the EM algorithm iteratively refines the transcript abundance estimates until it reaches a maximum
Lecture 10:
Differential gene expression
Lecture 09:
Quantification
Today
Thursday
For the first 5,000,000 observations we learn these defined bias parameters
Probability of generating fragment j from transcript i
Probability of drawing a fragment of the inferred length given t
Probability of fragment starting at position p on t
Probability of obtaining a fragment with the given orientation
Probability of alignment of fragment j given these mapping and transcript conditions
Represents the probability of the alignment starting at a particular position or state sos_o
The probability of the fragment "transitioning to another state"
Essentially, how likely does this fragment align somewhere else
For fun: Hidden Markov Models