

Computational Biology
(BIOSC 1540)
Feb 25, 2025
Lecture 08A
Differential gene expression
Foundations
Announcements
- César optional Python recitations are on Fridays from 2 - 3 pm in L1 Clapp Hall
- Please fill out the Canvas discussion for CBit 07
Assignments
Quizzes
CBits
After today, you should have a better understanding of
Hypothesis testing for comparing gene expression
Let's remember the big picture: We want to quantify differences in gene expression
Differential gene expression quantifies changes in gene expression levels between different sample groups or conditions
We have been focused on quantifying gene expression in quantities like Transcripts Per Million (TPM)
Samples
Normal
Cancerous
We cannot rely on simple comparisons when analyzing gene expression
We could technically directly compare means between our different conditions
However, biological data are inherently noisy, and observed differences may arise by chance
Examples of experimental biases (besides sample variation)
Sequencing depth: Higher depth could appear as higher expression levels simply due to having more data
Batch effects: Processing sampling with different equipment, reagents, times, etc. can show systematic differences
Normal
Cancerous
TPM
Samples
We need approaches that address these sources of variation and noise
Normal
Cancerous
TPM
Samples
Statistical models can account for variability and separate signal from noise
Hypothesis testing between statistical models provides a quantitative way to compare conditions
Hypothesis testing in RNA-seq data
After fitting a statistical model, we need to perform hypothesis testing to see if the difference in expression between conditions is statistically significant
Gene expression
Normal
Cancerous
We reject the null hypothesis when our statistical test demonstrates that the observed difference, if any, is unlikely to have happened by random chance
Null Hypothesis (H₀): There is no difference in gene expression between the two conditions
Alternative Hypothesis (H₁): There is a significant difference in gene expression between the conditions
We have two hypotheses:
The p-value is the probability of the null hypothesis being true
What is the probability that any difference is either (1) nonexistent or (2) due to random chance
Probability value (p-value):
The higher the p-value, the more our model supports the null hypothesis
The lower the p-value, the more our model supports the alternative hypothesis
Gene expression
Normal
Cancerous
Gene expression
Normal
Cancerous
After today, you should have a better understanding of
Reliable statistical models for gene expression data
Binomial distribution
To compute probabilities under H₀, we need a model that describes expected variation
A statistical model describes how data is expected to behave if H₀ is true.
For example, a fair coin flip should result in a normal distribution centered on 50% of each side

This is our statistical model that describes our coin flip observations under H0
We are probably flipping a weighted coin because our observations do not match our H0 statistical model
If we flip a coin 10 million times and our distribution looks like
this
Gene expression data have unique challenges that require specific statistical models
The nature of count data
RNA-seq generates count data – the number of RNA fragments that map to each gene
Gene expression
Normal
Cancerous
Example: 573,282 TPM
Discrete data requires us to use special statistical models
- Data that can only take whole numbers
- In RNA-seq, we measure the number of transcripts, so the data are count-based
- For example, you cannot have "half a transcript"
What is discrete data?
Binomial: A Simple Model for Discrete Counts
The Binomial distribution models the number of successes in a fixed number of independent trials, where each trial has the same probability of success

Number of trials
Number of successes
Probability
Probability of success
RNA-seq analogy: Each read can be considered a "trial," and the probability that a read maps to a specific gene is the "probability of success."
After today, you should have a better understanding of
Reliable statistical models for gene expression data
Poisson distribution
Challenge #1: The binomial distribution assumes that the probability of success (p) is the same for every trial
For example, if I have 10 samples from cancerous cells, the binomial distribution assumes they are perfect replicates with no biases
Ignoring sample-to-sample variability can lead to underestimating the true uncertainty in the data
Number of trials
Number of successes
Probability
Probability of success
Challenge #2: High sequencing depth results in an extremely large number of trials, posing both computational and modeling challenges
When sequencing depth is high, n (the total number of reads) becomes very large
Factorials when n is large makes accurate calculations impractical
Challenge #3: For many genes, the probability of expression (p) is extremely low, further complicating the use of the binomial distribution
With very low p, the expected number of successes (reads mapping to a lowly expressed gene) is minuscule compared to n
Calculations with very small probabilities may lead to numerical underflow/imprecise results
Poisson distribution: A tractable model for large discrete counts
The Poisson distribution is a statistical tool used to model the number of events (i.e., counts) that happen in a fixed period of time or space, where:
- The events are independent of each other
- Each event has a constant average rate (i.e., allows variation between events)
Expected average of X
Number of events or counts
Probability

Assuming the constant average rate of success allows some variation around the mean
I.e., sample variation and batch effects
After today, you should have a better understanding of
Reliable statistical models for gene expression data
Negative binomial distribution
Poisson distribution assumes mean and variance are equal
The expected value (i.e., mean)
When k = 0, the term is zero
Use
You don't need to understand these derivations—just the outcome
Poisson distribution assumes mean and variance are equal
When k = 0 or 1, the term is zero
You don't need to understand these derivations—just the outcome
Use
If our variance is different from our mean, our Poisson model breaks down
Parity plots with mean and variance show deviations with Poisson distributions
Higher counts typically have a larger variance

Count mean
Count variance
Mean = variance line
Overdispersion in RNA-Seq
Overdispersion: It happens when the variance in the data is larger than what is predicted by simpler models (e.g., Poisson distribution)
- Expected variance for Poisson-distributed data equals the mean: Variance=μ\text{Variance} = \muVariance=μ
- Variance is often larger than the mean for RNA-Seq: Variance>μ\text{Variance} > \muVariance>μ

Overdispersion may reflect biological variability between samples not captured by the experimental conditions
- Differences in RNA quality
- sequencing depth,
- biological factors like different cell types within the same tissue
Poisson distribution is unsuitable for RNA-seq data because of high noise


Negative Binomial distribution accounts for high dispersion
Observed number of counts
Mean or expected value of counts
Dispersion parameter, controlling how much the variance exceeds the mean
Gamma function, which generalizes the factorial to floats
If α=0\alpha = 0α=0, the Negative Binomial distribution reduces to the Poisson distribution
The challenge of zeros in RNA-seq data
RNA-seq data frequently contains zero counts for some genes because not all genes are expressed under all conditions
Most statistical models account for variance, but not that zeros can dominate counts
For example, if we have a high expected mean with Poisson distribution we can still have zeros or very low counts

In these circumstances, we have to use zero-inflated models
We will ignore these for now
After today, you should have a better understanding of
Fitting statistical models
Likelihood quantifies the probability of the observed data given a model
A higher product (or joint likelihood) means the model assigns a higher probability to the observed data, indicating a better fit.
The likelihood of model parameters θ\thetaθ given data y\mathbf{y}y is defined as
When individual data points y1,y2,…,yny_1, y_2, \dots, y_ny1,y2,…,yn are independent, the joint probability is calculated by multiplying their individual probabilities:
Multiplying these probabilities aggregates the evidence from each data point, providing a comprehensive measure of how well the model with parameter θ\thetaθ fits all the data
The log transformation simplifies computation and interpretation
logL(θ)=∑i=1nlogP(yi∣θ)\log L(\theta) = \sum_{i=1}^n \log P(y_i|\theta)
Makes differentiation easier for optimization
Log likelihood
logL(θ)=∑i=1nlogP(yi∣θ)\log L(\theta) = \sum_{i=1}^n \log P(y_i|\theta)
Converts products into sums, reducing computational issues.
Maximum Likelihood Estimation (MLE) finds the parameters that maximize the log likelihood
At the optimum, the model parameters provide the best explanation of the observed data.
Optimization problem

Bad fit
Good fit
After today, you should have a better understanding of
Statistical tests
Likelihood Ratio Test
To compute a p-value, a likelihood ratio test (LRT) can be usedβ^1\hat{\beta}_1
The idea is to compare the likelihood of the data under
- the null model (same expression in both conditions)
- the alternative model (different expression levels in each condition)β^1\hat{\beta}_1
Log-Likelihood of Negative Binomial
For each condition, you compute the log-likelihoods:
LRT Statistic
The LRT statistic is:β^1\hat{\beta}_1
The log-likelihood under the null hypothesis (assuming a common mean μ0\mu_0μ0 for both conditions)
The LRT statistic approximately follows a chi-squared distribution with 1 degree of freedom under the null hypothesis
The p-value is computed as:

k would be 1
Before the next class, you should
Lecture 08B:
Differential gene expression -
Methodology
Lecture 08A:
Differential gene expression -
Foundations
Today
Thursday
Example: Triple-negative breast breast cancer
Objective: Identify genes differentially expressed between triple-negative breast cancer (TNBC) and hormone receptor-positive breast cancer

Differential gene expression provides statistical tools to identify changes between samples
Findings: TNBC shows upregulation of genes involved in cell proliferation and metastasis.
Implications:
- Targets for specific therapies.
- Improved classification and prognosis of breast cancer subtypes.
Wald’s Test for Gene Expression Differences
-
Wald’s Test: A statistical test that helps us determine whether the estimated log fold change between two conditions is significantly different from zero.
-
Null Hypothesis (H₀): The log fold change between conditions is zero (no difference in expression between the conditions).
- Log Fold Change (β₁) = 0 means that the gene is expressed at the same level in both conditions.
-
Alternative Hypothesis (H₁): The log fold change between conditions is not zero (there is a difference in expression).
Estimate Parameters from the Negative Binomial Model
For each gene, the Negative Binomial model gives us an estimated log fold changeβ^1\hat{\beta}_1
It also gives us a standard error (SE) for this estimate, which tells us how uncertain we are about the estimate of log fold changeβ^1\hat{\beta}_
The Wald statistic is calculated as
This statistic tells us how many standard deviations the estimated log fold change is away from zero (no difference)
Log Fold Change
- Positive Log Fold Change: Indicates higher expression in the condition of interest (e.g., diseased).
- Negative Log Fold Change: Indicates lower expression in the condition of interest.
- Log Fold Change of Zero: Means no difference between conditions.