Computational Biology
(BIOSC 1540)
Sep 26, 2024
Lecture 10:
Differential gene expression
Differential Gene Expression (DGE): The process of identifying and quantifying changes in gene expression levels between different sample groups or conditions
Objective: Identify genes differentially expressed between triple-negative breast cancer (TNBC) and hormone receptor-positive breast cancer
Findings:
Implications:
Differential gene expression analysis provides statistical tools to identify changes in expression between samples
A statistical model is a mathematical tool that describes how data are generated
[Figure: gene expression in normal vs. cancerous samples]
It helps us answer:
Statistical models help us make sense of complex data by identifying patterns and determining whether differences are meaningful or just due to chance
After fitting a statistical model, we need to perform hypothesis testing to see if the difference in expression between conditions is statistically significant
We have two hypotheses:
Null Hypothesis (H₀): There is no difference in gene expression between the two conditions
Alternative Hypothesis (H₁): There is a difference in gene expression between the two conditions
[Figure: gene expression in normal vs. cancerous samples]
We reject the null hypothesis when our statistical test demonstrates that the observed difference, if any, is unlikely to have happened by random chance
Probability value (p-value): The probability of observing a difference at least as large as the one in our data if the null hypothesis were true (i.e., if the difference were only due to random chance, or "getting lucky")
The higher the p-value, the more consistent our data are with the null hypothesis
The lower the p-value, the stronger the evidence for the alternative hypothesis
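One way to make the idea of a p-value concrete is an exact permutation test. This is a swapped-in illustration, not the model-based test used later in this lecture, and the counts below are hypothetical. The sketch relabels the samples in every possible way and asks how often a relabeling gives a difference at least as large as the observed one.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    # Enumerate every way to pick which samples are labeled "group A".
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so floating-point ties still count as "as extreme".
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical expression values for one gene (not real data).
normal = [10, 12, 11, 9]
cancerous = [20, 22, 19, 21]
print(permutation_p_value(normal, cancerous))
```

With only four samples per group there are 70 possible relabelings, so the smallest achievable p-value is limited by the sample size.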
[Figures: gene expression in normal vs. cancerous samples (two panels)]
Ensures that we are not biasing our data or our interpretation
RNA-seq generates count data – the number of RNA fragments that map to each gene
[Figure: gene expression in normal vs. cancerous samples]
Example: 573,282 TPM
Discrete data requires us to use special statistical tools
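What is discrete data: Values that can only take distinct, countable values (e.g., 0, 1, 2, ... reads), with nothing in between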
For example, you cannot use a normal distribution because it requires continuous data
The Binomial distribution models the number of successes in a fixed number of independent trials, where each trial has the same probability of success
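$$P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}$$
where $n$ is the number of trials, $k$ is the number of successes, $p$ is the probability of success, and $P(X = k)$ is the probability of observing exactly $k$ successes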
RNA-seq analogy: Each read can be considered a "trial," and the probability that a read maps to a specific gene is the "probability of success."
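A minimal sketch of the binomial probability for the read-mapping analogy. The number of reads and the mapping probability are made-up toy values, and `binomial_pmf` is a hypothetical helper name.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Toy assumption: 20 reads, each with a 10% chance of mapping to the gene.
n_reads, p_gene = 20, 0.10
for k in range(5):
    print(k, round(binomial_pmf(k, n_reads, p_gene), 4))
```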
The Poisson distribution simplifies computation and allows for varying probabilities
Binomial calculations with a low p and a high n are computationally demanding
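The Poisson distribution is a statistical tool used to model the number of events (or counts) that happen in a fixed period of time or space:
$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$
where $\lambda$ is the expected average of $X$, $k$ is the number of events or counts, and $P(X = k)$ is the probability of observing exactly $k$ events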
Provides an accurate distribution of counts if your mean and variance are approximately equal
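The sketch below compares the two distributions for a large number of trials and a small success probability, then checks numerically that the Poisson mean and variance agree. The values of `n` and `p` are arbitrary assumptions chosen for illustration.

```python
from math import comb, exp, lgamma, log


def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    # Evaluated in log space so large k does not overflow.
    return exp(k * log(lam) - lam - lgamma(k + 1))


# Assumed toy setting: one million reads, tiny per-read mapping probability.
n, p = 1_000_000, 5e-6
lam = n * p  # expected count

max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(21))
mean = sum(k * poisson_pmf(k, lam) for k in range(200))
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in range(200))
print(f"largest PMF difference: {max_gap:.2e}")
print(f"Poisson mean: {mean:.4f}, variance: {var:.4f}")
```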
RNA-seq data are noisy (i.e., high variance) and incompatible with Poisson distribution
Higher counts typically have a larger variance
[Figure: count variance plotted against count mean, with the mean = variance line for reference]
Overdispersion: The variance in the data is larger than what simpler models (e.g., the Poisson distribution) predict
Overdispersion may reflect biological variability between samples not captured by the experimental conditions
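A quick way to spot overdispersion is to compare the variance and the mean of replicate counts for each gene. The gene names and counts below are hypothetical. A variance/mean ratio well above 1 suggests more spread than a Poisson model would predict.

```python
from statistics import mean, variance

# Hypothetical replicate counts for three genes (not real data).
counts = {
    "gene_low": [3, 5, 4, 6, 2],
    "gene_mid": [48, 75, 52, 91, 60],
    "gene_high": [950, 1400, 1100, 1800, 1250],
}


def dispersion_summary(values):
    """Return (mean, sample variance, variance/mean ratio)."""
    m = mean(values)
    v = variance(values)  # sample variance, n - 1 in the denominator
    return m, v, v / m


for gene, values in counts.items():
    m, v, ratio = dispersion_summary(values)
    print(f"{gene}: mean={m:.1f} variance={v:.1f} variance/mean={ratio:.2f}")
```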
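The Negative Binomial distribution adds a dispersion parameter to model this extra variance:
$$P(X = k) = \frac{\Gamma(k + \alpha^{-1})}{\Gamma(\alpha^{-1})\, k!} \left(\frac{1}{1 + \alpha\mu}\right)^{\alpha^{-1}} \left(\frac{\alpha\mu}{1 + \alpha\mu}\right)^{k}$$
where $k$ is the observed number of counts, $\mu$ is the mean or expected value of counts, $\alpha$ is the dispersion parameter, controlling how much the variance ($\mu + \alpha\mu^2$) exceeds the mean, and $\Gamma$ is the gamma function, which generalizes the factorial to non-integer values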
In the limit $\alpha \to 0$, the Negative Binomial distribution reduces to the Poisson distribution
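The sketch below evaluates a Negative Binomial probability under the common mean-dispersion parameterization (an assumption here: variance equal to mean plus dispersion times mean squared) and shows it approaching the Poisson probability as the dispersion shrinks. Function names are hypothetical.

```python
from math import exp, lgamma, log


def nb_pmf(k, mu, alpha):
    """Negative Binomial PMF with mean mu and dispersion alpha > 0."""
    r = 1.0 / alpha
    log_p = (
        lgamma(k + r) - lgamma(r) - lgamma(k + 1)
        + r * log(r / (r + mu))
        + k * log(mu / (r + mu))
    )
    return exp(log_p)


def poisson_pmf(k, mu):
    return exp(k * log(mu) - mu - lgamma(k + 1))


def moments(pmf, upper=3000):
    """Numerical mean and variance from summing the PMF up to `upper`."""
    mean = sum(k * pmf(k) for k in range(upper))
    var = sum((k - mean) ** 2 * pmf(k) for k in range(upper))
    return mean, var


mu = 10.0
for alpha in (1.0, 0.1, 1e-6):
    print(alpha, nb_pmf(10, mu, alpha), poisson_pmf(10, mu))
print(moments(lambda k: nb_pmf(k, mu, 0.5)))  # variance should be near mu + 0.5 * mu**2
```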
RNA-seq data frequently contains zero counts for some genes because not all genes are expressed under all conditions
Most statistical models account for variance, but not that zeros can dominate counts
For example, even when a Poisson distribution predicts a high expected mean, the observed data can still contain zeros or very low counts
In these circumstances, we have to use zero-inflated models
We will ignore these for now
RNA-seq data is messy: counts vary, there are lots of zeros, and data doesn’t follow simple patterns
We need models to account for this complexity and figure out which genes are differentially expressed in a meaningful way
A statistical model predicts each sample's count data (the number of reads mapping to each gene)
Maximum likelihood estimation (MLE) finds the model parameters that make the observed counts most likely
It does this by adjusting the parameters until the model assigns the highest possible probability to the actual counts (i.e., the predicted counts match the observed counts as closely as possible)
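A minimal sketch of maximum likelihood estimation by brute force: try many candidate means and keep the one with the highest Negative Binomial log-likelihood. The counts, the fixed dispersion value, and the grid are assumptions for illustration. Real tools use faster numerical optimizers and also estimate the dispersion.

```python
from math import lgamma, log


def nb_log_likelihood(counts, mu, alpha):
    """Sum of Negative Binomial log-probabilities for all observed counts."""
    r = 1.0 / alpha
    return sum(
        lgamma(k + r) - lgamma(r) - lgamma(k + 1)
        + r * log(r / (r + mu))
        + k * log(mu / (r + mu))
        for k in counts
    )


# Hypothetical counts for one gene across five samples.
counts = [12, 18, 9, 25, 16]
alpha = 0.1  # dispersion treated as known for this sketch

# Candidate means 0.1, 0.2, ..., 100.0
candidates = [i / 10 for i in range(1, 1001)]
best_mu = max(candidates, key=lambda mu: nb_log_likelihood(counts, mu, alpha))
print("MLE of the mean:", best_mu)
print("sample mean:", sum(counts) / len(counts))
```

With the dispersion held fixed, the estimate lands on the sample mean, which is a useful sanity check on the search.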
Wald’s Test: A statistical test that helps us determine whether the estimated log fold change between two conditions is significantly different from zero.
Null Hypothesis (H₀): The log fold change between conditions is zero (no difference in expression between the conditions).
Alternative Hypothesis (H₁): The log fold change between conditions is not zero (there is a difference in expression).
For each gene, the Negative Binomial model gives us an estimated log fold change $\hat{\beta}_1$
It also gives us a standard error (SE) for this estimate, which tells us how uncertain we are about the estimate of the log fold change $\hat{\beta}_1$
The Wald statistic is calculated as
$$W = \frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)}$$
This statistic tells us how many standard errors the estimated log fold change is away from zero (no difference)
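A short sketch of the Wald calculation: divide the estimate by its standard error, then convert to a two-sided p-value. Comparing the statistic to a standard normal distribution is an assumption stated here, and the input numbers are hypothetical.

```python
from math import erfc, sqrt


def wald_test(beta_hat, se):
    """Return the Wald statistic and its two-sided p-value."""
    w = beta_hat / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|w|)).
    p_value = erfc(abs(w) / sqrt(2))
    return w, p_value


# Hypothetical estimate: log fold change of 1.5 with standard error 0.5.
w, p = wald_test(1.5, 0.5)
print(f"W = {w:.2f}, p = {p:.4f}")
```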
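To compute a p-value, a likelihood ratio test (LRT) can be used
The idea is to compare the likelihood of the data under the null hypothesis (one common mean for both conditions) with its likelihood under the alternative hypothesis (a separate mean for each condition)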
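Log-Likelihood of Negative Binomial
For each condition, you compute the log-likelihoods $\ell_A(\hat{\mu}_A)$ and $\ell_B(\hat{\mu}_B)$, each evaluated at that condition's own estimated mean
The log-likelihood under the null hypothesis, $\ell_0(\hat{\mu}_0)$, assumes a common mean $\mu_0$ for both conditions
The LRT statistic is:
$$\mathrm{LRT} = 2\left[\ell_A(\hat{\mu}_A) + \ell_B(\hat{\mu}_B) - \ell_0(\hat{\mu}_0)\right]$$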
The LRT statistic approximately follows a chi-squared distribution with 1 degree of freedom under the null hypothesis
The p-value is computed as:
$$p = P\left(\chi^2_k \geq \mathrm{LRT}\right)$$
where $k$ is the number of degrees of freedom (here, $k = 1$)
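The likelihood ratio test can be sketched for a single gene as follows. This assumes a known dispersion, no normalization between samples, and group means estimated by the sample means; the counts and function names are hypothetical.

```python
from math import erfc, lgamma, log, sqrt


def nb_log_likelihood(counts, mu, alpha):
    r = 1.0 / alpha
    return sum(
        lgamma(k + r) - lgamma(r) - lgamma(k + 1)
        + r * log(r / (r + mu))
        + k * log(mu / (r + mu))
        for k in counts
    )


def lrt_p_value(group_a, group_b, alpha):
    """LRT statistic and p-value (1 degree of freedom). Means must be > 0."""
    pooled = list(group_a) + list(group_b)
    mu_a = sum(group_a) / len(group_a)
    mu_b = sum(group_b) / len(group_b)
    mu_0 = sum(pooled) / len(pooled)  # common mean under the null
    ll_alt = nb_log_likelihood(group_a, mu_a, alpha) + nb_log_likelihood(group_b, mu_b, alpha)
    ll_null = nb_log_likelihood(pooled, mu_0, alpha)
    lrt = max(0.0, 2.0 * (ll_alt - ll_null))  # clip tiny negative rounding error
    # Upper tail of chi-squared with 1 df: P(Z^2 >= x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(lrt / 2.0))
    return lrt, p_value


# Hypothetical counts for one gene.
normal = [10, 12, 11, 9, 13]
cancerous = [30, 28, 35, 33, 29]
print(lrt_p_value(normal, cancerous, alpha=0.05))
```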
Interpretation:
A volcano plot displays the relationship between each gene's statistical significance (p-value) and the magnitude of change (fold change).
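A sketch of how the coordinates of a volcano plot could be computed from per-gene results: the x-axis is the log2 fold change and the y-axis is the negative log10 of the p-value. The gene names, values, and cutoffs below are hypothetical choices, not recommendations.

```python
from math import log10


def volcano_point(log2_fc, p_value, fc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, label) for one gene on a volcano plot."""
    y = -log10(p_value)  # small p-values end up high on the plot
    if p_value < p_cutoff and log2_fc >= fc_cutoff:
        label = "up"
    elif p_value < p_cutoff and log2_fc <= -fc_cutoff:
        label = "down"
    else:
        label = "not significant"
    return log2_fc, y, label


# Hypothetical per-gene results: (log2 fold change, p-value).
results = {
    "GENE_A": (2.5, 1e-6),
    "GENE_B": (-1.8, 0.003),
    "GENE_C": (0.2, 0.60),
    "GENE_D": (3.0, 0.20),
}
for gene, (fc, p) in results.items():
    print(gene, volcano_point(fc, p))
```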
An MA plot visualizes the relationship between the average expression (A) and the log fold change (M) for each gene.
Interpretation:
Usage: Identifying trends or biases in expression data, such as mean-dependent variance.
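The two MA-plot coordinates can be computed per gene from the mean expression in each condition. The pseudocount of 1 (to avoid taking the log of zero) and the input values are assumptions for this sketch.

```python
from math import log2


def ma_values(mean_a, mean_b, pseudocount=1.0):
    """Return (M, A): log2 fold change and average log2 expression."""
    log_a = log2(mean_a + pseudocount)
    log_b = log2(mean_b + pseudocount)
    m = log_b - log_a          # M: difference of log expression
    a = 0.5 * (log_a + log_b)  # A: average of log expression
    return m, a


# Hypothetical mean counts for one gene in two conditions.
print(ma_values(3, 15))
```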
Components:
A heatmap displays the expression levels of multiple genes across different samples using color gradients
Interpretation: Identifying clusters of co-expressed genes and sample groupings based on expression profiles.
PCA transforms high-dimensional gene expression data into principal components that capture the most variance
Axes: Principal components representing the most significant sources of variation
Interpretation:
Usage: Assessing batch effects, overall data structure, and sample quality
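To show what a principal component is, the sketch below finds only the first one using power iteration on mean-centered data. This is a simplified stand-in for a full PCA routine, written with the standard library only, and the expression values are hypothetical.

```python
from math import sqrt


def first_principal_component(samples, iterations=200):
    """samples: list of per-sample expression vectors with the same gene order.

    Returns (direction, scores): the unit-length first principal axis and
    each sample's coordinate along it.
    """
    n, g = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(g)]
    centered = [[s[j] - means[j] for j in range(g)] for s in samples]

    def project(v):
        return [sum(row[j] * v[j] for j in range(g)) for row in centered]

    v = [1.0] + [0.0] * (g - 1)  # arbitrary starting direction
    for _ in range(iterations):
        scores = project(v)
        # Multiply by the (unnormalized) covariance matrix: X^T (X v).
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(g)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance along this direction
        v = [x / norm for x in w]
    return v, project(v)


# Hypothetical log-expression values: 4 samples x 3 genes.
samples = [
    [5.0, 8.0, 2.0],  # normal
    [5.2, 7.9, 2.1],  # normal
    [9.0, 4.0, 2.0],  # cancerous
    [9.1, 4.2, 1.9],  # cancerous
]
direction, scores = first_principal_component(samples)
print("PC1 direction:", [round(x, 3) for x in direction])
print("PC1 scores:", [round(s, 3) for s in scores])
```

Samples from the same group should receive similar scores, which is what a PCA plot shows as clustering.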