Case-control analysis of single-cell RNA-seq studies

Petukhov V, Rydbirk R, Igolkina A, Mei S, Kharchenko P, Khodosevich K

viktor.petukhov@pm.me

Description of the problem

Two conditions, multiple samples per condition, multiple cells per sample

Case samples

Control samples

What cell types are affected and how?

Prepare for further experiments:

Which subtypes should we focus on?
Which genes per subtype should we investigate further?

Questions for existing data:

Did some cell types change their expression in a similar way?
Which genes changed their expression in a similar way?
All other patterns in expression changes we can think of

Existing solutions

*Integrating single-cell transcriptomic data across different conditions, technologies, and species, Butler A, et al. Nature Biotechnology, 2018

Align samples

scVI, Conos, ..., Seurat

See the review from the Theis lab

Perform joint annotation

Run differential expression

*Single-cell genomics identifies cell type–specific molecular changes in autism, Velmeshev D, et al. Science, 2019

Run Gene Ontology analysis

Compare cell type proportions

*A single-cell atlas of entorhinal cortex from individuals with Alzheimer’s disease reveals cell-type-specific gene expression regulation, Grubman A, et al. Nature Neuroscience, 2019

Why it is not enough

The main questions:

What cell types are the most affected?
How exactly are they affected?

Number of DE genes is not an answer

There are up to 1000 significant DE genes per type and 2795 unique DE genes in total

*Pfisterer, U., Petukhov, V., Demharter, S. et al. Identification of epilepsy-associated neuronal subtypes and gene expression underlying epileptogenesis. Nat Commun 11, 5038 (2020).

There are up to 300 significant GO terms per type and 796 unique terms in total

DE depends on the depth and quality of the annotation

Case-control analysis of single-cell studies: a fresh approach

What can we possibly do?

Two kinds of analysis:

Compositional analysis
Gene expression analysis

Two approaches:

Cluster-based
Cluster-free

[Figure: control and epilepsy samples]

Composition analysis, cluster-based

Problem: changes are not independent

*Schirmer, L., Velmeshev, D., Holmqvist, S. et al. Neuronal vulnerability and multilineage diversity in multiple sclerosis. Nature 573, 75–82 (2019)
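Why the changes are not independent: cell type proportions within a sample sum to one, so a change in the abundance of one type shifts the proportions of all the others. The sketch below illustrates this with made-up counts (the cell type names and numbers are hypothetical, not taken from the cited data). The log-ratio at the end is a standard tool from compositional data analysis, shown only for illustration and not necessarily the approach used in this work.

```python
import numpy as np

# Hypothetical cell counts for one control and one case sample.
cell_types = ["Excitatory", "Inhibitory", "Astrocytes"]
control = np.array([500.0, 300.0, 200.0])
# Only the first cell type changes in absolute numbers (it doubles).
case = np.array([1000.0, 300.0, 200.0])


def proportions(counts):
    """Convert absolute counts to fractions of the sample."""
    return counts / counts.sum()


# Proportions of the two unchanged types drop, although their counts did not.
prop_diff = proportions(case) - proportions(control)
for name, diff in zip(cell_types, prop_diff):
    print(f"{name}: change in proportion = {diff:+.3f}")

# The log-ratio between the two unchanged types is not affected by the
# change in the first type, because the sample total cancels out.
log_ratio_control = np.log(control[1] / control[2])
log_ratio_case = np.log(case[1] / case[2])
print(log_ratio_control, log_ratio_case)
```

Testing each cell type's proportion separately would report all three types as changed here, even though only one of them actually differs in abundance.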

Composition analysis, cluster-free

[Figure: panels labelled Control, Multiple sclerosis, and Z-scores; additional labels Embedding and Graph]

Multiple-comparison adjustment

Measurements are highly correlated
A large difference in one cell is good
A large difference in many cells is even better
Increasing the number of cells should not decrease p-values

Permute the per-sample condition labels ~200 times and compute a difference statistic under the null for each permutation
Apply graph smoothing
Winsorize (1%) the scores
Take the minimum and maximum of the scores for each permutation
Approximate the null distributions of the minimums and maximums with KDE
The p-value for each observed score is its tail probability under the corresponding null distribution

*Nichols TE, Holmes AP. Nonparametric permutation tests for functional neuroimaging: a primer with examples. Hum Brain Mapp. 2002;15(1):1-25. doi:10.1002/hbm.1058
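The permutation steps above can be sketched roughly as follows. This is a minimal illustration on simulated data, not the actual implementation. The following are all assumptions made for the example: the input layout (a samples-by-cells matrix `values`), the statistic (difference of group means), the ring-shaped smoothing graph, the choice to winsorize only the permuted scores, and the use of the maximum null for positive scores and the minimum null for negative ones. All function names are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde


def difference_scores(values, is_case, smoothing):
    """Per-cell case-minus-control mean difference, smoothed over the graph."""
    raw = values[is_case].mean(axis=0) - values[~is_case].mean(axis=0)
    return smoothing @ raw


def winsorize(scores, fraction=0.01):
    """Clip scores to the [fraction, 1 - fraction] quantile range."""
    low, high = np.quantile(scores, [fraction, 1.0 - fraction])
    return np.clip(scores, low, high)


def max_stat_pvalues(values, is_case, smoothing, n_permutations=200, seed=0):
    rng = np.random.default_rng(seed)
    observed = difference_scores(values, is_case, smoothing)
    null_min = np.empty(n_permutations)
    null_max = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffle condition labels between samples, never between cells.
        permuted = rng.permutation(is_case)
        null = winsorize(difference_scores(values, permuted, smoothing))
        null_min[i], null_max[i] = null.min(), null.max()
    # Smooth approximations of the null distributions of the extremes.
    kde_min, kde_max = gaussian_kde(null_min), gaussian_kde(null_max)
    # Upper tail of the max-null for positive scores,
    # lower tail of the min-null for negative scores.
    upper = np.array([kde_max.integrate_box_1d(s, np.inf) for s in observed])
    lower = np.array([kde_min.integrate_box_1d(-np.inf, s) for s in observed])
    return observed, np.where(observed >= 0, upper, lower)


# Simulated example: 5 control and 5 case samples, 300 cells on a ring graph.
rng = np.random.default_rng(1)
n_samples, n_cells = 10, 300
is_case = np.array([False] * 5 + [True] * 5)
values = rng.normal(size=(n_samples, n_cells))
values[is_case, :30] += 3.0  # true effect in the first 30 cells only

# Row-normalized smoothing matrix: each cell averages itself and 4 neighbors.
idx = np.arange(n_cells)
smoothing = np.zeros((n_cells, n_cells))
for offset in (-2, -1, 0, 1, 2):
    smoothing[idx, (idx + offset) % n_cells] = 0.2

scores, pvals = max_stat_pvalues(values, is_case, smoothing)
print("smallest p-value among affected cells:", pvals[5:25].min())
print("median p-value among unaffected cells:", np.median(pvals[60:]))
```

Because each observed score is compared against the null distribution of the extreme over all cells, the resulting p-values are already adjusted for multiple comparisons, and the correlation between neighboring cells is preserved in every permutation.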