Case-control analysis of single-cell RNA-seq studies

Petukhov V, Rydbirk R, Igolkina A, Mei S, Kharchenko P, Khodosevich K

viktor.petukhov@pm.me

Description of the problem

Two conditions, multiple samples per condition, multiple cells per sample

Case samples

Control samples

What cell types are affected and how?

Prepare for further experiments:

  • Which subtypes should we focus on?
  • Which genes per subtype should we investigate further?

Questions for the existing data:

  • Did some cell types change their expression in a similar way?
  • Which genes changed their expression in a similar way?
  • Any other patterns in expression changes we can think of

Existing solutions

Align samples

scVI, Conos, ..., Seurat

See the review from the Theis lab

Existing solutions

  1. Align samples
  2. Perform joint annotation
  3. Run differential expression
  4. Run Gene Ontology analysis
  5. Compare cell type proportions

Why it is not enough

The main questions:

  • What cell types are the most affected?
  • How exactly are they affected?

Why it is not enough

The number of DE genes is not an answer:

  • There are up to 1000 significant DE genes per cell type and 2795 unique DE genes in total
  • There are up to 300 significant GO terms per cell type and 796 unique terms in total

Why it is not enough

DE depends on the depth and quality of the annotation

Compositional analysis

Gene expression analysis

Case-control analysis of single-cell studies: a fresh approach

What can we possibly do?

Two kinds of analysis, each either cluster-based or cluster-free:

  • Gene expression analysis
  • Compositional analysis

[Figure panels: control, epilepsy]

Composition analysis, cluster-based

Problem: changes are not independent. Proportions sum to one, so expanding one cell type lowers the fractions of all others.
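
A minimal sketch of why proportion changes are coupled. The cell type names and counts are made up for illustration; only cell type A changes in absolute numbers, yet the fractions of B and C shift as well.

```python
def proportions(counts):
    """Convert absolute cell counts per type into fractions that sum to 1."""
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}


# Hypothetical absolute numbers of cells per type in two samples.
control = {"A": 100, "B": 100, "C": 100}
# Only type A expands in the case sample; B and C keep their absolute counts.
case = {"A": 300, "B": 100, "C": 100}

p_control = proportions(control)
p_case = proportions(case)

for cell_type in control:
    change = p_case[cell_type] - p_control[cell_type]
    print(f"{cell_type}: {p_control[cell_type]:.2f} -> {p_case[cell_type]:.2f} ({change:+.2f})")
```

Testing each cell type's fraction separately would report B and C as depleted, although nothing happened to them. This is the dependence that a cluster-based compositional test has to account for.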

Composition analysis, cluster-free

[Figure: Control vs. multiple sclerosis (MS); z-scores estimated on the embedding and on the graph]

Composition analysis, cluster-free

Multiple-comparison adjustment

  • Measurements are highly correlated
  • A large difference in one cell is good
  • A large difference in many cells is even better
  • Increasing the number of cells should not decrease p-values

Composition analysis, cluster-free

Multiple-comparison adjustment

  1. Permute the condition labels of the samples ~200 times and compute a difference statistic for each permutation to estimate its behaviour under the null
  2. Apply graph smoothing to the scores
  3. Winsorize the scores (1%)
  4. Take the minimum and the maximum of the scores in each permutation
  5. Approximate the null distributions of the minima and maxima with KDE
  6. The p-value of each observed score is its tail probability under the corresponding null distribution
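
The six steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the Cacoa implementation: the per-cell statistic (difference of mean per-sample density), the single round of neighbor averaging, the rule-of-thumb KDE bandwidth, and the choice to compare positive scores with the null maxima and negative scores with the null minima are all assumptions, and every function name is hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def cell_scores(density, labels):
    """Assumed statistic: mean density over case samples minus control, per cell."""
    case = [d for d, lab in zip(density, labels) if lab == "case"]
    ctrl = [d for d, lab in zip(density, labels) if lab == "control"]
    return [mean(d[i] for d in case) - mean(d[i] for d in ctrl)
            for i in range(len(density[0]))]


def smooth(scores, neighbors):
    """Step 2: average every cell with its graph neighbors (one round)."""
    return [mean([scores[i]] + [scores[j] for j in neighbors[i]])
            for i in range(len(scores))]


def winsorize(scores, frac=0.01):
    """Step 3: clip the lowest and highest `frac` of the scores."""
    ordered = sorted(scores)
    k = int(frac * len(ordered))  # number of values clipped at each end
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(s, low), high) for s in scores]


def processed_scores(density, labels, neighbors):
    return winsorize(smooth(cell_scores(density, labels), neighbors))


def kde_tail(x, null_values, upper):
    """Steps 5-6: tail probability of x under a Gaussian KDE of the null values."""
    h = 1.06 * stdev(null_values) * len(null_values) ** -0.2  # rule-of-thumb bandwidth
    h = max(h, 1e-12)  # guard against a degenerate null
    cdf = mean(NormalDist(v, h).cdf(x) for v in null_values)
    return 1 - cdf if upper else cdf


def adjusted_p_values(density, labels, neighbors, n_perm=200, seed=0):
    rng = random.Random(seed)
    observed = processed_scores(density, labels, neighbors)
    null_min, null_max = [], []
    for _ in range(n_perm):
        permuted = labels[:]
        rng.shuffle(permuted)  # step 1: permute labels at the sample level
        scores = processed_scores(density, permuted, neighbors)
        null_min.append(min(scores))  # step 4
        null_max.append(max(scores))
    return [kde_tail(s, null_max, upper=True) if s >= 0
            else kde_tail(s, null_min, upper=False)
            for s in observed]


# Toy data: 10 samples, 30 cells on a ring graph, cells 0-4 enriched in cases.
rng = random.Random(1)
n_cells = 30
labels = ["case"] * 5 + ["control"] * 5
density = [[1 + rng.gauss(0, 0.05) + (1.0 if lab == "case" and i < 5 else 0.0)
            for i in range(n_cells)] for lab in labels]
neighbors = [[(i - 1) % n_cells, (i + 1) % n_cells] for i in range(n_cells)]

p_values = adjusted_p_values(density, labels, neighbors)
print(f"enriched cell: {p_values[2]:.3f}, unaffected cell: {p_values[15]:.3f}")
```

Because every observed score is compared with the extreme value over all cells in a permutation, the correction does not grow with the number of cells the way a per-test adjustment would, and correlated neighbors do not inflate it.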

Composition analysis, cluster-free

[Figure: Control vs. MS; raw and adjusted z-scores on the embedding and on the graph]

Expression analysis, cluster-based

What cell types are affected the most?

Z = \frac{d_{between}}{\overline{d}_{control}}
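
A minimal sketch of the ratio above for a single cell type. The slide does not define the terms, so the reading used here is an assumption: each sample is a pseudobulk expression vector, d_between is the mean distance over case-control sample pairs, and \overline{d}_{control} is the mean distance over control-control pairs. The correlation distance and all names and numbers are illustrative.

```python
from itertools import combinations, product
from math import sqrt
from statistics import mean


def correlation_distance(x, y):
    """1 - Pearson correlation between two expression vectors (assumed metric)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return 1 - cov / norm


def normalized_shift(case, control):
    """Z = d_between / mean(d_control) for one cell type."""
    d_between = mean(correlation_distance(a, b) for a, b in product(case, control))
    d_control = mean(correlation_distance(a, b) for a, b in combinations(control, 2))
    return d_between / d_control


# Hypothetical pseudobulk profiles (4 genes) for three samples per condition.
control = [[10, 5, 1, 8], [11, 5, 2, 7], [9, 6, 1, 8]]
case = [[4, 5, 9, 8], [5, 6, 10, 7], [4, 4, 9, 9]]

z = normalized_shift(case, control)
print(f"Z = {z:.1f}")
```

Dividing by the within-control distance puts cell types on a comparable scale: a value well above 1 means that case samples differ from controls by more than controls differ from each other.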

Expression analysis, cluster-based

Visualization of sample structure

Expression analysis, cluster-free

Differential expression on single cells

control

epilepsy

Differential expression

Aggregate by samples
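
One plausible reading of "aggregate by samples", sketched for a single gene and a single cell: collect the cell's graph neighborhood, average expression within each sample, and compare the per-sample means between conditions so that samples, not cells, act as replicates. The Welch-style z-score and all names are assumptions for illustration; the actual Cacoa procedure may differ.

```python
from math import sqrt
from statistics import mean, variance


def neighborhood_z(cell, neighbors, expression, sample_of, condition_of):
    """Z-score of one gene around one cell, using samples as replicates."""
    # Group the expression values of the cell and its neighbors by sample.
    per_sample = {}
    for j in [cell] + neighbors[cell]:
        per_sample.setdefault(sample_of[j], []).append(expression[j])

    # Aggregate by sample, then split the sample means by condition.
    means = {"case": [], "control": []}
    for sample, values in per_sample.items():
        means[condition_of[sample]].append(mean(values))

    case, ctrl = means["case"], means["control"]
    if len(case) < 2 or len(ctrl) < 2:
        return None  # not enough samples in this neighborhood
    se = sqrt(variance(case) / len(case) + variance(ctrl) / len(ctrl))
    return (mean(case) - mean(ctrl)) / se if se > 0 else None


# Toy neighborhood: 8 cells from 4 samples, every cell adjacent to all others.
sample_of = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"]
condition_of = {"s1": "case", "s2": "case", "s3": "control", "s4": "control"}
expression = [5, 6, 6, 7, 1, 2, 2, 3]
neighbors = [[j for j in range(8) if j != i] for i in range(8)]

z = neighborhood_z(0, neighbors, expression, sample_of, condition_of)
print(f"z = {z:.2f}")
```

Repeating this for every cell gives a per-cell score for the gene without relying on cluster boundaries. Averaging within samples first keeps a sample with many cells from dominating the estimate.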

Expression analysis, cluster-free

Gene programs on single cells

Program 1

Program 2

Expression analysis, cluster-free

Expression distances

This can probably be improved by estimating adjusted z-scores

Differential expression stability

*Done by Anna and Rasmus

Summary

  • Gene expression analysis: cluster-based and cluster-free
  • Compositional analysis: cluster-based and cluster-free

Cacoa, PM PK May 2021

By Viktor Petukhov
