Benchmarking and methods development for single cell data
PhD defense
Almut Lütge
Zürich, 28.06.23
single cell (transcriptome) data
Cells are the basic unit of life
Cells are the basic unit of life
Development
Stem cell
Differentiated cells
Tissue
Cells are the basic unit of life
Development
Stem cell
Differentiated cells
Tissue
Disease
Cancer
Immune system
Cells are regulated at different molecular levels
Cells are regulated at different molecular levels
Cells are regulated at different molecular levels
Transcription
Cells are regulated at different molecular levels
Transcription
Translation
From reads to gene expression profiles
aatgctgcgctaatcgcgcgtatcgggatcatgccctagtggccccatattggcgtcaggtcgaacggatcttcggtgactccatgcattttcaggctcactgtggca
From reads to gene expression profiles
aatgctgcgctaatcgcgcgtatcgggatcatgccctagtggccccatattggcgtcaggtcgaacggatcttcggtgactccatgcattttcaggctcactgtggca
alignment, filtering, counting
From reads to gene expression profiles
aatgctgcgctaatcgcgcgtatcgggatcatgccctagtggccccatattggcgtcaggtcgaacggatcttcggtgactccatgcattttcaggctcactgtggca
alignment, filtering, counting
filtering,
QC, normalization
embedding
From reads to gene expression profiles
aatgctgcgctaatcgcgcgtatcgggatcatgccctagtggccccatattggcgtcaggtcgaacggatcttcggtgactccatgcattttcaggctcactgtggca
alignment, filtering, counting
filtering,
QC, normalization
embedding
clustering
trajectory
marker genes
differentiation
From reads to gene expression profiles
single cell tools are on the rise
single cell tools are on the rise
Batch effects
Batch effects: Definition
Differences between data sets [..][that] occur due to uncontrolled variability in experimental factors (Lun, 2019)
Batch effects: Definition
Differences between data sets [..][that] occur due to uncontrolled variability in experimental factors (Lun, 2019)
- context-dependent
Batch effects: Definition
Differences between data sets [..][that] occur due to uncontrolled variability in experimental factors (Lun, 2019)
- context-dependent
- risk to mask biological signals or create spurious pattern
Example: No batch effect
Batch 1
Batch 2
Example: No batch effect
Batch 1
Batch 2
tsne1
tsne2
Example: batch effect
Batch 1
Batch 2
Example: batch effect
Batch 1
Batch 2
tsne1
tsne2
Example: cell types + No batch effect
Batch 1
Batch 2
Example: cell types + No batch effect
Batch 1
Batch 2
tsne1
tsne2
Batch 1
Batch 2
Example: cell types + batch effect
Batch 1
Batch 2
tsne1
tsne2
Example: cell types + batch effect
Example: batch effect
Batch 1
Batch 2
tsne1
tsne2
tsne1
tsne2
batch Integration (correction)
Batch 2
Batch 1
Common embedding
dim1
dim2
single cell tools are on the rise
Batch mixing metrics:
How to quantify batch effects? Comparison of different batch mixing metrics
How to quantify batch effects?
- quantify "mixing" of batches
- Different levels: global, cell type, cell
How to quantify batch effects?
- quantify "mixing" of batches
- Different levels: global, cell type, cell
How to quantify batch effects?
- quantify "mixing" of batches
- Different levels: global, cell type, cell
Cellspecific mixing score (cms): No batch effect
Cellspecific mixing score (cms): No batch effect
Cellspecific mixing score (cms): No batch effect
Cellspecific mixing score (cms): No batch effect
CMS scales with batch randomness
Task 1: Scaling and sensitivity
Aim: Test whether metrics scale with (synthetic) batch strength; Estimate lower limit of batch detection
Spearman correlation of metrics with the batch logFC in simulation series on the same dataset; Minimal batch logFC that is recognized from the metrics as batch effect
Task 1: increasing batch strength
Task 1: increasing batch strength
Metrics vary in their batch detection ranges
batch mixing metrics are context-dependent
Benchmarking
Systematic performance comparisons
Datasets
Systematic performance comparisons
Datasets
Methods
Systematic performance comparisons
Datasets
Methods
Metrics
Systematic performance comparisons
Datasets
Methods
Metrics
benchmarks to guide method choice
The self assessment trap
The self assessment trap
1. cms_default 2. cms_kmin 3. lisi
The self assessment trap
1. cms_default 2. cms_kmin 3. lisi
Norel et al, 2011
Benchmarking results can be ambiguous
Luecken et al., 2021
Different benchmarks - different conclusions
- Scanorama
- Conos
- Harmony
- Limma
- Combat
- Liger
- Seurat
- Harmony
-
- Seurat
- Harmony
- Scanorama
- Liger
- TrVAE
- Seurat
"benchmarking [..] is comparable to asking how good a baseball player is by testing how quickly he or she hits or runs under very controlled circumstances." Kasper Lage, 2020
Do benchmarks reflect reality?
"benchmarking [..] is comparable to asking how good a baseball player is by testing how quickly he or she hits or runs under very controlled circumstances." Kasper Lage, 2020
Robert Lewandowski
max. speed: 32.71 km/h
Timo Werner
max. speed: 34.1 km/h
Do benchmarks reflect reality?
"benchmarking [..] is comparable to asking how good a baseball player is by testing how quickly he or she hits or runs under very controlled circumstances." Kasper Lage, 2020
Robert Lewandowski
max. speed: 32.71 km/h
passes: 765
Timo Werner
max. speed: 34.1 km/h
passes: 660
Do benchmarks reflect reality?
"benchmarking [..] is comparable to asking how good a baseball player is by testing how quickly he or she hits or runs under very controlled circumstances." Kasper Lage, 2020
Robert Lewandowski
max. speed: 32.71 km/h
passes: 765
goals: 23
Timo Werner
max. speed: 34.1 km/h
passes: 660
goals: 9
Do benchmarks reflect reality?
status quo:
Meta-analysis of 62 method benchmarks in the field of single cell omics
62 single cell omics method benchmarks
62 single cell omics method benchmarks
Meta-analysis:
-
Title
-
Number of datasets used in evaluations:
-
Number of methods evaluated:
-
Degree to which authors are neutral:
...
22. Type of workflow system used:
62 single cell omics method benchmarks
2 reviewer per benchmark
Meta-analysis:
-
Title
-
Number of datasets used in evaluations:
-
Number of methods evaluated:
-
Degree to which authors are neutral:
...
22. Type of workflow system used:
62 single cell omics method benchmarks
2 reviewer per benchmark
Meta-analysis:
-
Title
-
Number of datasets used in evaluations:
-
Number of methods evaluated:
-
Degree to which authors are neutral:
...
22. Type of workflow system used:
independent harmonization of responses
62 single cell omics method benchmarks
2 reviewer per benchmark
Meta-analysis:
-
Title
-
Number of datasets used in evaluations:
-
Number of methods evaluated:
-
Degree to which authors are neutral:
...
22. Type of workflow system used:
independent harmonization of responses
summaries
Benchmark code is available but not extensible
Raw input data are available, but not results
Open and continuous benchmarking
Code
available
extensible
reusable
currently part of most benchmarks
not part of current standards
Open and continuous benchmarking
Data
inputs
simulations
results
Code
available
extensible
reusable
currently part of most benchmarks
not part of current standards
Open and continuous benchmarking
Data
inputs
simulations
results
Reproducibility
versions
environments
workflows
Code
available
extensible
reusable
currently part of most benchmarks
not part of current standards
Open and continuous benchmarking
Data
inputs
simulations
results
Reproducibility
versions
environments
workflows
scale
comprehensive
continuous
Code
available
extensible
reusable
currently part of most benchmarks
not part of current standards
Omnibenchmark:
open, continuous and collaborative benchmarking
A platform for collaborative and continuous benchmarking
Method developer/
Benchmarker
Method user
Methods
Datasets
Metrics
Omnibenchmark
Goals:
- continuous
- software environments
- workflows
- all "products" can be accessed
- anyone can contribute
Design: Benchmark modules
Data
standardized datasets
= 1 "module" (renku project )
Design: Benchmark modules
Data
standardized datasets
= 1 "module" (renku project )
Methods
method results
Design: Benchmark modules
Data
standardized datasets
= 1 "module" (renku project )
Methods
method results
Metrics
metric results
Design: Benchmark modules
Data
standardized datasets
= 1 "module" (renku project )
Methods
method results
Metrics
metric results
Dashboard
interactive result exploration
Design: Benchmark modules
Data
standardized datasets
= 1 "module" (renku project )
Methods
method results
Metrics
metric results
Dashboard
interactive result exploration
Method user
Method developer/
Benchmarker
modules are connected through data bundles
Data X
Data y
Data Z
= 1 "data bundle" (data files + meta data)
= 1 "module" (renku project )
modules are connected through data bundles
Data X
Data y
Data Z
= 1 "data bundle" (data files + meta data)
= 1 "module" (renku project )
modules are connected through data bundles
Data X
Data y
Data Z
= 1 "data bundle" (data files + meta data)
= 1 "module" (renku project )
process
modules are connected through data bundles
Data X
Data y
Data Z
= 1 "data bundle" (data files + meta data)
= 1 "module" (renku project )
process
Method 1
Method 2
modules are connected through data bundles
Data X
Data y
Data Z
= 1 "data bundle" (data files + meta data)
= 1 "module" (renku project )
process
Method 1
Method 2
Method 3
Benchmark runs are coordinated by an Orchestrator
Data X
Data y
Data Z
process
Method 1
Method 2
Method 3
Orchestrator
omnibenchmark allows flexible architectures
bettr: A better way to explore what is best
https://www.oecdbetterlifeindex.org
bettr: A better way to explore what is best
Omnibenchmark
Data
Reproducibility
scale
Code
part of omnibenchmark
not part of omnibenchmark
Omnibenchmark
Data
Reproducibility
scale
Code
available
extensible
reusable
part of omnibenchmark
not part of omnibenchmark
Omnibenchmark
Data
inputs
simulations
results
Reproducibility
scale
Code
available
extensible
reusable
part of omnibenchmark
not part of omnibenchmark
Omnibenchmark
Data
inputs
simulations
results
Reproducibility
versions
environments
workflows
scale
Code
available
extensible
reusable
part of omnibenchmark
not part of omnibenchmark
Omnibenchmark
Data
inputs
simulations
results
Reproducibility
versions
environments
workflows
scale
(comprehensive)
continuous
comparable
Code
available
extensible
reusable
part of omnibenchmark
not part of omnibenchmark
(Open) questions?
(Open) questions?
- Technical maturity for open and continuous benchmarking?
easy
flexible
(Open) questions?
- Technical maturity for open and continuous benchmarking?
- What about authorities and risks for biases?
easy
flexible
(Open) questions?
- Technical maturity for open and continuous benchmarking?
- Are we asking the "right" questions?
- What about authorities and risks for biases?
easy
flexible
(Open) questions?
- Technical maturity for open and continuous benchmarking?
- Are we asking the "right" questions?
- What about authorities and risks for biases?
- Who pays and for how long?
easy
flexible
Acknowledgement
omnibenchmark: current status
Against the ’one method fits all data sets’ philosophy
- there is not one way to benchmark different methods
- the same benchmark can provide different/multiple conclusions
--> Dynamic, extensible and explorable benchmarking system
Collaborative benchmarking: METHOD DEVELOPER
Collaborative benchmarking: bENCHMARKER
Collaborative benchmarking: mETHOD uSER
the knowledge graph: benchmark metadata are stored as triples
Result
Code
generated
used_by
has_attribute
keyword
has_attribute
keyword
Data
Code
Result
used_by
generated
User interaction with renku client
Automatic triplet generation
Triplet store "Knowledge graph"
User interaction with renku client
KG-endpoint queries
Omnibenchmark components
contributer
user
omnibenchmark-python
omniValidator
benchmarker
projects
templates
omb-site
{
orchestrator
triplestore
omni-sparql
dashboards
Orchestrator coordinate benchmark runs
GitLab
Docker
Workflow
Module:
Template code
Module code
Data bundle
GitLFS/S3
Input/Output files
Metadata
Workflow manager are rarely used
Batch effects: example
tsne1
tsne2
Batch effects: Why do we care?
Batch effects: Why do we care?
Batch effects: characteristics
Metrics vary in their batch detection ranges
Task 2: Batch label permutation
Aim: Negative control and test whether metrics scale with randomness
Spearman correlation of metrics with the percentage of randomly permuted batch label
Task 2: increasing randomness
Most metrics scale with increasing randomness
Task 3: Batch characteristics
Aim: Test whether metrics reflect batch strength across datasets
Spearman correlation of metrics with surrogates of batch strength (e.g., percent variance explained by batch (PVE-Batch) and proportion of DE genes between batches) across datasets
Task 3: batch characteristics
Variance attribution
Batch DE genes
Across batch comparisons are limited
Task 4: Imbalanced batches
Aim: Reaction of metrics to imbalanced cell type abundance within the same dataset
Test sensitivity towards imbalance of cell type abundance
Task 4: Imbalanced batches
Batch strength and Imbalance can be mixed
spectrometry
imaging
aatgctgcgctaatcgcgcgta
tcgggatcatgccctagtggcc
cgccatattggcgtcaggtcga
atcggatccggtgactccatgc
atttcaggctcactgtggcacc
sequencing
How to study Cells at molecular resolution ?
Cells are regulated at different molecular levels
Which method should I use?
Luecken et al., 2021
Defense
By Almut Luetge
Defense
- 94