Omnibenchmark:

Challenges and opportunities of open and continuous community benchmarking

DMLS seminar,

Almut Lütge - Robinson group,

Zürich, 18.10.2022

benchmark:

Systematic comparison of methods/processes to understand underlying features and/or find the most suitable procedure for a specific task

single cell tools are on the rise

Number of tools per year

https://www.scrna-tools.org/

Comparisons can be ambiguous

Luecken et al., 2020

Metrics benchmark:

Systematic comparison of metrics to understand their performance and find the most suitable metric to evaluate batch correction

Task 1: Scaling and detection limits

Aim: Test whether metrics scale with (synthetic) batch strength; Estimate lower limit of batch detection

Spearman correlation of metrics with the batch logFC across a simulation series on the same dataset; minimal batch logFC that the metrics recognize as a batch effect
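
As a minimal sketch of this evaluation, the block below computes a Spearman correlation between simulated batch logFCs and one metric's scores, and reads off the smallest logFC at which the score leaves a baseline. The scores, the `tolerance` threshold, and the function names are illustrative assumptions, not values or code from the benchmark.

```python
from typing import Optional, Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all values tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks (inputs must not be constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def detection_limit(log_fcs: Sequence[float], scores: Sequence[float],
                    baseline: float, tolerance: float) -> Optional[float]:
    """Smallest logFC whose score differs from the baseline by more than the tolerance."""
    detected = [fc for fc, s in zip(log_fcs, scores) if abs(s - baseline) > tolerance]
    return min(detected) if detected else None


# Hypothetical simulation series: one metric score per simulated batch logFC.
log_fcs = [0.0, 0.5, 1.0, 1.5, 2.0]
scores = [0.02, 0.03, 0.20, 0.45, 0.70]

rho = spearman(log_fcs, scores)
limit = detection_limit(log_fcs, scores, baseline=scores[0], tolerance=0.05)
```

A metric that scales well gives a correlation close to 1 (or -1 if it decreases with batch strength); a lower detection limit means the metric picks up weaker batch effects.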

Increasing batch log fold changes

Metrics have different ranges to detect and distinguish batch effects

Task 2: Batch label permutation

Aim: Negative control and test whether metrics scale with randomness

Spearman correlation of metrics with the percentage of randomly permuted batch labels
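
One way to build such a permutation series, sketched under assumptions: pick a fraction of cells at random and shuffle the batch labels among them only. The function name, seed handling, and the chosen percentages are hypothetical and not taken from the benchmark code.

```python
import random
from typing import Sequence


def permute_labels(labels: Sequence[str], fraction: float, seed: int = 0) -> list[str]:
    """Shuffle the batch labels of a randomly chosen fraction of cells."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_permute = round(fraction * len(labels))
    # Choose which cells take part in the shuffle.
    chosen = rng.sample(range(len(labels)), n_permute)
    shuffled = [labels[i] for i in chosen]
    rng.shuffle(shuffled)
    permuted = list(labels)
    for idx, label in zip(chosen, shuffled):
        permuted[idx] = label
    return permuted


# Hypothetical dataset with two equally sized batches.
labels = ["batch1"] * 50 + ["batch2"] * 50
series = {pct: permute_labels(labels, pct / 100) for pct in (0, 20, 50, 100)}
```

Shuffling within the chosen cells keeps the batch sizes unchanged, so any change in a metric comes from the label assignment alone.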

Increasing percentage of randomly permuted batch labels

Most metrics scale with label randomness

Task 3: Batch characteristics

Aim: Test whether metrics reflect batch strength across datasets

Spearman correlation of metrics with surrogates of batch strength (e.g., percent variance explained by batch (PVE-Batch) and proportion of DE genes between batches) across datasets
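
To make the PVE-Batch surrogate concrete, here is a minimal sketch for a single gene. It assumes a simple one-way decomposition (between-batch sum of squares over total sum of squares); the benchmark may have used a different estimator, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Sequence


def pve_batch(expression: Sequence[float], batches: Sequence[str]) -> float:
    """Percent of one gene's variance explained by batch (between-batch SS / total SS)."""
    n = len(expression)
    grand_mean = sum(expression) / n
    total_ss = sum((x - grand_mean) ** 2 for x in expression)
    if total_ss == 0:
        return 0.0  # a constant gene has no variance to explain
    groups = defaultdict(list)
    for value, batch in zip(expression, batches):
        groups[batch].append(value)
    between_ss = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return 100.0 * between_ss / total_ss


batches = ["a", "a", "b", "b"]
fully_batch_driven = pve_batch([1.0, 1.0, 3.0, 3.0], batches)
batch_independent = pve_batch([1.0, 3.0, 1.0, 3.0], batches)
```

Averaging such per-gene values over a dataset gives one surrogate of batch strength that can then be correlated with each metric across datasets.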

Percent variance explained by ..

Not all metrics reflect batch characteristics

Task 4: Imbalanced batches

Aim: Test how metrics react to imbalanced cell type abundance within the same dataset

Test sensitivity towards imbalance of cell type abundance
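
A sketch of how imbalance could be introduced, assuming it is created by removing part of one cell type from one batch. The function name, the `keep_fraction` parameter, and the toy labels are illustrative assumptions.

```python
import random
from typing import Sequence


def downsample_cell_type(cell_types: Sequence[str], batches: Sequence[str],
                         cell_type: str, batch: str,
                         keep_fraction: float, seed: int = 0) -> list[int]:
    """Indices of the cells kept after thinning one cell type in one batch."""
    rng = random.Random(seed)
    target = [
        i for i, (ct, b) in enumerate(zip(cell_types, batches))
        if ct == cell_type and b == batch
    ]
    n_keep = round(keep_fraction * len(target))
    kept_target = set(rng.sample(target, n_keep))
    target_set = set(target)
    # Keep every cell outside the target group, plus the sampled target cells.
    return [i for i in range(len(cell_types)) if i not in target_set or i in kept_target]


# Hypothetical dataset: 40 cells, two cell types, two batches.
cell_types = ["T", "B"] * 20
batches = ["batch1"] * 20 + ["batch2"] * 20
kept = downsample_cell_type(cell_types, batches, "T", "batch1", keep_fraction=0.2)
```

Running the metrics on a series of decreasing `keep_fraction` values shows at which degree of imbalance each score starts to change.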

Increasing imbalance of the batch effect

Imbalanced batch effects affect the scores of most local metrics

Imbalance limits of different metrics

Summary of the metrics benchmark:

Comparisons can be ambiguous

Luecken et al., 2020

Static benchmarks are limited in scope and interpretability

There is not one way to benchmark different methods

There are multiple ways to interpret benchmark results

--> Open extensible community benchmarks

status quo:

Meta-analysis of 62 method benchmarks in the field of single-cell omics

62 single-cell omics method benchmarks

2 reviewers per benchmark

Meta-analysis:

Title
Number of datasets used in evaluations:
Number of methods evaluated:
Degree to which authors are neutral:

...

22. Type of workflow system used:

independent harmonization of responses

summaries

Benchmark designs:

Benchmark code is often available but not extensible

Input data are usually available, but results are not

Workflow managers are rarely used

Blocks of open and continuous benchmarking

Code

available

extensible

reusable

conclusion

neutral

community-driven

Reproducibility

code

workflows

environments

software versions

Time scale

static

continuous

Open Data

input data

method results

simulations

performance results

currently part of most benchmarks

not part of current standards

Omnibenchmark:

Open and continuous community benchmarking

Omnibenchmark is a platform for open and continuous community-driven benchmarking

Method developer/Benchmarker

Method user

Methods

Datasets

Metrics

Omnibenchmark

continuous
self-contained modules
all "products" can be accessed
provenance tracking
anyone can contribute

Omnibenchmark design

Data

standardized datasets

= 1 "module" (Renku project)

Methods

method results

Metrics

metric results

Dashboard

interactive result exploration

Method user

Method developer/Benchmarker

Omnibenchmark modules are independent and self-contained

GitLab project

Docker container

Workflow

Datasets

Collection of method* history, a description of how to run it, the computational environment, and datasets

components of omnibenchmark

Omnibenchmark-python

PyPI package for workflow and dataset management with renku/KG

orchestrator

CI/CD orchestrator to automatically run and update benchmarks

omnibenchmark-graph

Triplet store to perform cross-repository queries

Bettr

Shiny app to interactively explore results

Module processes (benchmark steps) are stored as triplets

[Diagram: Data, Code, and Result nodes connected by the predicates used_by and generated. Each connection is a triplet: subject, predicate, object.]
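
To illustrate the triplet idea, here is a minimal in-memory sketch. It assumes the reading "Data used_by Code" and "Code generated Result" for the two predicates; the file names and helper functions are hypothetical, and the real system stores triplets in a knowledge graph rather than a Python list.

```python
from typing import NamedTuple, Optional, Sequence


class Triplet(NamedTuple):
    subject: str
    predicate: str
    obj: str


def record_step(code: str, inputs: Sequence[str], outputs: Sequence[str]) -> list[Triplet]:
    """Describe one benchmark step as triplets (assumed direction of the predicates)."""
    triplets = [Triplet(data, "used_by", code) for data in inputs]
    triplets += [Triplet(code, "generated", result) for result in outputs]
    return triplets


def match(store: Sequence[Triplet], subject: Optional[str] = None,
          predicate: Optional[str] = None, obj: Optional[str] = None) -> list[Triplet]:
    """Return the triplets that agree with every field that is not None."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Hypothetical step: a method script reads one dataset and writes one result.
store = record_step("run_method.py", inputs=["counts.h5"], outputs=["clusters.csv"])
producers = match(store, predicate="generated", obj="clusters.csv")
```

Asking "which code generated this result?" then becomes a pattern match over the store, which is the kind of question provenance tracking is meant to answer.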

Modules are connected via a knowledge base

Module A

Module B

Triplet Store

triplet generation
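
A sketch of how a shared triplet store can link modules: if Module B uses a file that Module A generated, following the triplets backwards recovers the full lineage of a result. The module and file names are hypothetical, and a plain list of tuples stands in for the triplet store.

```python
def upstream(store: list[tuple[str, str, str]], entity: str) -> set[str]:
    """All entities that `entity` depends on, following triplets backwards."""
    parents: dict[str, set[str]] = {}
    for subject, predicate, obj in store:
        # Both predicates are assumed to point in the direction of data flow.
        if predicate in ("used_by", "generated"):
            parents.setdefault(obj, set()).add(subject)
    seen: set[str] = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical triplets generated independently by two modules.
module_a = [
    ("raw.h5", "used_by", "module_a/process.py"),
    ("module_a/process.py", "generated", "counts.h5"),
]
module_b = [
    ("counts.h5", "used_by", "module_b/method.py"),
    ("module_b/method.py", "generated", "clusters.csv"),
]

lineage = upstream(module_a + module_b, "clusters.csv")
```

Neither module needs to know about the other; the connection only appears once their triplets sit in the same store.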

bettr: A better way to explore what is best

https://www.oecdbetterlifeindex.org

Project status

2 continuous prototype benchmarks
Hackathon
Dashboard
Documentation
Templates

Acknowledgments

Robinson group

Mark Robinson

Anthony Sonrel

Izaskun Mallona

Pierre-Luc Germain

Renku team

Oksana Riba Grognuz

Friedrich Miescher Institute

Charlotte Soneson

THANK YOU!

What is renku?

A data analysis platform/system built from a set of microservices

GitLab --> version control and CI/CD

Apache Jena --> Triple store

Jupyter server --> interactive sessions

Docker/Kubernetes --> software/environment management

GitLFS --> File storage

Renku client is a dataset and workflow management system

Renku client

Dataset and workflow management system → “renku-python”
Knowledge graph tracking → provenance

Renkulab

User interface with free interactive sessions
GitLab

Renku client is based on a Triplet store (Knowledge graph)

[Diagram: Data, Code, and Result nodes connected by the predicates used_by and generated.]

User interaction with renku client

Automatic triplet generation

Triplet store "Knowledge graph"

User interaction with renku client

KG-endpoint queries

orchestrator defines benchmark scopes

Each benchmark has its own orchestrator

Schedules automatic module updates

"Gate-keeping" - controls addition of new modules
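
The two orchestrator roles above can be sketched as a toy model. This is not the Omnibenchmark implementation: the class, the `reviewed` flag, and the module names are hypothetical, and the stage order is assumed from the data, methods, metrics design shown earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Orchestrator:
    """Toy model of one benchmark's orchestrator."""
    stages: tuple[str, ...] = ("data", "methods", "metrics")
    approved: dict[str, list[str]] = field(default_factory=dict)

    def add_module(self, name: str, stage: str, reviewed: bool) -> bool:
        """Gate-keeping: accept a module only for a known stage and after review."""
        if stage not in self.stages or not reviewed:
            return False
        self.approved.setdefault(stage, []).append(name)
        return True

    def update_plan(self) -> list[str]:
        """Order in which modules are re-run: upstream stages first."""
        return [name for stage in self.stages for name in self.approved.get(stage, [])]


orchestrator = Orchestrator()
orchestrator.add_module("ari_metric", "metrics", reviewed=True)
orchestrator.add_module("pbmc_data", "data", reviewed=True)
rejected = orchestrator.add_module("untested_method", "methods", reviewed=False)
orchestrator.add_module("kmeans_method", "methods", reviewed=True)
plan = orchestrator.update_plan()
```

Keeping the list of approved modules in one place per benchmark is what defines the benchmark's scope; a scheduled job would then walk the update plan.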
