Transparent, adaptable, and reproducible data analyses for oncology - Snakemake, Datavzrd, Vembrane and beyond

Johannes Köster

2026

University of Duisburg-Essen

Data analysis

Reproducibility:

check computational validity
apply same analysis to new data

Transparency:

check methodological validity
understand analysis

Adaptability:

modify analysis
extend analysis

>1.6 million downloads since 2015

>3000 citations

>14 citations per week in 2024

Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "results/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
    input:
        "results/{dataset}.txt"
    output:
        "results/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Define workflows

in terms of rules

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:
        "some-tool {input} > {output}"

rule name

how to create output from input

define:

input
output
log files
parameters
resources

Automatic inference of DAG of jobs
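
The idea behind DAG inference can be sketched in plain Python: a requested file is matched against the output patterns of all rules, and the inputs of the matching rule are resolved recursively. This is a simplified illustration, not Snakemake's implementation. The rule table, the `match` and `infer_jobs` helpers, and the restriction of wildcards to characters other than `/` and `.` are assumptions made for this sketch; it also assumes each wildcard name occurs at most once per pattern.

```python
import re

# Hypothetical rule table modeled on the three rules shown on the slides.
RULES = {
    "mytask": {
        "input": ["path/to/{dataset}.txt"],
        "output": "results/{dataset}.txt",
    },
    "myfiltration": {
        "input": ["results/{dataset}.txt"],
        "output": "results/{dataset}.filtered.txt",
    },
    "aggregate": {
        "input": [
            "results/dataset1.filtered.txt",
            "results/dataset2.filtered.txt",
        ],
        "output": "plots/myplot.pdf",
    },
}


def match(pattern, path):
    """Return the wildcard values if `path` matches `pattern`, else None."""
    regex, pos = "", 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        # Literal text is escaped; each wildcard becomes a named group.
        regex += re.escape(pattern[pos:m.start()])
        regex += f"(?P<{m.group(1)}>[^/.]+)"
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, path)
    return found.groupdict() if found else None


def infer_jobs(target, jobs=None):
    """Collect (rule, wildcards) jobs needed for `target`, dependencies first."""
    jobs = [] if jobs is None else jobs
    for name, rule in RULES.items():
        wildcards = match(rule["output"], target)
        if wildcards is None:
            continue
        # Resolve every input of the matching rule before the rule itself.
        for inp in rule["input"]:
            infer_jobs(inp.format(**wildcards), jobs)
        job = (name, wildcards)
        if job not in jobs:
            jobs.append(job)
        break
    # Targets that no rule produces are treated as existing source files.
    return jobs


for rule_name, wildcards in infer_jobs("plots/myplot.pdf"):
    print(rule_name, wildcards)
```

Requesting `plots/myplot.pdf` yields one `mytask` and one `myfiltration` job per dataset, followed by a single `aggregate` job.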

Boilerplate-free integration of scripts

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"

reusable scripts:

Python
R
Julia
Rust
Bash
Xonsh
Hy
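
A Python script referenced by the `script:` directive might look like the following minimal sketch. Snakemake makes an object named `snakemake` available inside such scripts, giving access to the rule's input and output files. The function `filter_lines` and its behavior (dropping blank lines) are hypothetical and only serve as a placeholder for real analysis code.

```python
from pathlib import Path


def filter_lines(in_path, out_path, min_length=1):
    """Copy lines with at least `min_length` non-whitespace characters."""
    kept = [
        line
        for line in Path(in_path).read_text().splitlines()
        if len(line.strip()) >= min_length
    ]
    Path(out_path).write_text("\n".join(kept) + "\n")
    return len(kept)


# Inside a Snakemake job the name `snakemake` is defined and carries the
# rule's input and output paths; outside Snakemake this branch is skipped,
# so the function can also be imported and tested on its own.
if "snakemake" in globals():
    filter_lines(snakemake.input[0], snakemake.output[0])
```

Keeping the logic in a function and the Snakemake access in a guarded block is a design choice of this sketch, not a requirement.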

Reusable wrappers

rule sort_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"

reusable wrappers from a central repository

Self-contained HTML reports

Plugins

Many more features

dynamic DAG rewiring
service jobs (providing sockets, loading databases, or ramdisks)
semantic helper functions for minimizing boilerplate code
fallible rules
caching of shared results across workflows
transparent handling of remote storage

Tables are the central entity in data analysis

https://datavzrd.github.io

Not always a single table

oncoprint + individual variant calls

differentially expressed genes + expression matrix

Not always just a table

oncoprint + individual variant calls +

differentially expressed genes + expression matrix +

State of the art

Individual tables (tsv, excel) and plots:

easy to publish
limited interactivity
no jumping between corresponding items

Web applications (custom, shiny, ...):

running server (or local installation)
implementation overhead
long-term maintenance is challenging

The problem

Input:

set of tables
relations between tables
set of rendering definitions

Output:

portable interactive visual presentation

Datavzrd

Datavzrd + Snakemake

rule datavzrd:
    input:
        config="resources/{sample}.datavzrd.yaml",
        table="data/{sample}.tsv",
    output:
        report(
            directory("results/datavzrd-report/{sample}"),
            htmlindex="index.html",
        ),
    wrapper:
        "v4.6.0/utils/datavzrd"

Vembrane

harmonize

variant filtration
variant prioritization
variant exchange (FHIR)

https://vembrane.github.io

pathogenic/risk-factor/drug-response (vembrane filter)

(not {'risk_factor', 'pathogenic', 'drug_response'}.isdisjoint(ANN['CLIN_SIG'])) and
(ANN['IMPACT'] in {'LOW', 'MODERATE', 'HIGH'}) and
(ANN['REVEL'] is NA or ANN['REVEL'] >= 0.5)

relevant VAFs (vembrane filter)

(FORMAT['AF']['tumor'] >= 0.05 and (ANN['gnomADg_AF'] is NA or ANN['gnomADg_AF'] < 0.01))
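
Because vembrane expressions are Python expressions, the two filters above can be tried out in plain Python. The following is an emulation under stated assumptions, not vembrane itself: the `NA` object is a hypothetical stand-in for vembrane's missing value, the function names are invented for this sketch, and the annotation dictionaries are made-up records.

```python
class _Missing:
    """Hypothetical stand-in for vembrane's NA (missing value)."""

    def __repr__(self):
        return "NA"


NA = _Missing()


def is_clinically_relevant(ANN):
    # Same logic as the first filter: a relevant clinical significance term,
    # a non-modifier impact, and a REVEL score that is missing or >= 0.5.
    return (
        not {'risk_factor', 'pathogenic', 'drug_response'}.isdisjoint(ANN['CLIN_SIG'])
        and ANN['IMPACT'] in {'LOW', 'MODERATE', 'HIGH'}
        and (ANN['REVEL'] is NA or ANN['REVEL'] >= 0.5)
    )


def has_relevant_vaf(FORMAT, ANN):
    # Same logic as the second filter: tumor allele frequency of at least 5%
    # and a population frequency that is missing or below 1%.
    return FORMAT['AF']['tumor'] >= 0.05 and (
        ANN['gnomADg_AF'] is NA or ANN['gnomADg_AF'] < 0.01
    )


# Made-up example annotations.
pathogenic = {'CLIN_SIG': ['pathogenic'], 'IMPACT': 'HIGH',
              'REVEL': NA, 'gnomADg_AF': 0.0001}
benign = {'CLIN_SIG': ['benign'], 'IMPACT': 'MODERATE',
          'REVEL': 0.9, 'gnomADg_AF': 0.2}
low_score = {'CLIN_SIG': ['drug_response', 'benign'], 'IMPACT': 'LOW',
             'REVEL': 0.1, 'gnomADg_AF': NA}

for name, ann in [("pathogenic", pathogenic), ("benign", benign), ("low_score", low_score)]:
    print(name, is_clinically_relevant(ann))
```

The `is NA or ...` pattern keeps variants whose score is missing instead of failing on a comparison with a missing value.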

sort by descending relevance (vembrane sort)

desc(quantize(max(FORMAT['AF'][sample] for sample in SAMPLES), 1 / 3)), \
desc(ANN["REVEL"]), \
desc(FORMAT['AF']['tumor'])
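
The effect of such a multi-key sort can be illustrated in plain Python. This is a sketch under assumptions, not vembrane's implementation: `quantize` is a hypothetical stand-in that floors a value to a multiple of the step (vembrane's own function may bin differently), `desc(...)` is emulated by negating each key, and the variants are made-up records.

```python
import math

SAMPLES = ["tumor", "relapse"]


def quantize(value, step):
    """Hypothetical binning: floor `value` to a multiple of `step`."""
    return math.floor(value / step) * step


def sort_key(variant):
    FORMAT, ANN = variant["FORMAT"], variant["ANN"]
    # Negating each component emulates desc(...) under Python's ascending sort.
    return (
        -quantize(max(FORMAT['AF'][sample] for sample in SAMPLES), 1 / 3),
        -ANN['REVEL'],
        -FORMAT['AF']['tumor'],
    )


# Made-up variants: "a" and "c" fall into the same allele frequency bin,
# so their order is decided by the REVEL score.
variants = [
    {"id": "a", "FORMAT": {"AF": {"tumor": 0.10, "relapse": 0.20}}, "ANN": {"REVEL": 0.9}},
    {"id": "b", "FORMAT": {"AF": {"tumor": 0.70, "relapse": 0.80}}, "ANN": {"REVEL": 0.6}},
    {"id": "c", "FORMAT": {"AF": {"tumor": 0.25, "relapse": 0.05}}, "ANN": {"REVEL": 0.7}},
]

ranked = [v["id"] for v in sorted(variants, key=sort_key)]
print(ranked)
```

Binning the allele frequency first means that small frequency differences do not override the pathogenicity score; only variants in a clearly higher frequency bin move ahead.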

Adaptable and unified small-variant/structural-variant/fusion calling for any scenario

Snakemake + Datavzrd + Vembrane + Varlociraptor =

reference pipeline project

samples:
  jane:
    sex: female
    somatic-effective-mutation-rate: 1e-10
  tumor:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.1
    somatic-effective-mutation-rate: 1e-6
  relapse:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.2
    somatic-effective-mutation-rate: 1e-6

expressions:
  somatic_tumor: "jane:0.0 & tumor:]0.0,1.0]"

events:
  germline: "jane:0.5 | jane:1.0"
  somatic: "jane:]0.0,0.5["
  somatic_tumor_no_increase: "$somatic_tumor & l2fc(relapse,tumor) < 1"
  somatic_tumor_increase: "$somatic_tumor & l2fc(relapse,tumor) >= 1"
  somatic_relapse: "jane:0.0 & tumor:0.0 & relapse:]0.0,1.0]"
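
To make the event formulas above more tangible, the following Python sketch assigns an event name to a triple of allele frequencies. It is a deliberate simplification: Varlociraptor computes posterior probabilities for the events from the read data and does not classify point estimates like this. The function `classify`, the label `other`, and the reading of `l2fc(relapse,tumor)` as log2 of the relapse frequency divided by the tumor frequency are assumptions of this sketch.

```python
import math


def classify(jane, tumor, relapse):
    """Map allele frequencies of the three samples to an event name."""
    if jane in (0.5, 1.0):
        return "germline"
    if 0.0 < jane < 0.5:
        return "somatic"
    if jane == 0.0 and 0.0 < tumor <= 1.0:
        # Assumed meaning of l2fc(relapse,tumor): log2 fold change of the
        # relapse allele frequency relative to the tumor allele frequency.
        l2fc = math.log2(relapse / tumor) if relapse > 0.0 else -math.inf
        if l2fc >= 1:
            return "somatic_tumor_increase"
        return "somatic_tumor_no_increase"
    if jane == 0.0 and tumor == 0.0 and 0.0 < relapse <= 1.0:
        return "somatic_relapse"
    # Not covered by any event in the scenario, e.g. absent in all samples.
    return "other"


print(classify(0.5, 0.5, 0.5))
print(classify(0.0, 0.1, 0.4))
print(classify(0.0, 0.2, 0.3))
print(classify(0.0, 0.0, 0.3))
```

Under this reading, a log2 fold change of at least 1 means the allele frequency at least doubled between tumor and relapse.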

https://varlociraptor.github.io

varlociraptor

Dual use in research and clinics

Conclusion

Snakemake + Datavzrd

generic, transparent, human-readable, adaptable, portable, and scalable data analysis and visualization

Vembrane

Unified DSL for filtering, sorting and transforming genomic variants/fusions/CNVs

Varlociraptor + Pipeline

unified, adaptable small/structural variant calling for any scenario
