Generic solutions for specific problems: Snakemake, Datavzrd, and Varlociraptor in 2025

Johannes Köster

Snakemake

>1 million downloads since 2015

>3000 citations

>14 citations per week in 2023

affiliated

Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J. Sustainable data analysis with Snakemake. F1000Res 10, 33 (2021).

https://snakemake.github.io

Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Define workflows

in terms of rules

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    params:
        someparam=0.1
    conda:
        "envs/sometool.yaml"
    shell:
        "some-tool {params.someparam} {input} > {output}"

simple paths
helpers for branching and aggregation
Python logic

shell commands
ad-hoc scripts and notebooks (Python, R, Julia, Rust, Bash)
wrappers

deployment via conda, containers
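
The placeholders in the shell command of rule mytask (`{params.someparam}`, `{input}`, `{output}`) can be imitated with Python's built-in string formatting. This is only an illustration of the idea, not Snakemake's own code; the sample name `a` and the resulting paths are made-up values for a single job.

```python
from types import SimpleNamespace

# Hypothetical values for one job of rule mytask with sample="a".
params = SimpleNamespace(someparam=0.1)
template = "some-tool {params.someparam} {input} > {output}"

# str.format supports attribute access, so {params.someparam} resolves
# to the attribute of the object passed as `params`.
command = template.format(
    params=params,
    input="data/a.txt",
    output="result/a.txt",
)
print(command)  # some-tool 0.1 data/a.txt > result/a.txt
```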

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Automatic inference of DAG of jobs
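
How a job graph can be derived from such rules is sketched below in plain Python. It is a toy model, not Snakemake's algorithm: the `RULES` table holds simplified copies of the three rules with every intermediate file under `result/`, each rule has exactly one output, and a wildcard is assumed to match any run of characters without `/` or `.`. The helper names `match` and `plan` are invented for this sketch.

```python
import re

# rule name -> (input patterns, output pattern); simplified copies of the rules.
RULES = {
    "mytask": (["path/to/{dataset}.txt"], "result/{dataset}.txt"),
    "myfiltration": (["result/{dataset}.txt"], "result/{dataset}.filtered.txt"),
    "aggregate": (
        ["result/dataset1.filtered.txt", "result/dataset2.filtered.txt"],
        "plots/myplot.pdf",
    ),
}


def match(pattern, path):
    """Return the wildcard values if path matches pattern, else None."""
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>[^/.]+)"
        for i, part in enumerate(parts)
    )
    found = re.fullmatch(regex, path)
    return found.groupdict() if found else None


def plan(target, jobs=None):
    """Collect the (rule, output) jobs needed for target, dependencies first."""
    jobs = [] if jobs is None else jobs
    for name, (inputs, output) in RULES.items():
        wildcards = match(output, target)
        if wildcards is None:
            continue
        # Resolve each input with the same wildcard values and plan it first.
        for pattern in inputs:
            plan(pattern.format(**wildcards), jobs)
        if (name, target) not in jobs:
            jobs.append((name, target))
        break
    # If no rule matched, the target is treated as an existing source file.
    return jobs


for rule, output in plan("plots/myplot.pdf"):
    print(rule, output)
```

Requesting `plots/myplot.pdf` yields five jobs: mytask and myfiltration for each of the two datasets, followed by aggregate.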

Scalable to any platform

workstation

compute server

cluster

grid computing

cloud computing

Self-contained HTML reports

Snakemake 8: plugin architecture

Simplicity and automation

Automatic documentation

Datavzrd

Tables are the central entity in data analysis

https://datavzrd.github.io

Not always a single table

oncoprint + individual variant calls

differentially expressed genes + expression matrix

Not always just a table

oncoprint + individual variant calls +

differentially expressed genes + expression matrix +

State of the art

Individual tables (tsv, excel) and plots:

easy to publish
limited interactivity
no jumping between corresponding items

Web applications (custom, shiny, ...):

running server (or local installation)
implementation overhead
long-term maintenance is challenging

The problem

Input:

set of tables
relations between tables
set of rendering definitions

Output:

portable interactive visual presentation

datasets:
  oscars:
    path: "data/oscars.csv"
    links:
      link to oscar plot:
        column: age
        view: oscar-plot
      link to movie:
        column: movie
        table-row: movies/Title

  movies:
    path: "data/movies.csv"
    links:
      link to oscar entry:
        column: Title
        table-row: oscars/movie

views:
  oscars:
    dataset: oscars
    desc: |
      ### All winning oscars beginning in the year 1929.
      This table contains *all* winning oscars for best
      actress and best actor.
    page-size: 25
    render-table:
      columns:
        age:
          plot:
            ticks:
              scale: linear
              domain:
                - 20
                - 100
        name:
          link-to-url: "https://lmgtfy.app/?q=Is {name} in {movie}?"
        movie:
          link-to-url: "https://de.wikipedia.org/wiki/{value}"
        award:
          plot:
            heatmap:
              scale: ordinal
              domain:
                - Best actor
                - Best actress
              range:
                - "#add8e6"
                - "#ffb6c1"
        index(0):
          display-mode: hidden
        regex('birth_.+'):
          display-mode: detail

  movies:
    dataset: movies
    render-table:
      columns:
        Genre:
          ellipsis: 15
        imdbID:
          link-to-url: "https://www.imdb.com/title/{value}/"
        Title:
          link-to-url: "https://de.wikipedia.org/wiki/{value}"
        imdbRating:
          precision: 1
          plot:
            bars:
              scale: linear
              domain:
                - 1
                - 10
        Rated:
          plot-view-legend: true
          plot:
            heatmap:
              scale: ordinal
              color-scheme: accent

  oscar-plot:
    dataset: oscars
    desc: |
      ## My beautiful oscar scatter plot
      *So many great actors and actresses*
    render-plot:
      spec-path: ".examples/specs/oscars.vl.json"

  movies-plot:
    dataset: movies
    desc: |
      All movies with their *runtime* and *ratings* plotted over *time*.
    render-plot:
      spec-path: ".examples/specs/movies.vl.json"
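
The `table-row` links in the configuration above relate a column of one table to a column of another (for example `movies/Title`). A small consistency check for such links could look like the following Python sketch. It is not part of Datavzrd: the dictionary is an abridged hand-copy of the YAML, the column sets are assumptions taken from the column names the configuration mentions, and the reading that the part before the slash names a table defined in the same configuration is also an assumption.

```python
# Abridged from the YAML above; column sets are assumed, not read from the CSVs.
tables = {
    "oscars": {
        "columns": {"age", "name", "movie", "award"},
        "links": {
            "link to movie": {"column": "movie", "table-row": "movies/Title"},
        },
    },
    "movies": {
        "columns": {"Genre", "imdbID", "Title", "imdbRating", "Rated"},
        "links": {
            "link to oscar entry": {"column": "Title", "table-row": "oscars/movie"},
        },
    },
}


def check_links(tables):
    """Return a list of problems found in table-row links (empty if none)."""
    problems = []
    for name, table in tables.items():
        for label, link in table["links"].items():
            if link["column"] not in table["columns"]:
                problems.append(f"{name}/{label}: unknown source column")
            target, _, column = link["table-row"].partition("/")
            if target not in tables:
                problems.append(f"{name}/{label}: unknown table {target}")
            elif column not in tables[target]["columns"]:
                problems.append(f"{name}/{label}: unknown column {column}")
    return problems


print(check_links(tables))  # []
```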

The solution: Datavzrd

Portability

├── index.html
├── movies
│   ├── index_1.html
│   └── table.js
├── oscars
│   ├── index_1.html
│   └── table.js
├── movies-plot
│   └── index_1.html
├── oscar-plot
│   └── index_1.html
└── static
    ├── bootstrap.bundle.min.js
    ├── bootstrap.min.css
    ├── bootstrap-select.min.css
    ├── bootstrap-select.min.js
    ├── bootstrap-table-filter-control.min.js
    ├── bootstrap-table-fixed-columns.min.css
    ├── bootstrap-table-fixed-columns.min.js
    ├── bootstrap-table.min.css
    ├── bootstrap-table.min.js
    ├── datavzrd.css
    ├── jquery.min.js
    ├── jsonm.min.js
    ├── lz-string.min.js
    ├── showdown.min.js
    ├── vega-embed.min.js
    ├── vega-lite.min.js
    └── vega.min.js

static HTML, no server needed
load data via script tags
JavaScript-compatible compression
\(\leq\) 30000 rows: in-memory filter/sort/paging
\(>\) 30000 rows: paging/search projected to filesystem structure
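
The row-count threshold above can be pictured with a short Python sketch. It only illustrates the decision, under stated assumptions: the function name `page_files` is invented, and the idea that large tables are written as consecutive `index_<n>.html` files is extrapolated from the `index_1.html` entries in the file tree rather than taken from Datavzrd's implementation.

```python
IN_MEMORY_LIMIT = 30_000  # threshold stated above


def page_files(n_rows: int, page_size: int = 25) -> list[str]:
    """Return the HTML files a table with n_rows rows would be split into."""
    if n_rows <= IN_MEMORY_LIMIT:
        # Small table: one page; filter, sort and paging run in the browser.
        return ["index_1.html"]
    # Large table: one file per page (assumed naming), using ceiling division.
    n_pages = -(-n_rows // page_size)
    return [f"index_{i}.html" for i in range(1, n_pages + 1)]


print(len(page_files(100)), len(page_files(30_001, page_size=100)))
```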

Datavzrd + Snakemake

rule datavzrd:
    input:
        config="resources/{sample}.datavzrd.yaml",
        table="data/{sample}.tsv",
    output:
        report(
            directory("results/datavzrd-report/{sample}"),
            htmlindex="index.html",
        ),
    wrapper:
        "v4.6.0/utils/datavzrd"

Varlociraptor

Johannes Köster, Louis Dijkstra, Tobias Marschall, Alexander Schönhuth.
Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery. Genome Biology 21, 2020

https://varlociraptor.github.io

Traditional variant calling approach:

detection and genotyping/VAF prediction (same tool)

Varlociraptor approach:

candidate variant calling (any tool)

event calling

Varlociraptor

germline: "normal:0.5 | normal:1.0"
somatic_normal: "normal:]0.0,0.5["
somatic_tumor: "normal:0.0 & tumor:]0.0,1.0]"
absent: "normal:0.0 & tumor:0.0"

Varlociraptor model building block

\(\xi_i \sim \text{Bernoulli}(\theta \tau)\)

\(\omega_i \sim \text{Bernoulli}(\pi_i)\)

\(Z_i \mid \xi_i, \omega_i=1,\beta,\delta \sim\)

\(\beta, \delta\)

allele frequency

sampling bias

allele uncertainty

biases/artifacts (strand, orientation, softclip, homopolymer, ...)

alignment uncertainty
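
The two Bernoulli draws of the building block can be simulated in a few lines of Python. This is a sketch under assumptions, not Varlociraptor code: \(\theta\) is read as the allele frequency, \(\tau\) as the sampling bias, and \(\pi_i\) as the probability that read \(i\) is correctly aligned; a single value `pi` is used for all reads for simplicity, and the parameter values are arbitrary. The distribution of \(Z_i\) is not simulated.

```python
import random


def simulate(theta: float, tau: float, pi: float, n: int, seed: int = 0):
    """Draw xi_i ~ Bernoulli(theta * tau) and omega_i ~ Bernoulli(pi) for n reads."""
    rng = random.Random(seed)
    # xi_i: does read i stem from the variant allele?
    xi = [rng.random() < theta * tau for _ in range(n)]
    # omega_i: is read i correctly aligned to the locus?
    omega = [rng.random() < pi for _ in range(n)]
    return xi, omega


xi, omega = simulate(theta=0.5, tau=0.9, pi=0.99, n=10_000)
# The empirical fractions should be close to theta * tau and pi, respectively.
print(sum(xi) / len(xi), sum(omega) / len(omega))
```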

Variant calling grammar

species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0

samples:
  jane:
    sex: female

events:
  present: "jane:0.5 | jane:1.0"

species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0

samples:
  jane:
    sex: female
  john:
    sex: male
  james:
    sex: male
    inheritance:
      mendelian:
        mother: jane
        father: john

events:
  john: "john:0.5 | john:1.0"
  jane: "jane:0.5 | jane:1.0"
  denovo_james: "(james:0.5 | james:1.0) & !$jane & !$john"
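
The boolean structure of the trio events can be spelled out in Python. This sketch only evaluates the three formulas on hypothetical exact allele frequencies; it ignores all uncertainty, whereas Varlociraptor assigns probabilities to events. The function names are invented for illustration.

```python
def present(vaf: float) -> bool:
    """True for a heterozygous (0.5) or homozygous (1.0) variant."""
    return vaf in (0.5, 1.0)


def classify(jane: float, john: float, james: float) -> list[str]:
    """Return the names of the events whose formula holds for the given VAFs."""
    events = []
    if present(john):
        events.append("john")
    if present(jane):
        events.append("jane")
    # denovo_james: "(james:0.5 | james:1.0) & !$jane & !$john"
    if present(james) and not present(jane) and not present(john):
        events.append("denovo_james")
    return events


print(classify(jane=0.0, john=0.0, james=0.5))  # ['denovo_james']
print(classify(jane=0.5, john=0.0, james=0.5))  # ['jane']
```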

Variant calling grammar

species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0

samples:
  jane:
    sex: female
    somatic-effective-mutation-rate: 1e-10
  tumor:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.1
    somatic-effective-mutation-rate: 1e-6

events:
  germline: "jane:0.5 | jane:1.0"
  somatic: "jane:]0.0,0.5["
  somatic_tumor_low: "jane:0.0 & tumor:]0.0,0.1["
  somatic_tumor_high: "jane:0.0 & tumor:[0.1,1.0]"
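
The interval notation used in these events follows the convention that an outward-facing bracket excludes the bound, so `]0.0,0.1[` excludes both ends and `[0.1,1.0]` includes both. A minimal Python sketch of parsing and testing such intervals is given below; it is not Varlociraptor's parser, and the function names are invented.

```python
import re


def parse_interval(spec: str):
    """Parse e.g. ']0.0,0.1[' into (low, high, low_open, high_open)."""
    found = re.fullmatch(r"([\[\]])([\d.]+),([\d.]+)([\[\]])", spec)
    if found is None:
        raise ValueError(f"not an interval: {spec}")
    left, low, high, right = found.groups()
    # A leading ']' or trailing '[' marks an open (excluded) bound.
    return float(low), float(high), left == "]", right == "["


def contains(spec: str, vaf: float) -> bool:
    """True if the allele frequency vaf lies inside the interval spec."""
    low, high, low_open, high_open = parse_interval(spec)
    above = vaf > low if low_open else vaf >= low
    below = vaf < high if high_open else vaf <= high
    return above and below


print(contains("]0.0,0.1[", 0.05), contains("]0.0,0.1[", 0.1), contains("[0.1,1.0]", 0.1))
```

With this reading, a tumor allele frequency of exactly 0.1 falls into somatic_tumor_high and not into somatic_tumor_low, so the two events do not overlap.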

Variant calling grammar

samples:
  jane:
    sex: female
    somatic-effective-mutation-rate: 1e-10
  tumor:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.1
    somatic-effective-mutation-rate: 1e-6
  relapse:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.2
    somatic-effective-mutation-rate: 1e-6

expressions:
  somatic_tumor: "jane:0.0 & tumor:]0.0,1.0]"

events:
  germline: "jane:0.5 | jane:1.0"
  somatic: "jane:]0.0,0.5["
  somatic_tumor_no_increase: "$somatic_tumor & l2fc(relapse,tumor) < 1"
  somatic_tumor_increase: "$somatic_tumor & l2fc(relapse,tumor) >= 1"
  somatic_relapse: "jane:0.0 & tumor:0.0 & relapse:]0.0,1.0]"
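
The `l2fc` constraints can be illustrated with a short Python sketch. It assumes that `l2fc(relapse,tumor)` denotes the log2 fold change of the relapse allele frequency over the tumor allele frequency; the allele frequencies below are hypothetical point values, whereas Varlociraptor works with probabilities over them.

```python
import math


def l2fc(vaf_a: float, vaf_b: float) -> float:
    """log2 fold change of allele frequency a over b (assumed reading of l2fc)."""
    return math.log2(vaf_a / vaf_b)


# Hypothetical allele frequencies of one somatic variant (absent in jane).
tumor, relapse = 0.2, 0.5

if l2fc(relapse, tumor) >= 1:
    event = "somatic_tumor_increase"  # frequency at least doubled
else:
    event = "somatic_tumor_no_increase"
print(event)  # somatic_tumor_increase
```

A threshold of 1 on the log2 scale corresponds to a doubling of the allele frequency between tumor and relapse.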

Variant calling grammar

species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0

samples:
  dna_illumina:
    sex: female
  dna_nanopore:
    inheritance:
      clonal:
        from: dna_illumina
  rna_illumina:
    universe: "[0.0,1.0]"

events:
  het: "dna_illumina:0.5 & dna_nanopore:0.5 & rna_illumina:]0.0,1.0]"
  hom: "dna_illumina:1.0 & dna_nanopore:1.0 & rna_illumina:1.0"
  rna_editing: "dna_illumina:0.0 & dna_nanopore:0.0 & rna_illumina:]0.0,1.0]"

Variant calling grammar

Long read support

Discovering nucleotide level and structural variants in cancer genome data from second and third generation sequencing technologies. Till Hartmann, PhD thesis 2024

Continuous testing

Automatic test case generation:

varlociraptor call variants --testcase-prefix testcase --testcase-locus CHROM:POS generic \
  --scenario scenario.yaml --obs tumor=tumor.bcf normal=normal.bcf

145

public testcases (simulated + real benchmarks)

private testcases (clinical)

Reporting with Datavzrd

Beyond just variant calling

Haplotype quantification:

utilizing Varlociraptor's posterior allele frequency distributions for virus variant quantification and HLA typing

Methylation calling:

joint consideration of methylation and variation, any sequencing technology

CNV calling:

joint consideration of depth and alignment evidence for arbitrary scenarios

Optical mapping evidence:

extension of statistical model to consider optical mapping label positions to determine SV allele likelihoods

Phased variant impact prediction:

leveraging matching between Varlociraptor observations and variants

Fusion calling:

thorough statistical approach for fusion calling on DNA, RNA, short and long reads

Acknowledgements

Koesterlab

Laura Kühle

David Lähnemann

Felix Mölder

Adrian Prinz

Hamdiye Uzuner

Felix Wiegand

Can Özkan

Andrea Tonk

Till Hartmann (alumnus)

Snakemake Community

Christian Meesters

Michael B. Hall

Filipe G. Vieira

Morten E. Lund

Vanessa Sochat

Alexander Kanitz

Kim Philipp Jablonski

Brice Letcher

Chris Tomkins-Tinch

Sven O. Twardziok

Manuel Holtgrewe

+ hundreds of contributors

Others

Sven Rahmann

Alex Schönhuth

Shirley Liu

Myles Brown

Henry Long

Louis Dijkstra

Tobias Marschall

Marcel Martin

Sven Nahnsen

Alex Schramm

Ina Pretzell

Michael Wessolly

Thomas Herold

Martin Schuler

And thanks to all the users and contributors
