Generic solutions for specific problems: Snakemake, Datavzrd, and Varlociraptor in 2025
Johannes Köster

Snakemake
Snakemake
>1 million downloads since 2015
>3000 citations
>14 citations per week in 2023
affiliated


Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J. Sustainable data analysis with Snakemake. F1000Res 10, 33 (2021).
https://snakemake.github.io
dataset
results
dataset
dataset
dataset
dataset
dataset
https://snakemake.github.io
Define workflows
in terms of rules
https://snakemake.github.io
Define workflows
in terms of rules
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
rule myfiltration:
input:
"result/{dataset}.txt"
output:
"result/{dataset}.filtered.txt"
shell:
"mycommand {input} > {output}"
rule aggregate:
input:
"results/dataset1.filtered.txt",
"results/dataset2.filtered.txt"
output:
"plots/myplot.pdf"
script:
"scripts/myplot.R"
https://snakemake.github.io
Define workflows
in terms of rules
Define workflows
in terms of rules
rule mytask:
input:
"data/{sample}.txt"
output:
"result/{sample}.txt"
params:
someparam=0.1
conda:
"envs/sometool.yaml"
shell:
"some-tool {params.someparam} {input} > {output}"
- simple paths
- helpers for branching
and aggregation - Python logic
- shell commands
- ad-hoc scripts and notebooks (Python, R, Julia, Rust, Bash)
- wrappers
https://snakemake.github.io
deployment via
conda, containers
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
rule myfiltration:
input:
"result/{dataset}.txt"
output:
"result/{dataset}.filtered.txt"
shell:
"mycommand {input} > {output}"
rule aggregate:
input:
"results/dataset1.filtered.txt",
"results/dataset2.filtered.txt"
output:
"plots/myplot.pdf"
script:
"scripts/myplot.R"
Automatic inference of DAG of jobs
https://snakemake.github.io
Scalable to any platform



workstation
compute server
cluster






grid computing
cloud computing

https://snakemake.github.io
Self-contained HTML reports
https://snakemake.github.io
Snakemake 8: plugin architecture
https://snakemake.github.io
Simplicity and automation
Automatic documentation
https://snakemake.github.io
Datavzrd
Tables are the central entity
in data analysis


https://datavzrd.github.io
Not always a single table
oncoprint + individual variant calls
differentially expressed genes + expression matrix
https://datavzrd.github.io
Not always just a table
oncoprint + individual variant calls +
differentially expressed genes + expression matrix +

https://datavzrd.github.io
State of the art
Individual tables (tsv, excel) and plots:
- easy to publish
- limited interactivity
- no jumping between corresponding items
Web applications (custom, shiny, ...):
- running server (or local installation)
- implementation overhead
- long-term maintenance is challenging
https://datavzrd.github.io
The problem
Input:
- set of tables
- relations between tables
- set of rendering definitions
Output:
portable interactive visual presentation
https://datavzrd.github.io
datasets:
oscars:
path: "data/oscars.csv"
links:
link to oscar plot:
column: age
view: oscar-plot
link to movie:
column: movie
table-row: movies/Title
movies:
path: "data/movies.csv"
links:
link to oscar entry:
column: Title
table-row: oscars/movie
views:
oscars:
dataset: oscars
desc: |
### All winning oscars beginning in the year 1929.
This table contains *all* winning oscars for best
actress and best actor.
page-size: 25
render-table:
columns:
age:
plot:
ticks:
scale: linear
domain:
- 20
- 100
name:
link-to-url: "https://lmgtfy.app/?q=Is {name} in {movie}?"
movie:
link-to-url: "https://de.wikipedia.org/wiki/{value}"
award:
plot:
heatmap:
scale: ordinal
domain:
- Best actor
- Best actress
range:
- "#add8e6"
- "#ffb6c1"
index(0):
display-mode: hidden
regex('birth_.+'):
display-mode: detail
movies:
dataset: movies
render-table:
columns:
Genre:
ellipsis: 15
imdbID:
link-to-url: "https://www.imdb.com/title/{value}/"
Title:
link-to-url: "https://de.wikipedia.org/wiki/{value}"
imdbRating:
precision: 1
plot:
bars:
scale: linear
domain:
- 1
- 10
Rated:
plot-view-legend: true
plot:
heatmap:
scale: ordinal
color-scheme: accent
oscar-plot:
dataset: oscars
desc: |
## My beautiful oscar scatter plot
*So many great actors and actresses*
render-plot:
spec-path: ".examples/specs/oscars.vl.json"
movies-plot:
dataset: movies
desc: |
All movies with its *runtime* and *ratings* plotted over *time*.
render-plot:
spec-path: ".examples/specs/movies.vl.json"

The solution: Datavzrd
https://datavzrd.github.io
https://datavzrd.github.io
Portability
├── index.html
├── movies
│ ├── index_1.html
│ └── table.js
├── oscars
│ ├── index_1.html
│ └── table.js
├── movies-plot
│ └── index_1.html
├── oscar-plot
│ └── index_1.html
└── static
├── bootstrap.bundle.min.js
├── bootstrap.min.css
├── bootstrap-select.min.css
├── bootstrap-select.min.js
├── bootstrap-table-filter-control.min.js
├── bootstrap-table-fixed-columns.min.css
├── bootstrap-table-fixed-columns.min.js
├── bootstrap-table.min.css
├── bootstrap-table.min.js
├── datavzrd.css
├── jquery.min.js
├── jsonm.min.js
├── lz-string.min.js
├── showdown.min.js
├── vega-embed.min.js
├── vega-lite.min.js
└── vega.min.js
- static HTML, no server needed
- load data via script tags
- javascript compliant compression
- \(\leq\) 30000 rows: in-memory filter/sort/paging
- \(>\) 30000 rows: paging/search projected to filesystem structure
https://datavzrd.github.io
Datavzrd + Snakemake
rule datavzrd:
input:
config="resources/{sample}.datavzrd.yaml",
table="data/{sample}.tsv",
output:
report(
directory("results/datavzrd-report/{sample}"),
htmlindex="index.html",
),
wrapper:
"v4.6.0/utils/datavzrd"
https://datavzrd.github.io
Datavzrd + Snakemake

https://datavzrd.github.io
Varlociraptor
Varlociraptor
Johannes Köster, Louis Dijkstra, Tobias Marschall, Alexander Schönhuth.
Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery. Genome Biology 21, 2020
https://varlociraptor.github.io
Traditional variant calling approach:
detection
genotyping/VAF prediction
Varlociraptor approach:
candidate variant calling
event calling
same tool
any tool
Varlociraptor
Varlociraptor
Johannes Köster, Louis Dijkstra, Tobias Marschall, Alexander Schönhuth.
Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery. Genome Biology 21, 2020
https://varlociraptor.github.io
Varlociraptor
germline: "normal:0.5 | normal:1.0"
somatic_normal: "normal:]0.0,0.5["
somatic_tumor: "normal:0.0 & tumor:]0.0,1.0]"
absent: "normal:0.0 & tumor:0.0
https://varlociraptor.github.io
Varlociraptor model building block
\(\xi_i \sim \text{Bernoulli}(\theta \tau)\)
\(\omega_i \sim Bernoulli(\pi_i)\)
\(Z_i \mid \xi_i, \omega_i=1,\beta,\delta \sim\)

\(\beta, \delta\)
- allele frequency
- sampling bias
- allele uncertainty
- biases/artifacts (strand, orientation, softclip, homopolymer, ...)
- alignment uncertainty
https://varlociraptor.github.io
Variant calling grammar
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
jane:
sex: female
events:
present: "jane:0.5 | jane:1.0"
https://varlociraptor.github.io
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
jane:
sex: female
john:
sex: male
james:
sex: male
inheritance:
mendelian:
mother: jane
father: john
events:
john: "john:0.5 | john:1.0"
jane: "jane:0.5 | jane:1.0"
denovo_james: "(james:0.5 | james:1.0) & !$jane & !$john"
Variant calling grammar
https://varlociraptor.github.io
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
jane:
sex: female
somatic-effective-mutation-rate: 1e-10
tumor:
inheritance:
clonal:
from: jane
contamination:
by: jane
fraction: 0.1
somatic-effective-mutation-rate: 1e-6
events:
germline: "jane:0.5 | jane:1.0"
somatic: "jane:]0.0,0.5["
somatic_tumor_low: "jane:0.0 & tumor:]0.0,0.1["
somatic_tumor_high: "jane:0.0 & tumor:[0.1,1.0]"
Variant calling grammar
https://varlociraptor.github.io
samples:
jane:
sex: female
somatic-effective-mutation-rate: 1e-10
tumor:
inheritance:
clonal:
from: jane
contamination:
by: jane
fraction: 0.1
somatic-effective-mutation-rate: 1e-6
relapse:
inheritance:
clonal:
from: jane
contamination:
by: jane
fraction: 0.2
somatic-effective-mutation-rate: 1e-6
expressions:
somatic_tumor: "jane:0.0 & tumor:]0.0,1.0]"
events:
germline: "jane:0.5 | jane:1.0"
somatic: "jane:]0.0,0.5["
somatic_tumor_no_increase: "$somatic_tumor & l2fc(relapse,tumor) < 1"
somatic_tumor_increase: "$somatic_tumor & l2fc(relapse,tumor) >= 1"
somatic_relapse: "jane:0.0 & tumor:0.0 & relapse:]0.0,1.0]"
Variant calling grammar
https://varlociraptor.github.io
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
dna_illumina:
sex: female
dna_nanopore:
inheritance:
clonal:
from: dna_illumina
rna_illumina:
universe: [0.0,1.0]
events:
het: "dna_illumina:0.5 & dna_nanopore:0.5 & rna_illumina:]0.0,1.0]"
hom: "dna_illumina:1.0 & dna_nanopore:1.0 & rna_illumina:1.0"
rna_editing: "dna_illumina:0.0 & dna_nanopore:0.0 & rna_illumina:]0.0,1.0]"
Variant calling grammar
https://varlociraptor.github.io
Long read support
https://varlociraptor.github.io


Discovering nucleotide level and structural variants in cancer genome data from second and third generation sequencing technologies. Till Hartmann, PhD thesis 2024
Continuous testing
varlociraptor call variants --testcase-prefix testcase --testcase-locus CHROM:POS generic \
--scenario scenario.yaml --obs tumor=tumor.bcf normal=normal.bcf
Automatic test case generation:
145
public testcases (simulated + real benchmarks)
66
private testcases
(clinical)
https://varlociraptor.github.io

Reporting with Datavzrd
https://varlociraptor.github.io
Beyond just variant calling
Haplotype quantification:
utilizing Varlociraptor's posterior allele frequency distributions for virus variant quantification and HLA typing
Methylation calling:
joint consideration of methylation and variation, any sequencing technology
CNV calling:
joint consideration of depth and alignment evidence for arbitray scenarios
Optical mapping evidence:
extension of statistical model to consider optical mapping label positions to determine SV allele likelihoods
Phased variant impact prediction:
leveraging matching between Varlociraptor observations and variants
Fusion calling:
thorough statistical approach for fusion calling on DNA, RNA, short and long reads
Acknowledgements
Koesterlab
Laura Kühle
David Lähnemann
Felix Mölder
Adrian Prinz
Hamdiye Uzuner
Felix Wiegand
Can Özkan
Andrea Tonk
Till Hartmann (alumn.)
Snakemake Community
Christian Meesters
Michael B. Hall
Filipe G. Vieira
Morten E. Lund
Vanessa Sochat
Alexaner Kanitz
Kim-Phillip Jablonski
Brice Letcher
Michael B. Hall
Chris Tomkins-Tinch
Sven O. Twardziok
Manuel Holtgrewe
+ hundreds of contributors
Others
Sven Rahmann
Alex Schönhuth
Shirley Liu
Myles Brown
Henry Long
Louis Dijkstra
Tobias Marschall
Marcel Martin
Sven Nahnsen
Alex Schramm
Ina Pretzell
Michael Wessolly
Thomas Herold
Martin Schuler
And thanks to all the users and contributors

Copy of Invited talk at Schwarz lab
By Johannes Köster
Copy of Invited talk at Schwarz lab
- 127