Generic solutions for specific problems: Snakemake, Datavzrd, Bioconda,
and Varlociraptor in 2024
Johannes Köster
GCB 2024
Snakemake
Snakemake
>1 million downloads since 2015
>3000 citations
>12 citations per week in 2023
affiliated
Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J. Sustainable data analysis with Snakemake. F1000Res 10, 33 (2021).
https://snakemake.github.io
dataset
results
dataset
dataset
dataset
dataset
dataset
https://snakemake.github.io
Define workflows
in terms of rules
https://snakemake.github.io
Define workflows
in terms of rules
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
rule myfiltration:
input:
"result/{dataset}.txt"
output:
"result/{dataset}.filtered.txt"
shell:
"mycommand {input} > {output}"
rule aggregate:
input:
"results/dataset1.filtered.txt",
"results/dataset2.filtered.txt"
output:
"plots/myplot.pdf"
script:
"scripts/myplot.R"
https://snakemake.github.io
Define workflows
in terms of rules
Define workflows
in terms of rules
rule mytask:
input:
"data/{sample}.txt"
output:
"result/{sample}.txt"
params:
someparam=0.1
conda:
"envs/sometool.yaml"
shell:
"some-tool {params.someparam} {input} > {output}"
- simple paths
- helpers for branching
and aggregation - Python logic
- shell commands
- ad-hoc scripts and notebooks (Python, R, Julia, Rust, Bash)
- wrappers
https://snakemake.github.io
deployment via
conda, containers
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
rule myfiltration:
input:
"result/{dataset}.txt"
output:
"result/{dataset}.filtered.txt"
shell:
"mycommand {input} > {output}"
rule aggregate:
input:
"results/dataset1.filtered.txt",
"results/dataset2.filtered.txt"
output:
"plots/myplot.pdf"
script:
"scripts/myplot.R"
Automatic inference of DAG of jobs
https://snakemake.github.io
Scalable to any platform
workstation
compute server
cluster
grid computing
cloud computing
https://snakemake.github.io
Self-contained HTML reports
https://snakemake.github.io
Snakemake 8: plugin architecture
https://snakemake.github.io
Simplicity and automation
Object oriented implementation of a few abstract methods
class Executor(RemoteExecutor):
def run_job(self, job: JobExecutorInterface):
...
async def check_active_jobs(
self, active_jobs: List[SubmittedJobInfo]
) -> Generator[SubmittedJobInfo, None, None]:
...
def cancel_jobs(self, active_jobs: List[SubmittedJobInfo]):
...
https://snakemake.github.io
Simplicity and automation
Dataclass for CLI arg registry
@dataclass
class ExecutorSettings(ExecutorSettingsBase):
namespace: str = field(
default="default", metadata={"help": "The namespace to use for submitted jobs."}
)
cpu_scalar: float = field(
default=0.95,
metadata={
"help": "K8s reserves some proportion of available CPUs for its own use. "
"So, where an underlying node may have 8 CPUs, only e.g. 7600 milliCPUs "
"are allocatable to k8s pods (i.e. snakemake jobs). As 8 > 7.6, k8s can't "
"find a node with enough CPU resource to run such jobs. This argument acts "
"as a global scalar on each job's CPU request, so that e.g. a job whose "
"rule definition asks for 8 CPUs will request 7600m CPUs from k8s, "
"allowing it to utilise one entire node. N.B: the job itself would still "
"see the original value, i.e. as the value substituted in {threads}."
},
)
...
https://snakemake.github.io
Simplicity and automation
Poetry template for skeleton code
# Install poetry plugin via
poetry self add poetry-snakemake-plugin
# Create a new poetry project via
poetry new snakemake-executor-plugin-myfancyexecutor
cd snakemake-executor-plugin-myfancyexecutor
# Scaffold the project as a snakemake executor plugin
poetry scaffold-snakemake-executor-plugin
https://snakemake.github.io
Simplicity and automation
Unified test suites for each plugin type
class TestWorkflows(snakemake.common.tests.TestWorkflowsLocalStorageBase):
__test__ = True
def get_executor(self) -> str:
return "slurm"
def get_executor_settings(self) -> Optional[ExecutorSettingsBase]:
return ExecutorSettings()
class TestWorkflowsRequeue(TestWorkflows):
def get_executor_settings(self) -> Optional[ExecutorSettingsBase]:
return ExecutorSettings(requeue=True)
https://snakemake.github.io
Simplicity and automation
Automatic documentation
https://snakemake.github.io
Bioconda
Software installation is heterogeneous
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
easy_install snakemake
./configure --prefix=/usr/local
make
make install
cp lib/amd64/jli/*.so lib
cp lib/amd64/*.so lib
cp * $PREFIX
cpan -i bioperl
cmake ../../my_project \
-DCMAKE_MODULE_PATH=~/devel/seqan/util/cmake \
-DSEQAN_INCLUDE_PATH=~/devel/seqan/include
make
make install
apt-get install bwa
yum install python-h5py
install.packages("matrixpls")
https://bioconda.github.io
Package management with Conda
package:
name: seqtk
version: 1.2
source:
fn: v1.2.tar.gz
url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz
requirements:
build:
- gcc
- zlib
run:
- zlib
about:
home: https://github.com/lh3/seqtk
license: MIT License
summary: Seqtk is a fast and lightweight tool for processing sequences
test:
commands:
- seqtk seq
#!/bin/bash
export C_INCLUDE_PATH=${PREFIX}/include
export LIBRARY_PATH=${PREFIX}/lib
make all
mkdir -p $PREFIX/bin
cp seqtk $PREFIX/bin
- source or binary
- recipe and build script
- package
- install into isolated environments without admin permissions
https://bioconda.github.io
The Bioconda Project
B Grüning, R Dale, A Sjödin, BA Chapman, J Rowe, CH Tomkins-Tinch, R Valieris, ..., J Köster.
Bioconda: Sustainable and Comprehensive Software Distribution for the Life Sciences. Nature Methods 2018.
https://bioconda.github.io
a Conda channel for the life sciences
The Bioconda Project
B Grüning, R Dale, A Sjödin, BA Chapman, J Rowe, CH Tomkins-Tinch, R Valieris, ..., J Köster.
Bioconda: Sustainable and Comprehensive Software Distribution for the Life Sciences. Nature Methods 2018.
https://bioconda.github.io
>52000
commits
>48000
pull requests
>10800
packages
ISMB Proceedings 2023
Software
Scripts
https://bioconda.github.io
Recently added:
ARM support (Apple Silicon and Linux)
Potentially up next:
WASM support
https://bioconda.github.io
Datavzrd
Tables are the central entity
in data analysis
https://datavzrd.github.io
Not always a single table
oncoprint + individual variant calls
differentially expressed genes + expression matrix
https://datavzrd.github.io
Not always just a table
oncoprint + individual variant calls +
differentially expressed genes + expression matrix +
https://datavzrd.github.io
State of the art
Individual tables (tsv, excel) and plots:
- easy to publish
- limited interactivity
- no jumping between corresponding items
Web applications (custom, shiny, ...):
- running server (or local installation)
- implementation overhead
- long-term maintenance is challenging
https://datavzrd.github.io
The problem
Input:
- set of tables
- relations between tables
- set of rendering definitions
Output:
portable interactive visual presentation
https://datavzrd.github.io
datasets:
oscars:
path: "data/oscars.csv"
links:
link to oscar plot:
column: age
view: oscar-plot
link to movie:
column: movie
table-row: movies/Title
movies:
path: "data/movies.csv"
links:
link to oscar entry:
column: Title
table-row: oscars/movie
views:
oscars:
dataset: oscars
desc: |
### All winning oscars beginning in the year 1929.
This table contains *all* winning oscars for best
actress and best actor.
page-size: 25
render-table:
columns:
age:
plot:
ticks:
scale: linear
domain:
- 20
- 100
name:
link-to-url: "https://lmgtfy.app/?q=Is {name} in {movie}?"
movie:
link-to-url: "https://de.wikipedia.org/wiki/{value}"
award:
plot:
heatmap:
scale: ordinal
domain:
- Best actor
- Best actress
range:
- "#add8e6"
- "#ffb6c1"
index(0):
display-mode: hidden
regex('birth_.+'):
display-mode: detail
movies:
dataset: movies
render-table:
columns:
Genre:
ellipsis: 15
imdbID:
link-to-url: "https://www.imdb.com/title/{value}/"
Title:
link-to-url: "https://de.wikipedia.org/wiki/{value}"
imdbRating:
precision: 1
plot:
bars:
scale: linear
domain:
- 1
- 10
Rated:
plot-view-legend: true
plot:
heatmap:
scale: ordinal
color-scheme: accent
oscar-plot:
dataset: oscars
desc: |
## My beautiful oscar scatter plot
*So many great actors and actresses*
render-plot:
spec-path: ".examples/specs/oscars.vl.json"
movies-plot:
dataset: movies
desc: |
All movies with its *runtime* and *ratings* plotted over *time*.
render-plot:
spec-path: ".examples/specs/movies.vl.json"
The solution: Datavzrd
https://datavzrd.github.io
Felix Wiegand, P44
https://datavzrd.github.io
Portability
├── index.html
├── movies
│ ├── index_1.html
│ └── table.js
├── oscars
│ ├── index_1.html
│ └── table.js
├── movies-plot
│ └── index_1.html
├── oscar-plot
│ └── index_1.html
└── static
├── bootstrap.bundle.min.js
├── bootstrap.min.css
├── bootstrap-select.min.css
├── bootstrap-select.min.js
├── bootstrap-table-filter-control.min.js
├── bootstrap-table-fixed-columns.min.css
├── bootstrap-table-fixed-columns.min.js
├── bootstrap-table.min.css
├── bootstrap-table.min.js
├── datavzrd.css
├── jquery.min.js
├── jsonm.min.js
├── lz-string.min.js
├── showdown.min.js
├── vega-embed.min.js
├── vega-lite.min.js
└── vega.min.js
- static HTML, no server needed
- load data via script tags
- javascript compliant compression
- \(\leq\) 30000 rows: in-memory filter/sort/paging
- \(>\) 30000 rows: paging/search projected to filesystem structure
https://datavzrd.github.io
Datavzrd + Snakemake
rule datavzrd:
input:
config="resources/{sample}.datavzrd.yaml",
table="data/{sample}.tsv",
output:
report(
directory("results/datavzrd-report/{sample}"),
htmlindex="index.html",
),
wrapper:
"v4.6.0/utils/datavzrd"
https://datavzrd.github.io
Datavzrd + Snakemake
https://datavzrd.github.io
Varlociraptor
Varlociraptor
Johannes Köster, Louis Dijkstra, Tobias Marschall, Alexander Schönhuth.
Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery. Genome Biology 21, 2020
https://varlociraptor.github.io
Traditional variant calling approach:
detection
genotyping/VAF prediction
Varlociraptor approach:
candidate variant calling
event calling
same tool
any tool
Varlociraptor
Varlociraptor
Johannes Köster, Louis Dijkstra, Tobias Marschall, Alexander Schönhuth.
Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery. Genome Biology 21, 2020
https://varlociraptor.github.io
Varlociraptor
germline: "normal:0.5 | normal:1.0"
somatic_normal: "normal:]0.0,0.5["
somatic_tumor: "normal:0.0 & tumor:]0.0,1.0]"
absent: "normal:0.0 & tumor:0.0
https://varlociraptor.github.io
Varlociraptor model building block
\(\xi_i \sim \text{Bernoulli}(\theta \tau)\)
\(\omega_i \sim Bernoulli(\pi_i)\)
\(Z_i \mid \xi_i, \omega_i=1,\beta,\delta \sim\)
\(\beta, \delta\)
- allele frequency
- sampling bias
- allele uncertainty
- biases/artifacts (strand, orientation, softclip, homopolymer, ...)
- alignment uncertainty
https://varlociraptor.github.io
Variant calling grammar
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
jane:
sex: female
events:
present: "jane:0.5 | jane:1.0"
https://varlociraptor.github.io
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
jane:
sex: female
john:
sex: male
james:
sex: male
inheritance:
mendelian:
mother: jane
father: john
events:
john: "john:0.5 | john:1.0"
jane: "jane:0.5 | jane:1.0"
denovo_james: "(james:0.5 | james:1.0) & !$jane & !$john"
Variant calling grammar
https://varlociraptor.github.io
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
jane:
sex: female
somatic-effective-mutation-rate: 1e-10
tumor:
inheritance:
clonal:
from: jane
contamination:
by: jane
fraction: 0.1
somatic-effective-mutation-rate: 1e-6
events:
germline: "jane:0.5 | jane:1.0"
somatic: "jane:]0.0,0.5["
somatic_tumor_low: "jane:0.0 & tumor:]0.0,0.1["
somatic_tumor_high: "jane:0.0 & tumor:[0.1,1.0]"
Variant calling grammar
https://varlociraptor.github.io
samples:
jane:
sex: female
somatic-effective-mutation-rate: 1e-10
tumor:
inheritance:
clonal:
from: jane
contamination:
by: jane
fraction: 0.1
somatic-effective-mutation-rate: 1e-6
relapse:
inheritance:
clonal:
from: jane
contamination:
by: jane
fraction: 0.2
somatic-effective-mutation-rate: 1e-6
expressions:
somatic_tumor: "jane:0.0 & tumor:]0.0,1.0]"
events:
germline: "jane:0.5 | jane:1.0"
somatic: "jane:]0.0,0.5["
somatic_tumor_no_increase: "$somatic_tumor & l2fc(relapse,tumor) < 1"
somatic_tumor_increase: "$somatic_tumor & l2fc(relapse,tumor) >= 1"
somatic_relapse: "jane:0.0 & tumor:0.0 & relapse:]0.0,1.0]"
Variant calling grammar
https://varlociraptor.github.io
species:
heterozygosity: 0.001
ploidy:
male:
all: 2
X: 1
Y: 1
female:
all: 2
X: 2
Y: 0
samples:
dna_illumina:
sex: female
dna_nanopore:
inheritance:
clonal:
from: dna_illumina
rna_illumina:
universe: [0.0,1.0]
events:
het: "dna_illumina:0.5 & dna_nanopore:0.5 & rna_illumina:]0.0,1.0]"
hom: "dna_illumina:1.0 & dna_nanopore:1.0 & rna_illumina:1.0"
rna_editing: "dna_illumina:0.0 & dna_nanopore:0.0 & rna_illumina:]0.0,1.0]"
Variant calling grammar
https://varlociraptor.github.io
Long read support
https://varlociraptor.github.io
Discovering nucleotide level and structural variants in cancer genome data from second and third generation sequencing technologies. Till Hartmann, PhD thesis 2024
Continuous testing
varlociraptor call variants --testcase-prefix testcase --testcase-locus CHROM:POS generic \
--scenario scenario.yaml --obs tumor=tumor.bcf normal=normal.bcf
Automatic test case generation:
145
public testcases (simulated + real benchmarks)
66
private testcases
(clinical)
https://varlociraptor.github.io
Reporting with Datavzrd
https://varlociraptor.github.io
Beyond just variant calling
Haplotype quantification:
utilizing Varlociraptor's using posterior allele frequency distributions [Hamdiye Uzuner, talk today]
Methylation calling:
joint consideration of methylation and variation, any sequencing technology [Adrian Prinz, P40]
Optical mapping evidence:
extension of statistical model to consider optical mapping label positions to determine SV allele likelihoods [Laura Kühle]
Phased variant impact prediction:
leveraging matching between Varlociraptor observations and variants [Felix Wiegand]
Fusion calling:
using generic breakends [Felix Mölder]
Acknowledgements
Koesterlab
Laura Kühle
David Lähnemann
Felix Mölder
Adrian Prinz
Hamdiye Uzuner
Felix Wiegand
Can Özkan
Andrea Tonk
Till Hartmann (alumn.)
Snakemake Community
Christian Meesters
Michael B. Hall
Filipe G. Vieira
Morten E. Lund
Vanessa Sochat
Alexaner Kanitz
Kim-Phillip Jablonski
Brice Letcher
Michael B. Hall
Chris Tomkins-Tinch
Sven O. Twardziok
Manuel Holtgrewe
+ hundreds of contributors
Bioconda Community
Björn Grüning
Ryan Dale
Marcell Bargull
Brad Chapman
Chris Tomkins-Tinch
Andreas Sjödin
Devon Ryan
Elmar Pruesse
Robert A. Petit III
Christian Brueffer
Alicia Evans
+ hundreds of contributors
Others
Sven Rahmann
Alex Schönhuth
Shirley Liu
Myles Brown
Henry Long
Louis Dijkstra
Tobias Marschall
Marcel Martin
Sven Nahnsen
Alex Schramm
Ina Pretzell
Michael Wessolly
Thomas Herold
Martin Schuler
all users of Snakemake, Bioconda, Datavzrd, and Varlociraptor
Snakemake hackathon 2025
March 10-14 at
Work on:
- plugins
- wrappers
- workflows
- internals
Visit LHC:
And shape the future of the ecosystem!
Registration will open soon, email me to get a notification: johannes.koester@uni-due.de
free lodging sponsored by FAIROS-HEP!
Keynote GCB 2024
By Johannes Köster
Keynote GCB 2024
- 74