Transparency, reproducibility and the democratization of an ecosystem - the benefits of Snakemake 8
Johannes Köster
2023
Data analysis
[Figure: a dataset is analyzed to produce results, first for one dataset, then for many. Caption: "Let me do that by hand..."]
>970k downloads since 2015
>2700 citations
>11 citations per week in 2023
Define workflows in terms of rules
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"

rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"

rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"
Define workflows in terms of rules
rule mytask:  # rule name
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:  # how to create output from input
        "some-tool {input} > {output}"

A rule can define:
- input
- output
- log files
- parameters
- resources
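The `{sample}` wildcard in the rule above is filled in by matching a requested output file against the output pattern and reusing the matched value for the input. The following is a minimal sketch of that idea in plain Python; `pattern_to_regex` and `resolve` are hypothetical helper names, each wildcard is assumed to occur only once per pattern, and this is not Snakemake's actual implementation.

```python
import re


def pattern_to_regex(pattern):
    """Turn a pattern such as 'result/{sample}.txt' into a compiled regex
    with one named group per wildcard (simplified: no wildcard constraints)."""
    # re.split with one capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)      # literal text
        else:
            regex += f"(?P<{part}>.+)"    # wildcard name
    return re.compile(regex)


def resolve(target, output_pattern, input_pattern):
    """Return the input file needed for `target`, or None if the rule
    cannot produce `target`."""
    match = pattern_to_regex(output_pattern).fullmatch(target)
    if match is None:
        return None
    return input_pattern.format(**match.groupdict())


print(resolve("result/A.txt", "result/{sample}.txt", "data/{sample}.txt"))
```

Requesting `result/A.txt` therefore implies `data/A.txt` as input, while a target that does not match the output pattern is simply not handled by this rule.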
Boilerplate-free integration of scripts
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"
reusable scripts:
- Python
- R
- Julia
- Rust
- Bash
Python:
    import pandas as pd

    # Read the rule's first input, sort by id, write to the rule's first output.
    data = pd.read_table(snakemake.input[0])
    data = data.sort_values("id")
    data.to_csv(snakemake.output[0], sep="\t")

R:
    # header = TRUE so that the id column can be addressed by name.
    data <- read.table(snakemake@input[[1]], header = TRUE)
    data <- data[order(data$id),]
    write.table(data, file = snakemake@output[[1]])

Rust (schematic, as shown on the slide):
    import polar as pl
    pl.read_csv(&snakemake.input[0])
        .sort()
        .to_csv(&snakemake.output[0])
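Because a script only touches the `snakemake` object through attributes such as `input` and `output`, its logic can be tried outside a workflow with a stand-in object. The sketch below assumes nothing beyond the standard library: `sort_table` is a hypothetical function, and the `SimpleNamespace` stand-in only mimics the two attributes used here, not the real object Snakemake injects.

```python
import csv
import tempfile
from pathlib import Path
from types import SimpleNamespace


def sort_table(snakemake):
    """Sort a tab-separated table by its integer 'id' column, reading from
    snakemake.input[0] and writing to snakemake.output[0]."""
    with open(snakemake.input[0], newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    rows.sort(key=lambda row: int(row["id"]))
    with open(snakemake.output[0], "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical stand-in for the object that is available inside a script.
workdir = Path(tempfile.mkdtemp())
infile = workdir / "A.txt"
outfile = workdir / "A.sorted.txt"
infile.write_text("id\tvalue\n3\tc\n1\ta\n2\tb\n")

fake_snakemake = SimpleNamespace(input=[str(infile)], output=[str(outfile)])
sort_table(fake_snakemake)

# Collect the id column of the written table, skipping the header line.
sorted_ids = [line.split("\t")[0] for line in outfile.read_text().splitlines()[1:]]
print(sorted_ids)
```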
Jupyter notebook integration
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    notebook:
        "notebooks/mynotebook.ipynb"
- Integrated interactive edit mode.
- Automatic generalization for reuse in other jobs.
Reusable wrappers
rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"

Reusable wrappers from a central repository
Output handling
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        temp("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

Output handling
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        protected("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

Output handling
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        pipe("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"
New in Snakemake 8: semantic helper functions
rule mytask:
    input:
        "data/counts.tsv"
    output:
        "results/diffexp/{model}.tsv"
    params:
        fdr=lookup(dpath="diffexp/{model}/fdr", within=config)
    script:
        "scripts/diffexp.R"
New in Snakemake 8: semantic helper functions
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "results/{sample}.txt"
    params:
        sample_props=lookup(query="sample == '{sample}'", within=samples)
    script:
        "scripts/somestep.py"
New in Snakemake 8: semantic helper functions
rule mytask:
    input:
        branch(
            lookup(dpath="switches/mark_duplicates", within=config),
            then="results/mapped/{sample}.bam",
            otherwise="results/mapped/{sample}.rmdup.bam",
        )
    output:
        "results/{sample}.txt"
    script:
        "scripts/somestep.py"
Using and combining workflows
configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        github("snakemake-workflows/dna-seq-gatk-variant-calling", path="workflow/Snakefile", tag="v1.17.0")
    config:
        config

use rule * from dna_seq

# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"
Using and combining workflows
configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        github("snakemake-workflows/dna-seq-gatk-variant-calling", path="workflow/Snakefile", tag="v1.17.0")
    config:
        config["dna-seq"]

use rule * from dna_seq as dna_seq_*

# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"

module rna_seq:
    snakefile:
        github("snakemake-workflows/rna-seq-kallisto-sleuth", path="workflow/Snakefile", tag="v1.0.0")
    config:
        config["rna-seq"]

use rule * from rna_seq as rna_seq_*
Scheduling
[Figure: job selection, annotated with the following terms]
- job priority
- job thread usage
- job resource usage
- free resources
- job temp file consumption
- temp file size
- temp file lifetime fraction
- temp file deletion
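To make the interplay of priority, thread usage, and free resources concrete, here is a deliberately simplified greedy selection in Python. It is a stand-in for illustration only: it ignores the temp file terms listed above, `select_jobs` and the job dictionaries are hypothetical, and it should not be read as the algorithm Snakemake actually uses.

```python
def select_jobs(ready_jobs, free_cores):
    """Pick jobs in order of descending priority (ties broken by name)
    as long as their thread usage fits into the remaining free cores."""
    selected = []
    for job in sorted(ready_jobs, key=lambda j: (-j["priority"], j["name"])):
        if job["threads"] <= free_cores:
            selected.append(job["name"])
            free_cores -= job["threads"]
    return selected


ready_jobs = [
    {"name": "map_A", "threads": 4, "priority": 0},
    {"name": "map_B", "threads": 4, "priority": 0},
    {"name": "plot", "threads": 1, "priority": 10},
]
selection = select_jobs(ready_jobs, free_cores=6)
print(selection)
```

With six free cores, the high-priority `plot` job and one of the two mapping jobs fit; the second mapping job has to wait until cores are released.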
DAG partitioning
--groups a=g1 b=g1
--groups a=g1 b=g1 --group-components g1=2
--groups a=g1 b=g1 --group-components g1=5
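The effect of the options above can be pictured as follows, assuming that `--groups` puts the connected jobs of rules `a` and `b` into one group job per connected component of the DAG, and that `--group-components g1=N` merges N such components into a single submission. The sketch uses a hypothetical `batch_components` helper and made-up job names for five samples.

```python
from itertools import chain


def batch_components(components, per_submission):
    """Merge connected components (lists of job names) into submissions
    that each contain up to `per_submission` components."""
    batches = []
    for start in range(0, len(components), per_submission):
        chunk = components[start:start + per_submission]
        batches.append(list(chain.from_iterable(chunk)))
    return batches


# One connected component per sample: job a followed by job b.
components = [[f"a_{s}", f"b_{s}"] for s in ("s1", "s2", "s3", "s4", "s5")]

for n in (1, 2, 5):
    batches = batch_components(components, n)
    print(n, len(batches))
```

Under these assumptions, five samples yield five submissions without `--group-components`, three with `g1=2`, and one with `g1=5`.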
Scalable to any platform
workstation
compute server
cluster
grid computing
cloud computing
Conda integration
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

envs/some-tool.yaml:
channels:
  - conda-forge
dependencies:
  - some-tool =2.3.1
  - some-lib =1.1.2
Container integration
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool:2.3.1"
    shell:
        "some-tool {input} > {output}"
Self-contained HTML reports
New in Snakemake 8: a plugin architecture
Datavzrd: low-code interactive result views
Input:
- set of tables
- relations between tables
- set of rendering definitions
Output:
- portable interactive visual presentation
- static printable version
name: My oscar report
default-view: oscars
datasets:
  oscars:
    path: "data/oscars.csv"
    links:
      link to oscar plot:
        column: age
        view: oscar-plot
      link to movie:
        column: movie
        table-row: movies/Title
  movies:
    path: "data/movies.csv"
    links:
      link to oscar entry:
        column: Title
        table-row: oscars/movie
views:
  oscars:
    dataset: oscars
    desc: |
      ### All winning oscars beginning in the year 1929.
      This table contains *all* winning oscars for best actress and best actor.
    page-size: 25
    render-table:
      columns:
        age:
          plot:
            ticks:
              scale: linear
              domain:
                - 20
                - 100
        name:
          link-to-url: "https://lmgtfy.app/?q=Is {name} in {movie}?"
        movie:
          link-to-url: "https://de.wikipedia.org/wiki/{value}"
        award:
          plot:
            heatmap:
              scale: ordinal
              domain:
                - Best actor
                - Best actress
              range:
                - "#add8e6"
                - "#ffb6c1"
        index(0):
          display-mode: hidden
        regex('birth_.+'):
          display-mode: detail
  movies:
    dataset: movies
    render-table:
      columns:
        Genre:
          ellipsis: 15
        imdbID:
          link-to-url: "https://www.imdb.com/title/{value}/"
        Title:
          link-to-url: "https://de.wikipedia.org/wiki/{value}"
        imdbRating:
          precision: 1
          plot:
            bars:
              scale: linear
              domain:
                - 1
                - 10
        Rated:
          plot-view-legend: true
          plot:
            heatmap:
              scale: ordinal
              color-scheme: accent
  oscar-plot:
    dataset: oscars
    desc: |
      ## My beautiful oscar scatter plot
      *So many great actors and actresses*
    render-plot:
      spec-path: ".examples/specs/oscars.vl.json"
  movies-plot:
    dataset: movies
    desc: |
      All movies with its *runtime* and *ratings* plotted over *time*.
    render-plot:
      spec-path: ".examples/specs/movies.vl.json"
Snakemake workflow catalog
Conclusion
Snakemake covers all aspects of fully reproducible, transparent, and adaptable data analysis.
Datavzrd can be used to rapidly obtain visual interactive interfaces to tabular results.
https://snakemake.github.io
https://github.com/datavzrd/datavzrd
The Snakemake workflow management system is a tool for creating reproducible, scalable, transparent, and adaptable data analyses. Workflows are described in a human-readable, Python-based language. They scale seamlessly to server, cluster, grid, and cloud environments without any change to the workflow definition. Snakemake workflows can also include a description of the required software, which is deployed automatically to any execution environment. With over 1 million downloads and, on average, more than 11 new citations per week (>2000 in total), Snakemake is a widely used and accepted standard for reproducible data science. It has powered numerous high-impact publications and is used in many important projects, from shaping our knowledge of the human genome, to virus surveillance during the SARS-CoV-2 pandemic, to particle physics at CERN. This talk will briefly introduce Snakemake, followed by detailed insights into recent ecosystem advances and best practices.