Transparency, reproducibility, scalability, and the democratization of an ecosystem - Snakemake in 2025
Johannes Köster
University of Duisburg-Essen
2025
Data analysis should support:
- Reproducibility: check computational validity, apply the same analysis to new data
- Transparency: check methodological validity, understand the analysis
- Adaptability: modify the analysis, extend the analysis
>1 million downloads since 2015
>3000 citations
>14 citations per week in 2024


[Figure: a workflow DAG turning many datasets into aggregated results]
Define workflows in terms of rules
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"

rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"

rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:
        "some-tool {input} > {output}"
A rule has a name and describes how to create its output from its input. In addition, a rule can define:
- input
- output
- log files
- parameters
- resources
as illustrated in the sketch below.
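A minimal sketch of a rule using all of these directives (tool name, paths, and values are illustrative):

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    log:
        "logs/mytask/{sample}.log"
    params:
        cutoff=0.05
    threads: 4
    resources:
        mem_mb=2000
    shell:
        "some-tool --cutoff {params.cutoff} --threads {threads} {input} > {output} 2> {log}"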
Automatic inference of DAG of jobs
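Snakemake builds the DAG by matching each rule's input file patterns against the output patterns of other rules. The inferred DAG can be inspected without running any job, for example (assuming Graphviz is installed):

snakemake --dag plots/myplot.pdf | dot -Tsvg > dag.svg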
Boilerplate-free integration of scripts
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"
reusable scripts:
- Python
- R
- Julia
- Rust
- Bash
Python:
import pandas as pd
data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")

R:
data <- read.table(snakemake@input[[1]])
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])

Rust:
// sketch using the polars crate (the exact polars API varies between versions)
use polars::prelude::*;
use std::fs::File;
let mut df = CsvReader::from_path(&snakemake.input[0])?
    .finish()?
    .sort(["id"], Default::default())?;
CsvWriter::new(File::create(&snakemake.output[0])?).finish(&mut df)?;
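Besides input and output, scripts can access further job properties through the same object; a Python sketch (the parameter name is hypothetical, and the log directive must be declared in the rule):

# inside scripts/myscript.py
sample = snakemake.wildcards.sample           # wildcard values of this job
cutoff = snakemake.params.cutoff              # hypothetical parameter from the rule
with open(snakemake.log[0], "w") as logfile:  # declared log file
    logfile.write(f"processing {sample} with cutoff {cutoff}\n")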
Reusable wrappers

rule sort_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"

Reusable wrappers from a central repository
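Wrappers typically also accept threads and extra command line arguments via params; a sketch (the version tag and the extra parameter name are assumptions, check the wrapper's documentation):

rule sort_reads_tuned:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    params:
        extra="-m 4G"  # extra arguments handed to samtools sort (assumed parameter name)
    threads: 8
    wrapper:
        "v1.25.0/bio/samtools/sort"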
Scheduling

Job selection is an optimization problem: Snakemake weighs job resource usage against free resources, and takes job priority, job thread usage, and temp file consumption (temp file size, lifetime fraction, and deletion) into account.

DAG partitioning:
--groups a=g1 b=g1
--groups a=g1 b=g1 --group-components g1=2
--groups a=g1 b=g1 --group-components g1=5

Here, --groups assigns the jobs of rules a and b to group g1, so that they are executed together; --group-components g1=n additionally merges n connected components of that group into a single job. Groups can also be declared in the workflow itself, as sketched below.
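A minimal sketch, reusing the filtration rule from above:

rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    group: "g1"
    shell:
        "mycommand {input} > {output}"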
Scalable to any platform
- workstation
- compute server
- cluster
- grid computing
- cloud computing

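Since Snakemake 8, execution backends are provided as executor plugins; for example, submitting jobs to a SLURM cluster (requires the snakemake-executor-plugin-slurm package):

snakemake --executor slurm --jobs 100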
Conda integration

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

# envs/some-tool.yaml:
channels:
  - conda-forge
dependencies:
  - some-tool =2.3.1
  - some-lib =1.1.2
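The declared environments are then created and activated per job once conda deployment is enabled (Snakemake 8 flag shown; older versions use --use-conda):

snakemake --software-deployment-method conda --cores 4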
Container integration

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool:2.3.1"
    shell:
        "some-tool {input} > {output}"
Self-contained HTML reports
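After a successful run, a single self-contained HTML file bundling results, provenance information, and runtime statistics can be produced:

snakemake --report report.html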
Interoperability with other WMS

rule chipseq_pipeline:
    input:
        input="design.csv",
        fasta="data/genome.fasta",
        gtf="data/genome.gtf",
        # any --<argname> pipeline file arguments can be given here as <argname>=<path>
    output:
        report="results/multiqc/broadPeak/multiqc_report.html",
    params:
        pipeline="nf-core/chipseq",
        revision="2.0.0",
        profile=["test", "docker"],
        outdir=subpath(output.report, ancestor=2),
        # any --<argname> pipeline arguments can be given here as <argname>=<value>
    # hand over all resources of this job to the spawned Nextflow run
    handover: True
    wrapper:
        "v7.2.0/utils/nextflow"
Many more features
- dynamic DAG rewiring
- service jobs (providing sockets, loading databases, or ramdisks)
- semantic helper functions for minimizing boilerplate code
- fallible rules
- caching of shared results across workflows
- transparent handling of remote storage
Extensible architecture
from dataclasses import dataclass, field
from typing import Optional

from snakemake_interface_common.exceptions import WorkflowError
from snakemake_interface_report_plugins.reporter import ReporterBase
from snakemake_interface_report_plugins.settings import ReportSettingsBase


# Optional:
# Define additional settings for your reporter.
# They will occur in the Snakemake CLI as --report-<reporter-name>-<param-name>.
# Omit this class if you don't need any.
# Make sure that all defined fields are Optional (or bool) and specify a default value
# of None (or False) or anything else that makes sense in your case.
@dataclass
class ReportSettings(ReportSettingsBase):
    myparam: Optional[int] = field(
        default=None,
        metadata={
            "help": "Some help text",
            # Optionally request that the setting is also available for specification
            # via an environment variable. The variable will be named automatically as
            # SNAKEMAKE_REPORT_<reporter-name>_<param-name>, all upper case.
            # This mechanism should ONLY be used for passwords and usernames.
            # For other items, we rather recommend to let people use a profile
            # for setting defaults
            # (https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles).
            "env_var": False,
            # Optionally specify a function that parses the value given by the user.
            # This is useful to create complex types from the user input.
            "parse_func": ...,
            # If a parse_func is specified, you also have to specify an unparse_func
            # that converts the parsed value back to a string.
            "unparse_func": ...,
            # Optionally specify that the setting is required when the reporter is in use.
            "required": True,
            # Optionally specify multiple args with "nargs": True
        },
    )


# Required:
# Implementation of your reporter
class Reporter(ReporterBase):
    def __post_init__(self):
        # Initialize additional attributes here.
        # Do not overwrite the __init__ method, as it is kept in control of the base
        # class in order to simplify the update process.
        # In particular, the settings of the above ReportSettings class are accessible
        # via self.settings.
        ...

    def render(self):
        # Render the report, using attributes of the base class.
        ...
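Once released under the plugin naming convention (here with the hypothetical name myreporter), the reporter and its settings become available on the CLI, with settings exposed as --report-<reporter-name>-<param-name> as noted above:

pip install snakemake-report-plugin-myreporter
snakemake --reporter myreporter --report-myreporter-myparam 42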
Resource handling

Use builtin defaults:
snakemake --default-resources --jobs 100
# equivalent to:
# mem_mb="min(max(2*input.size_mb, 1000), 8000)"
# disk_mb="max(2*input.size_mb, 1000) if input else 50000"

Specify defaults:
snakemake --default-resources mem_mb=... --jobs 100

Store in profile:
# /etc/xdg/snakemake/default/config.yaml:
default-resources:
  mem_mb: "min(max(2*input.size_mb, 1000), 8000)"
  ...

Define per rule:
# profiles/default/config.yaml:
set-resources:
  mytask:
    mem_mb: 16000
    ...
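Resources can also be computed per job via callables, e.g. growing with input size and retry attempt (a minimal sketch):

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    resources:
        # scale the memory request with input size and double it on each retry (--retries)
        mem_mb=lambda wildcards, input, attempt: max(2 * input.size_mb, 1000) * attempt
    shell:
        "some-tool {input} > {output}"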
Coming soon: federated execution

Conclusion
Snakemake covers all aspects of fully reproducible, transparent, and adaptable data analysis, offering
- maximum readability
- ad-hoc integration with scripting and high performance languages
- an extensible architecture
- a plethora of advanced features
https://snakemake.github.io

Join the hackathon at TU Munich in March 2026!
Snakemake workflow catalog