Transparency, reproducibility, scalability, and the democratization of an ecosystem - Snakemake in 2025

Johannes Köster

 

University of Duisburg-Essen

 

2025

Data analysis

Reproducibility

  • check computational validity
  • apply same analysis to new data

Transparency

  • check methodological validity
  • understand analysis

Adaptability

  • modify analysis
  • extend analysis

>1 million downloads since 2015

>3000 citations

>14 citations per week in 2024

[Figure: many datasets are processed into results]

Define workflows in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"


rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:
        "some-tool {input} > {output}"

Each rule consists of

  • a rule name (here: mytask)
  • a description of how to create output from input (here: the shell command)
  • definitions of input, output, log files, parameters, and resources

Automatic inference of DAG of jobs
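
How such an inference can work in principle is sketched below in plain Python. This is a simplified illustration and not Snakemake's actual implementation: the RULES dictionary is a hand-written model of the three rules above (with all intermediate files under result/), SOURCE_FILES is a hypothetical set of files assumed to exist on disk, and each wildcard is assumed to match any non-empty string and to occur at most once per pattern.

```python
import re

# Hand-written model of the three rules (an assumption for illustration).
RULES = {
    "mytask": {
        "input": ["path/to/{dataset}.txt"],
        "output": ["result/{dataset}.txt"],
    },
    "myfiltration": {
        "input": ["result/{dataset}.txt"],
        "output": ["result/{dataset}.filtered.txt"],
    },
    "aggregate": {
        "input": ["result/dataset1.filtered.txt", "result/dataset2.filtered.txt"],
        "output": ["plots/myplot.pdf"],
    },
}

# Hypothetical files that already exist and need no job.
SOURCE_FILES = {"path/to/dataset1.txt", "path/to/dataset2.txt"}


def match(pattern, path):
    """Return the wildcard values if path matches pattern, else None."""
    regex, pos = "", 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:m.start()]) + f"(?P<{m.group(1)}>.+)"
        pos = m.end()
    regex += re.escape(pattern[pos:])
    hit = re.fullmatch(regex, path)
    return hit.groupdict() if hit else None


def plan(target):
    """Return the jobs needed to create target in execution order, or None."""
    if target in SOURCE_FILES:
        return []
    for name, rule in RULES.items():
        for out in rule["output"]:
            wildcards = match(out, target)
            if wildcards is None:
                continue
            jobs = []
            for inp in rule["input"]:
                upstream = plan(inp.format(**wildcards))
                if upstream is None:
                    break  # this rule cannot be used, try the next candidate
                jobs.extend(j for j in upstream if j not in jobs)
            else:
                return jobs + [(name, wildcards)]
    return None


for rule_name, wildcards in plan("plots/myplot.pdf"):
    print(rule_name, wildcards)
```

Requesting plots/myplot.pdf yields one mytask and one myfiltration job per dataset, followed by a single aggregate job. A candidate rule is discarded when one of its inputs can neither be found nor produced, which is how the sketch avoids using mytask for the .filtered.txt files.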

Boilerplate-free integration of scripts

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"

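reusable scripts:

  • Python
  • R
  • Julia
  • Rust
  • Bash

Python:

import pandas as pd

# the snakemake object is injected automatically, no argument parsing needed
data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t", index=False)

R:

# header = TRUE is needed so that the id column can be addressed by name
data <- read.table(snakemake@input[[1]], header = TRUE, sep = "\t")
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]], sep = "\t", row.names = FALSE)

Rust:

// schematic pseudocode: the calls abbreviate the polars API and do not compile as written
import polars as pl

pl.read_csv(&snakemake.input[0])
  .sort()
  .to_csv(&snakemake.output[0])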

Reusable wrappers

rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"

reusable wrappers from central repository


Scheduling

\max \; U_t \cdot 2S \cdot \sum_{j \in J} x_j \cdot p_j + 2S \cdot \sum_{j \in J} x_j \cdot u_{t,j} + S \cdot \sum_{f \in F} \gamma_f \cdot S_f + \sum_{f \in F} \delta_f \cdot S_f
\text{subject to:}
\sum_{j \in J} x_j \cdot u_{r,j} \leq U_r \quad \forall r \in R
\delta_f \leq \frac{\sum_{j \in J} x_j \cdot z_{f,j}}{\sum_{j \in J} z_{f,j}} \quad \forall f \in F
\gamma_f \leq \delta_f \quad \forall f \in F
x_j \in \{0,1\} \quad \forall j \in J
\gamma_f \in \{0,1\} \quad \forall f \in F
\delta_f \in [0,1] \quad \forall f \in F

  • x_j: job selection
  • p_j: job priority
  • u_{t,j}: job thread usage
  • u_{r,j}: job resource usage
  • U_r: free resources
  • z_{f,j}: job temp file consumption
  • S_f: temp file size
  • \delta_f: temp file lifetime fraction
  • \gamma_f: temp file deletion
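
To make the objective concrete, the following sketch evaluates it by brute force on a toy instance. It is an illustration under assumptions, not Snakemake's scheduler: all job and file data are made up, S is assumed to be the total size of all temp files, U_t is assumed to be the number of free threads, and instead of calling an ILP solver the code enumerates every subset of jobs. For a fixed selection, \delta_f is set to the fraction of consumers of f that are selected and \gamma_f to 1 only if all of them are, which are the largest values the constraints allow.

```python
from itertools import combinations

# Hypothetical toy instance (all numbers are made up for illustration).
jobs = {
    "a": {"priority": 1, "threads": 2},
    "b": {"priority": 1, "threads": 2},
    "c": {"priority": 2, "threads": 4},
}
free = {"threads": 4}  # U_r for every resource r, here only threads
temp_files = {"f1": {"size": 10, "consumers": {"a", "b"}}}


def feasible(selected):
    """Check sum_j x_j * u_{r,j} <= U_r for every resource r."""
    return all(
        sum(jobs[j][r] for j in selected) <= capacity
        for r, capacity in free.items()
    )


def objective(selected):
    """Evaluate the objective for a set of selected job names."""
    S = sum(f["size"] for f in temp_files.values())  # assumption: total temp size
    U_t = free["threads"]                            # assumption: free threads
    priority = sum(jobs[j]["priority"] for j in selected)
    threads = sum(jobs[j]["threads"] for j in selected)
    value = U_t * 2 * S * priority + 2 * S * threads
    for f in temp_files.values():
        delta = len(f["consumers"] & selected) / len(f["consumers"])
        gamma = 1 if delta == 1 else 0
        value += S * gamma * f["size"] + delta * f["size"]
    return value


def schedule():
    """Return the feasible job subset with the highest objective value."""
    names = sorted(jobs)
    subsets = (
        set(c) for k in range(len(names) + 1) for c in combinations(names, k)
    )
    return max((s for s in subsets if feasible(s)), key=objective)


print(sorted(schedule()))
```

In this instance jobs a and b together reach the same priority sum and thread usage as job c alone, but they additionally allow temp file f1 to be deleted, so the sketch prefers them.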

DAG partitioning

--groups a=g1 b=g1
--groups a=g1 b=g1 --group-components g1=2
--groups a=g1 b=g1 --group-components g1=5
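
The effect of these options on the number of submitted jobs can be sketched as follows. The sketch assumes that --groups a=g1 b=g1 places connected jobs of rules a and b into one group job, and that --group-components g1=N lets one submitted job span up to N connected components of group g1. The function pack_components and the five-sample example are hypothetical and only mimic the resulting partitioning.

```python
def pack_components(components, per_job):
    """Split the connected components of one group into submitted jobs."""
    if per_job < 1:
        raise ValueError("per_job must be at least 1")
    return [
        components[i:i + per_job]
        for i in range(0, len(components), per_job)
    ]


# Hypothetical example: five samples, each forming one component a -> b.
components = [[f"a(sample{i})", f"b(sample{i})"] for i in range(1, 6)]

for per_job in (1, 2, 5):
    submitted = pack_components(components, per_job)
    print(f"--group-components g1={per_job}: {len(submitted)} submitted jobs")
```

Larger values trade parallelism for fewer submissions, which can help on clusters where queueing overhead dominates the runtime of short jobs.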

Scalable to any platform

workstation

compute server

cluster

grid computing

cloud computing

Conda integration

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

# envs/some-tool.yaml:
channels:
  - conda-forge
dependencies:
  - some-tool =2.3.1
  - some-lib =1.1.2

Container integration

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool:2.3.1"
    shell:
        "some-tool {input} > {output}"

Self-contained HTML reports

Interoperability with other WMS

rule chipseq_pipeline:
    input:
        input="design.csv",
        fasta="data/genome.fasta",
        gtf="data/genome.gtf",
        # any --<argname> pipeline file arguments can be given here as <argname>=<path>
    output:
        report="results/multiqc/broadPeak/multiqc_report.html",
    params:
        pipeline="nf-core/chipseq",
        revision="2.0.0",
        profile=["test", "docker"],
        outdir=subpath(output.report, ancestor=2),
        # any --<argname> pipeline arguments can be given here as <argname>=<value>
    handover: True
    wrapper:
        "v7.2.0/utils/nextflow"

Many more features

  • dynamic DAG rewiring
  • service jobs (providing sockets, loading databases, or ramdisks)
  • semantic helper functions for minimizing boilerplate code
  • fallible rules
  • caching of shared results across workflows
  • transparent handling of remote storage

Extensible architecture


from dataclasses import dataclass, field
from typing import Optional

from snakemake_interface_common.exceptions import WorkflowError
from snakemake_interface_report_plugins.reporter import ReporterBase
from snakemake_interface_report_plugins.settings import ReportSettingsBase


# Optional:
# Define additional settings for your reporter.
# They will occur in the Snakemake CLI as --report-<reporter-name>-<param-name>
# Omit this class if you don't need any.
# Make sure that all defined fields are Optional (or bool) and specify a default value
# of None (or False) or anything else that makes sense in your case.
@dataclass
class ReportSettings(ReportSettingsBase):
    myparam: Optional[int] = field(
        default=None,
        metadata={
            "help": "Some help text",
            # Optionally request that setting is also available for specification
            # via an environment variable. The variable will be named automatically as
            # SNAKEMAKE_REPORT_<reporter-name>_<param-name>, all upper case.
            # This mechanism should ONLY be used for passwords and usernames.
            # For other items, we rather recommend to let people use a profile
            # for setting defaults
            # (https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles).
            "env_var": False,
            # Optionally specify a function that parses the value given by the user.
            # This is useful to create complex types from the user input.
            "parse_func": ...,
            # If a parse_func is specified, you also have to specify an unparse_func
            # that converts the parsed value back to a string.
            "unparse_func": ...,
            # Optionally specify that setting is required when the reporter is in use.
            "required": True,
            # Optionally specify multiple args with "nargs": True
        },
    )


# Required:
# Implementation of your reporter
class Reporter(ReporterBase):
    def __post_init__(self):
        # initialize additional attributes
        # Do not overwrite the __init__ method as this is kept in control of the base
        # class in order to simplify the update process.
        # In particular, the settings of above ReportSettings class are accessible via
        # self.settings.
        ...

    def render(self):
        # Render the report, using attributes of the base class.
        ...


Resource handling

Use builtin defaults:

snakemake --default-resources --jobs 100
mem_mb="min(max(2*input.size_mb, 1000), 8000)"
disk_mb="max(2*input.size_mb, 1000) if input else 50000"

Specify defaults:

snakemake --default-resources mem_mb=... --jobs 100

Store in profile:

# /etc/xdg/snakemake/default/config.yaml:
default-resources:
  mem_mb: "min(max(2*input.size_mb, 1000), 8000)"
  ...

Define per rule:

# profiles/default/config.yaml:
set-resources:
  mytask:
    mem_mb: 16000
  ...
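
What the builtin default expressions evaluate to for a given job can be checked with a few lines of Python. This is only a sketch of the arithmetic: InputFiles is a hypothetical stand-in for a job's input (a list of path and size pairs with a size_mb property), and evaluate simply runs the expression string with that object bound to the name input, which is not necessarily how Snakemake evaluates it internally.

```python
MEM_MB = "min(max(2*input.size_mb, 1000), 8000)"
DISK_MB = "max(2*input.size_mb, 1000) if input else 50000"


class InputFiles(list):
    """Hypothetical stand-in for job input: list of (path, size in MB)."""

    @property
    def size_mb(self):
        return sum(size for _, size in self)


def evaluate(expression, input_files):
    """Evaluate a resource expression with 'input' bound to input_files.

    eval is used only because the expressions are trusted constants here.
    """
    allowed = {"__builtins__": {}, "min": min, "max": max}
    return eval(expression, allowed, {"input": input_files})


small = InputFiles([("a.txt", 300)])      # 2 * 300 is below the 1000 MB floor
medium = InputFiles([("b.txt", 2500)])    # 2 * 2500 lies within the bounds
large = InputFiles([("c.txt", 10000)])    # 2 * 10000 exceeds the 8000 MB cap
no_input = InputFiles()                   # job without input files

for files in (small, medium, large, no_input):
    print(files.size_mb, evaluate(MEM_MB, files), evaluate(DISK_MB, files))
```

The memory default thus scales with twice the input size but stays between 1000 and 8000 MB, while the disk default has no upper cap and falls back to 50000 MB for jobs without input.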

Coming soon: federated execution

Conclusion

Snakemake covers all aspects of fully reproducible, transparent, and adaptable data analysis, offering

  • maximum readability
  • ad-hoc integration with scripting and high performance languages
  • an extensible architecture
  • a plethora of advanced features

https://snakemake.github.io

Join the hackathon at TU Munich in March 2026!

Snakemake workflow catalog
