Johannes Köster
University of Duisburg-Essen
2025
Reproducibility
Transparency
Adaptability
>1.6 million downloads since 2015
>3000 citations
>14 citations per week in 2024
dataset
results
dataset
dataset
dataset
dataset
dataset
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
rule myfiltration:
input:
"result/{dataset}.txt"
output:
"result/{dataset}.filtered.txt"
shell:
"mycommand {input} > {output}"
rule aggregate:
input:
"results/dataset1.filtered.txt",
"results/dataset2.filtered.txt"
output:
"plots/myplot.pdf"
script:
"scripts/myplot.R"
rule mytask:
input:
"data/{sample}.txt"
output:
"result/{sample}.txt"
shell:
"some-tool {input} > {output}"
rule name
how to create output from input
define
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
rule myfiltration:
input:
"result/{dataset}.txt"
output:
"result/{dataset}.filtered.txt"
shell:
"mycommand {input} > {output}"
rule aggregate:
input:
"results/dataset1.filtered.txt",
"results/dataset2.filtered.txt"
output:
"plots/myplot.pdf"
script:
"scripts/myplot.R"
rule mytask:
input:
"data/{sample}.txt"
output:
"result/{sample}.txt"
script:
"scripts/myscript.py"
reusable scripts:
import pandas as pd
data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")
Python:
data <- read.table(snakemake@input[[1]])
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])
R:
import polars as pl
pl.read_csv(&snakemake.input[0])
.sort()
.to_csv(&snakemake.output[0])
Rust:
$XONSH_TRACEBACK_LOGFILE = snakemake.log[0]
xsv sort -s id @(snakemake.input[0]) > @(snakemake.output[0])
Xonsh:
(import pandas :as pd)
(setv data (pd.read_table (get snakemake.input 0)))
(setv data (.sort_values data "id"))
(.to_csv data (get snakemake.output 0) :sep "\t")
Hy:
#!/usr/bin/env bash
exec 2> "${snakemake_log[0]}"
xsv sort -s id ${snakemake_input[0]} > ${snakemake_output[0]}
Bash:
new!
new!
rule map_reads:
input:
"{sample}.bam"
output:
"{sample}.sorted.bam"
wrapper:
"0.22.0/bio/samtools/sort"
reusable wrappers from central repository
job selection
job resource usage
free resources
job temp file consumption
temp file lifetime fraction
job priority
job thread usage
temp file size
temp file deletion
--groups a=g1 b=g1
--groups a=g1 b=g1 --group-components g1=2
--groups a=g1 b=g1 --group-components g1=5
snakemake --default-resources --jobs 100
mem_mb="min(max(2*input.size_mb, 1000), 8000)"
disk_mb="max(2*input.size_mb, 1000) if input else 50000"
snakemake --default-resources mem_mb=... --jobs 100
Specify defaults:
Use builtin defaults:
Store in profile:
# /etc/xdg/snakemake/default/config.yaml:
default-resources:
mem_mb: "min(max(2*input.size_mb, 1000), 8000)"
...
Define per rule:
# profiles/default/config.yaml:
set-resources:
mytask:
mem_mb: 16000
...
workstation
compute server
cluster
grid computing
cloud computing
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
conda:
"envs/some-tool.yaml"
shell:
"some-tool {input} > {output}"
channels:
- conda-forge
dependencies:
- some-tool =2.3.1
- some-lib =1.1.2
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
container:
"docker://biocontainers/some-tool:2.3.1"
shell:
"some-tool {input} > {output}"
rule chipseq_pipeline:
input:
input="design.csv",
fasta="data/genome.fasta",
gtf="data/genome.gtf",
# any --<argname> pipeline file arguments can be given here as <argname>=<path>
output:
report="results/multiqc/broadPeak/multiqc_report.html",
params:
pipeline="nf-core/chipseq",
revision="2.0.0",
profile=["test", "docker"],
outdir=subpath(output.report, ancestor=2),
# any --<argname> pipeline arguments can be given here as <argname>=<value>
handover: True
wrapper:
"v7.2.0/utils/nextflow"
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
params:
some_threshold=lookup(
dpath="some_tool/thresholds/{dataset}",
within=config,
default=0.1
)
shell:
"some-tool {input} > {output}"
new!
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
params:
some_threshold=lookup(
query="dataset == '{dataset}'",
cols="threshold",
within=sheet
)
shell:
"some-tool {input} > {output}"
new!
rule mytask:
input:
branch(
lookup(dpath="prefilter/activate", within=config),
then="results/preprocessed/{dataset}",
otherwise="path/to/{dataset}"
)
output:
"result/{dataset}.txt"
shell:
"some-tool {input} > {output}"
new!
rule mytask:
input:
branch(
lookup(dpath="prefilter/activate", within=config),
then="results/preprocessed/{dataset}",
otherwise="path/to/{dataset}"
)
output:
foo="result/{dataset}.txt"
params:
outdir=subpath(output.foo, parent=True)
shell:
"some-tool {input} -o {params.outdir}"
new!
rule mytask:
input:
branch(
lookup(dpath="prefilter/activate", within=config),
then="results/preprocessed/{dataset}",
otherwise="path/to/{dataset}"
)
output:
foo="result/{dataset}.txt"
params:
prefix=subpath(output.foo, strip_suffix=".txt")
shell:
"some-tool {input} -o {params.prefix}"
new!
rule mytask:
input:
"<resources>/{dataset}.txt"
output:
"<results>/{dataset}.txt"
shell:
"some-tool {input} > {output}"
new!
# config.yaml
pathvars:
results: some/custom/path/results
resources: some/custom/path/resources
module star_arriba:
meta_wrapper: "v9.0.1/meta/bio/star_arriba"
pathvars:
results="...", # Path to results directory
resources="...", # Path to resources directory
logs="...", # Path to logs directory
genome_sequence="...", # Path to FASTA file with genome sequence
genome_annotation="...", # Path to GTF file with genome annotation
reads_r1="...", # Path/pattern for FASTQ files with R1 reads
reads_r2="...", # Path/pattern for FASTQ files with R2 reads
per="...", # Pattern for sample identifiers, e.g. ``"{sample}"``
use rule * from star_arriba as star_arriba_*
new!
inputflags:
access.sequential
rule mytask:
input:
access.random("<resources>/{dataset}.txt")
output:
"<results>/{dataset}.txt"
shell:
"some-tool {input} > {output}"
new!
from dataclasses import dataclass, field
from snakemake_interface_common.exceptions import WorkflowError
from snakemake_interface_report_plugins.reporter import ReporterBase
from snakemake_interface_report_plugins.settings import ReportSettingsBase
# Optional:
# Define additional settings for your reporter.
# They will occur in the Snakemake CLI as --report-<reporter-name>-<param-name>
# Omit this class if you don't need any.
# Make sure that all defined fields are Optional (or bool) and specify a default value
# of None (or False) or anything else that makes sense in your case.
@dataclass
class ReportSettings(ReportSettingsBase):
myparam: Optional[int] = field(
default=None,
metadata={
"help": "Some help text",
# Optionally request that setting is also available for specification
# via an environment variable. The variable will be named automatically as
# SNAKEMAKE_REPORT_<reporter-name>_<param-name>, all upper case.
# This mechanism should ONLY be used for passwords and usernames.
# For other items, we rather recommend to let people use a profile
# for setting defaults
# (https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles).
"env_var": False,
# Optionally specify a function that parses the value given by the user.
# This is useful to create complex types from the user input.
"parse_func": ...,
# If a parse_func is specified, you also have to specify an unparse_func
# that converts the parsed value back to a string.
"unparse_func": ...,
# Optionally specify that setting is required when the reporter is in use.
"required": True,
# Optionally specify multiple args with "nargs": True
},
)
# Required:
# Implementation of your reporter
class Reporter(ReporterBase):
def __post_init__(self):
# initialize additional attributes
# Do not overwrite the __init__ method as this is kept in control of the base
# class in order to simplify the update process.
# In particular, the settings of above ReportSettings class are accessible via
# self.settings.
def render(self):
# Render the report, using attributes of the base class.
...
Snakemake covers all aspects of fully reproducible, transparent, and adaptable data analysis, offering
https://snakemake.github.io