Snakemake Updates

Recent features and improvements

June 2021

Johannes Köster

https://koesterlab.github.io

Snakemake

>300k downloads since 2015

>1100 citations

>6 citations per week in 2020

GIAB, Nextstrain, ...

[Figure: dataset, results]

Define workflows in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "results/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
    input:
        "results/{dataset}.txt"
    output:
        "results/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

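To make the role of the {dataset} wildcard concrete, here is a minimal Python sketch of how a requested output file could be matched against a rule's output pattern to infer the wildcard value and derive the corresponding input file. It is a simplified illustration, not Snakemake's actual implementation; the helper names pattern_to_regex and infer_input are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Turn e.g. 'results/{dataset}.txt' into a regex with one named group per wildcard."""
    # re.split with a capturing group yields: literal, wildcard name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)      # literal path fragment
        else:
            regex += f"(?P<{part}>.+)"    # wildcard matches any non-empty string
    return re.compile(f"^{regex}$")


def infer_input(target, output_pattern, input_pattern):
    """Infer wildcard values from the target and fill them into the input pattern."""
    match = pattern_to_regex(output_pattern).match(target)
    if match is None:
        return None  # this rule cannot produce the requested file
    return input_pattern.format(**match.groupdict())


needed = infer_input("results/dataset1.txt", "results/{dataset}.txt", "path/to/{dataset}.txt")
```

Requesting results/dataset1.txt thus tells the sketch that path/to/dataset1.txt is needed, which is the backward inference that lets rules be chained from the final targets to the raw data.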
Beyond reproducibility

Adaptability

A new module system

module some_module:
    snakefile:
        "workflow/modules/some_module/Snakefile"

declare modules to be used in your workflow

use rule * from some_module

declare rule usage from module

use rule map_reads from some_module with:
    params:
        sort="coordinate"

modify specific rules from the module

A new module system

configfile: "config/config.yaml"

module rna_seq:
    snakefile:
        "https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/raw/v2.0.1/workflow/Snakefile"
    config:
        config["rna-seq"]

module dna_seq:
    snakefile:
        "https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/raw/v2.0.1/Snakefile"
    config:
        config["dna-seq"]


use rule * from rna_seq as rna_seq_*

use rule * from dna_seq as dna_seq_*

easily combine multiple workflows into one

rule some_integrated_analysis:
    input:
        calls="results/calls/all.vcf.gz",
        diffexp="results/diffexp/all.tsv"
    output:
        "results/integrated-analysis/all.svg"
    notebook:
        "workflow/notebooks/integrated-analysis.r.ipynb"

make extensions and modifications transparent
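The config: entries above hand each module its own section of config/config.yaml, and the "as rna_seq_*" / "as dna_seq_*" clauses keep rule names from clashing. A minimal Python sketch of both ideas follows. It only mimics the behaviour and does not use Snakemake's API; the configuration keys (samples) and the rule names are hypothetical.

```python
# Hypothetical parsed content of config/config.yaml: one section per module.
config = {
    "rna-seq": {"samples": "config/rna-samples.tsv"},
    "dna-seq": {"samples": "config/dna-samples.tsv"},
}


def module_config(config, section):
    """Return the sub-configuration that a module would see as its own `config`."""
    return config[section]


def prefixed(rule_names, pattern):
    """Apply a renaming pattern such as 'rna_seq_*' to the rule names of a module."""
    return [pattern.replace("*", name) for name in rule_names]


rna_rules = prefixed(["all", "map_reads"], "rna_seq_*")
dna_rules = prefixed(["all", "map_reads"], "dna_seq_*")
```

Both modules may define a rule called all or map_reads; after prefixing, the names are distinct, which is what allows the two workflows to live in one Snakefile.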

Deployment

snakedeploy deploy-workflow https://github.com/snakemake-workflows/rna-seq-star-deseq2 \
                            . --tag v1.1.2

├── config
│   ├── config.yaml
│   ├── README.md
│   ├── samples.tsv
│   └── units.tsv
└── workflow
    └── Snakefile

configfile: "config/config.yaml"


# declare https://github.com/snakemake-workflows/rna-seq-star-deseq2 as a module
module rna_seq_star_deseq2:
    snakefile:
        "https://github.com/snakemake-workflows/rna-seq-star-deseq2/raw/v1.1.2/workflow/Snakefile"
    config:
        config


# use all rules from https://github.com/snakemake-workflows/rna-seq-star-deseq2
use rule * from rna_seq_star_deseq2
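The snakefile URL in the module declaration pins the workflow to the tag passed via --tag. As a small illustration, the URL can be composed from the repository and the tag as below. The helper module_snakefile_url is hypothetical, and snakedeploy may construct the URL differently internally.

```python
def module_snakefile_url(repo_url, tag, snakefile="workflow/Snakefile"):
    """Compose the raw-file URL of a Snakefile at a given git tag (hypothetical helper)."""
    return f"{repo_url.rstrip('/')}/raw/{tag}/{snakefile}"


url = module_snakefile_url(
    "https://github.com/snakemake-workflows/rna-seq-star-deseq2", "v1.1.2"
)
```

Changing only the tag argument is then enough to move the deployed workflow to another release.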

Snakemake workflow catalog

Portability

Automatic containerization

fast and ad-hoc software stack definition
using conda packages

rule analyze_stuff:
    input:
        "resources/raw-data.tsv"
    output:
        "results/matrix.tsv"
    conda:
        "envs/pandas.yaml"
    script:
        "scripts/analyze-stuff.py"


rule plot_stuff:
    input:
        "results/matrix.tsv"
    output:
        "results/plots/myplot.pdf"
    conda:
        "envs/ggplot.yaml"
    notebook:
        "notebooks/plot-stuff.r.ipynb"

Automatic containerization

containerization automatically yields a
transparent yet concise Dockerfile

FROM condaforge/mambaforge:latest
LABEL io.github.snakemake.containerized="true"
LABEL io.github.snakemake.conda_env_hash="729e69b7e0a6c76ba7a5f69bd51474f68d37443999e0952f0e9d63bb0d9cfe92"

# Step 1: Retrieve conda environments

# Conda environment:
#   source: envs/ggplot.yaml
#   prefix: /conda-envs/dcef9d5a2891d184878bd1d9bde72a52
#   channels:
#     - conda-forge
#   dependencies:
#     - r-base 4.0
#     - r-ggplot2 3.3
RUN mkdir -p /conda-envs/dcef9d5a2891d184878bd1d9bde72a52
COPY envs/ggplot.yaml /conda-envs/dcef9d5a2891d184878bd1d9bde72a52/environment.yaml

# Conda environment:
#   source: envs/pandas.yaml
#   prefix: /conda-envs/250d5a01e8ff0d636f8f5d03dee073b7
#   channels:
#     - conda-forge
#   dependencies:
#     - python 3.9
#     - pandas 1.2
RUN mkdir -p /conda-envs/250d5a01e8ff0d636f8f5d03dee073b7
COPY envs/pandas.yaml /conda-envs/250d5a01e8ff0d636f8f5d03dee073b7/environment.yaml

# Step 2: Generate conda environments

RUN mamba env create --prefix /conda-envs/dcef9d5a2891d184878bd1d9bde72a52 --file /conda-envs/dcef9d5a2891d184878bd1d9bde72a52/environment.yaml && \
    mamba env create --prefix /conda-envs/250d5a01e8ff0d636f8f5d03dee073b7 --file /conda-envs/250d5a01e8ff0d636f8f5d03dee073b7/environment.yaml && \
    mamba clean --all -y

Automatic containerization

build, upload and use the resulting container image

containerized: "quay.io/some-username/my-workflow-image:1.0"


rule analyze_stuff:
    input:
        "resources/raw-data.tsv"
    output:
        "results/matrix.tsv"
    conda:
        "envs/pandas.yaml"
    script:
        "scripts/analyze-stuff.py"


rule plot_stuff:
    input:
        "results/matrix.tsv"
    output:
        "results/plots/myplot.pdf"
    conda:
        "envs/ggplot.yaml"
    notebook:
        "notebooks/plot-stuff.r.ipynb"

Scalability

Scheduling

\max T \cdot S \cdot \sum_{j \in J} x_j \cdot p_j + S \cdot \sum_{j \in J} x_j \cdot (u_{t,j}+1) + \sum_{f \in F} \delta_f \cdot S_f

\text{subject to:}

\sum_{j \in J} x_j \cdot u_{r,j} \leq U_r \quad \forall r \in R

\delta_f \leq \frac{\sum_{j \in J} x_j \cdot z_{f,j}}{\sum_{j \in J} z_{f,j}} \quad \forall f \in F

x_j: job selection
p_j: job priority
u_{t,j}: job thread usage
u_{r,j}: job resource usage
U_r: free resources
z_{f,j}: job temp file consumption
\delta_f: temp file lifetime fraction
S_f: temp file size
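To see how the objective and constraints above interact, here is a minimal Python sketch that solves a tiny instance by brute-force enumeration of all job selections. This is not how Snakemake solves the problem (it is stated as an integer linear program); exhaustive search is swapped in only to keep the example self-contained. T and S are treated as given constants, and the job data, the function name select_jobs and the dictionary layout are hypothetical.

```python
from itertools import product


def select_jobs(jobs, free, temp_sizes, T, S):
    """Enumerate all selections x in {0,1}^|J| and return the best feasible one."""
    best_x, best_value = None, float("-inf")
    for x in product((0, 1), repeat=len(jobs)):
        chosen = list(zip(x, jobs))
        # Constraint: sum_j x_j * u_{r,j} <= U_r for every resource r.
        if any(
            sum(xj * job["usage"].get(r, 0) for xj, job in chosen) > cap
            for r, cap in free.items()
        ):
            continue
        # Priority term and thread-usage term of the objective.
        value = T * S * sum(xj * job["priority"] for xj, job in chosen)
        value += S * sum(xj * (job["usage"].get("threads", 0) + 1) for xj, job in chosen)
        # Temp file term: delta_f takes its upper bound, the selected fraction of consumers.
        for f, size in temp_sizes.items():
            consumers = [xj for xj, job in chosen if f in job["temps"]]
            if consumers:
                value += sum(consumers) / len(consumers) * size
        if value > best_value:
            best_x, best_value = x, value
    return best_x, best_value


jobs = [
    {"priority": 0, "usage": {"threads": 2}, "temps": {"tmp1"}},
    {"priority": 0, "usage": {"threads": 2}, "temps": {"tmp1"}},
    {"priority": 0, "usage": {"threads": 4}, "temps": set()},
]
selection, value = select_jobs(
    jobs, free={"threads": 4}, temp_sizes={"tmp1": 100}, T=4, S=100
)
```

With four free threads, the sketch selects the two 2-thread jobs rather than the single 4-thread job: together they consume the temp file tmp1 completely, so it can be deleted early.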

Runtime

Divide workflow into batches:

snakemake --batch myrule=1/10
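The argument myrule=1/10 restricts the run to the first of ten batches of the input files of myrule. A minimal Python sketch of such a partition follows. How Snakemake assigns files to batches is not stated on the slide, so the contiguous, near-equal chunks used here are an assumption, and batch_of is a hypothetical helper.

```python
def batch_of(items, batch, batches):
    """Return batch number `batch` (1-based) out of `batches` contiguous chunks."""
    if not 1 <= batch <= batches:
        raise ValueError("batch must be between 1 and batches")
    size, rest = divmod(len(items), batches)
    # The first `rest` batches receive one extra item each.
    start = (batch - 1) * size + min(batch - 1, rest)
    end = start + size + (1 if batch <= rest else 0)
    return items[start:end]


inputs = [f"sample{i}.txt" for i in range(25)]
first = batch_of(inputs, 1, 10)
```

Running all ten batches one after another covers every input exactly once, so a very large workflow can be processed piece by piece.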

Between workflow caching

[Figure: dataset, results, shared data]

DAG partitioning

Automation

Jupyter notebook integration

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    notebook:
        "notebooks/mynotebook.ipynb"

Integrated interactive edit mode.
Automatic generalization for reuse in other jobs.
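Inside the notebook, the input and output files of the job are reached through the snakemake object, which is what makes one notebook reusable for every {sample}. The following Python sketch shows what a notebook cell might do. To keep it runnable outside Snakemake, a SimpleNamespace stand-in replaces the real snakemake object; the file names and the upper-casing step are hypothetical.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def notebook_cell(snakemake):
    """Body of a hypothetical notebook cell: read the job's input, write its output."""
    text = Path(snakemake.input[0]).read_text()
    Path(snakemake.output[0]).write_text(text.upper())


# Stand-in for the object Snakemake provides inside the notebook (assumption:
# only the `input` and `output` attributes are mimicked here).
workdir = Path(tempfile.mkdtemp())
(workdir / "a.txt").write_text("hello\n")
snakemake = SimpleNamespace(
    input=[str(workdir / "a.txt")],
    output=[str(workdir / "a.out.txt")],
)
notebook_cell(snakemake)
```

Because the cell never hard-codes a path, the same notebook works for any sample the rule is asked to produce.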

Transparency

Reports

Readability

Code linting

$ snakemake --lint

Lints for snakefile /tmp/tmpm_kodywk/PyPSA-pypsa-eur-265e939/Snakefile:
    * Mixed rules and functions in same snakefile.:
      Small one-liner functions used only once should be defined as lambda
      expressions. Other functions should be collected in a common module, e.g.
      'rules/common.smk'. This makes the workflow steps more readable.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/modularization.html#includes
    * Path composition with '+' in line 8:
      This becomes quickly unreadable. Usually, it is better to endure some
      redundancy against having a more readable workflow. Hence, just repeat
      common prefixes. If path composition is unavoidable, use pathlib or
      (python >= 3.6) string formatting with f"...".


Lints for rule retrieve_databundle (line 102, /tmp/tmpm_kodywk/PyPSA-pypsa-eur-265e939/Snakefile):
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers
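The path composition lint above recommends repeating common prefixes, or using pathlib or f-strings where composition is unavoidable. A short Python comparison of the variants follows; the variable names and the example path are hypothetical. PurePosixPath is used so the result is the same on every platform.

```python
from pathlib import PurePosixPath

prefix = "results"
sample = "a"

# Flagged by the linter: composition with '+' quickly becomes hard to read.
with_plus = prefix + "/" + sample + ".txt"

# Preferred where possible: simply repeat the common prefix.
repeated_prefix = "results/a.txt"

# If composition is unavoidable: string formatting or pathlib.
with_fstring = f"{prefix}/{sample}.txt"
with_pathlib = str(PurePosixPath(prefix) / f"{sample}.txt")
```

All four expressions name the same file; they differ only in how easily a reader can see the final path.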

Code formatting

$ mamba install snakefmt

$ snakefmt workflow/Snakefile workflow/rules/*.smk
