Snakemake Updates

Recent features and improvements

June 2021

 

Johannes Köster

 

https://koesterlab.github.io

>300k downloads since 2015

Snakemake

>1100 citations

>6 citations per week in 2020

     GIAB, Nextstrain, ...

[Figure: multiple datasets are processed into results]

Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "results/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
    input:
        "results/{dataset}.txt"
    output:
        "results/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

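To make the role of wildcards such as {dataset} concrete, here is a minimal Python sketch of how a requested file name can be matched against an output pattern to infer wildcard values. It is an illustration only, not Snakemake's implementation; the helper name match_wildcards is hypothetical, and the sketch does not handle a wildcard that occurs twice in one pattern.

```python
import re


def match_wildcards(pattern, path):
    """Return the wildcard values if `path` matches `pattern`, else None."""
    # re.split with one capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)  # literal path fragment
        else:
            regex += f"(?P<{part}>.+)"  # wildcard becomes a named group
    m = re.fullmatch(regex, path)
    return m.groupdict() if m else None


# A requested target determines the value of the wildcard.
print(match_wildcards("results/{dataset}.filtered.txt", "results/dataset1.filtered.txt"))
# A file that does not fit the pattern yields None.
print(match_wildcards("results/{dataset}.filtered.txt", "plots/myplot.pdf"))
```

Once the wildcard value is known, the same value is substituted into the rule's input pattern, which is how one rule definition serves every dataset.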
Beyond reproducibility

Adaptability

A new module system

module some_module:
    snakefile:
        "workflow/modules/some_module/Snakefile"

declare modules to be used in your workflow

use rule * from some_module

declare rule usage from module

use rule map_reads from some_module with:
    params:
        sort="coordinate"

modify specific rules from the module

A new module system

configfile: "config/config.yaml"

module rna_seq:
    snakefile:
        "https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/raw/v2.0.1/workflow/Snakefile"
    config:
        config["rna-seq"]

module dna_seq:
    snakefile:
        "https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/raw/v2.0.1/Snakefile"
    config:
        config["dna-seq"]
        

use rule * from rna_seq as rna_seq_*

use rule * from dna_seq as dna_seq_*
        

easily combine multiple workflows into one
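The "as rna_seq_*" suffix matters because two independently developed workflows may define rules with the same name. The Python sketch below illustrates the renaming idea only; the function apply_rule_prefix and the rule names are hypothetical and not taken from either workflow.

```python
def apply_rule_prefix(rule_names, pattern):
    """Map each rule name through an `as` pattern such as "rna_seq_*"."""
    return {name: pattern.replace("*", name) for name in rule_names}


# Hypothetical rule names; both modules happen to define a rule called "all".
rna_rules = apply_rule_prefix(["all", "quantify"], "rna_seq_*")
dna_rules = apply_rule_prefix(["all", "call_variants"], "dna_seq_*")

# After renaming, no rule name is shared between the two modules.
overlap = set(rna_rules.values()) & set(dna_rules.values())
print(rna_rules["all"], dna_rules["all"], overlap)
```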

rule some_integrated_analysis:
    input:
        calls="results/calls/all.vcf.gz",
        diffexp="results/diffexp/all.tsv"
    output:
        "results/integrated-analysis/all.svg"
    notebook:
        "workflow/notebooks/integrated-analysis.r.ipynb"
        

make extensions and modifications transparent

Deployment

snakedeploy deploy-workflow https://github.com/snakemake-workflows/rna-seq-star-deseq2 \
                            . --tag v1.1.2

├── config
│   ├── config.yaml
│   ├── README.md
│   ├── samples.tsv
│   └── units.tsv
└── workflow
    └── Snakefile

configfile: "config/config.yaml"


# declare https://github.com/snakemake-workflows/rna-seq-star-deseq2 as a module
module rna_seq_star_deseq2:
    snakefile:
        "https://github.com/snakemake-workflows/rna-seq-star-deseq2/raw/v1.1.2/workflow/Snakefile"
    config:
        config


# use all rules from https://github.com/snakemake-workflows/rna-seq-star-deseq2
use rule * from rna_seq_star_deseq2

Snakemake workflow catalog

Portability

Automatic containerization

fast and ad-hoc software stack definition
using conda packages

rule analyze_stuff:
    input:
        "resources/raw-data.tsv"
    output:
        "results/matrix.tsv"
    conda:
        "envs/pandas.yaml"
    script:
        "scripts/analyze-stuff.py"


rule plot_stuff:
    input:
        "results/matrix.tsv"
    output:
        "results/plots/myplot.pdf"
    conda:
        "envs/ggplot.yaml"
    notebook:
        "notebooks/plot-stuff.r.ipynb"

Automatic containerization

containerization automatically yields a
transparent yet concise Dockerfile

FROM condaforge/mambaforge:latest
LABEL io.github.snakemake.containerized="true"
LABEL io.github.snakemake.conda_env_hash="729e69b7e0a6c76ba7a5f69bd51474f68d37443999e0952f0e9d63bb0d9cfe92"

# Step 1: Retrieve conda environments

# Conda environment:
#   source: envs/ggplot.yaml
#   prefix: /conda-envs/dcef9d5a2891d184878bd1d9bde72a52
#   channels:
#     - conda-forge
#   dependencies:
#     - r-base 4.0
#     - r-ggplot2 3.3
RUN mkdir -p /conda-envs/dcef9d5a2891d184878bd1d9bde72a52
COPY envs/ggplot.yaml /conda-envs/dcef9d5a2891d184878bd1d9bde72a52/environment.yaml

# Conda environment:
#   source: envs/pandas.yaml
#   prefix: /conda-envs/250d5a01e8ff0d636f8f5d03dee073b7
#   channels:
#     - conda-forge
#   dependencies:
#     - python 3.9
#     - pandas 1.2
RUN mkdir -p /conda-envs/250d5a01e8ff0d636f8f5d03dee073b7
COPY envs/pandas.yaml /conda-envs/250d5a01e8ff0d636f8f5d03dee073b7/environment.yaml

# Step 2: Generate conda environments

RUN mamba env create --prefix /conda-envs/dcef9d5a2891d184878bd1d9bde72a52 --file /conda-envs/dcef9d5a2891d184878bd1d9bde72a52/environment.yaml && \
    mamba env create --prefix /conda-envs/250d5a01e8ff0d636f8f5d03dee073b7 --file /conda-envs/250d5a01e8ff0d636f8f5d03dee073b7/environment.yaml && \
    mamba clean --all -y

Automatic containerization

build, upload and use the resulting container image

containerized: "quay.io/some-username/my-workflow-image:1.0"


rule analyze_stuff:
    input:
        "resources/raw-data.tsv"
    output:
        "results/matrix.tsv"
    conda:
        "envs/pandas.yaml"
    script:
        "scripts/analyze-stuff.py"


rule plot_stuff:
    input:
        "results/matrix.tsv"
    output:
        "results/plots/myplot.pdf"
    conda:
        "envs/ggplot.yaml"
    notebook:
        "notebooks/plot-stuff.r.ipynb"

Scalability

Scheduling

\begin{aligned}
\max \quad & T \cdot S \cdot \sum_{j \in J} x_j \cdot p_j + S \cdot \sum_{j \in J} x_j \cdot (u_{t,j}+1) + \sum_{f \in F} \delta_f \cdot S_f \\
\text{subject to:} \quad & \sum_{j \in J} x_j \cdot u_{r,j} \leq U_r \quad \forall r \in R \\
& \delta_f \leq \frac{\sum_{j \in J} x_j \cdot z_{f,j}}{\sum_{j \in J} z_{f,j}} \quad \forall f \in F
\end{aligned}

  x_j: job selection
  p_j: job priority
  u_{t,j}: job thread usage
  u_{r,j}: job resource usage
  U_r: free resources
  z_{f,j}: job temp file consumption
  \delta_f: temp file lifetime fraction
  S_f: temp file size
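To get a feel for the objective, the following Python sketch evaluates it by brute force on a tiny instance. It is not how Snakemake solves the problem; the function name schedule, the job data, and the values chosen for the weights T and S are all assumptions made for illustration. Because the objective is maximized and temp file sizes are non-negative, each \delta_f is set to its upper bound.

```python
from itertools import product


def schedule(jobs, free, temp_sizes, T, S):
    """Enumerate all job selections x and return the feasible one with the best objective.

    jobs: list of dicts with keys
        "priority"  -> p_j
        "resources" -> dict r -> u_{r,j}, including the key "threads" (u_{t,j})
        "temp"      -> dict f -> z_{f,j} (1 if job j consumes temp file f)
    free: dict r -> U_r
    temp_sizes: dict f -> S_f
    """
    best_x, best_value = None, None
    for x in product((0, 1), repeat=len(jobs)):
        # Resource constraints: selected jobs must fit into the free resources.
        if any(
            sum(xj * job["resources"].get(r, 0) for xj, job in zip(x, jobs)) > cap
            for r, cap in free.items()
        ):
            continue
        value = T * S * sum(xj * job["priority"] for xj, job in zip(x, jobs))
        value += S * sum(xj * (job["resources"]["threads"] + 1) for xj, job in zip(x, jobs))
        for f, size in temp_sizes.items():
            consumers = sum(job["temp"].get(f, 0) for job in jobs)
            if consumers:
                selected = sum(xj * job["temp"].get(f, 0) for xj, job in zip(x, jobs))
                value += (selected / consumers) * size  # delta_f at its upper bound
        if best_value is None or value > best_value:
            best_x, best_value = x, value
    return best_x, best_value


# Hypothetical instance: two 2-thread jobs that consume the same temp file,
# one 4-thread job, and 4 free threads.
jobs = [
    {"priority": 0, "resources": {"threads": 2}, "temp": {"tmp1": 1}},
    {"priority": 0, "resources": {"threads": 2}, "temp": {"tmp1": 1}},
    {"priority": 0, "resources": {"threads": 4}, "temp": {}},
]
best_x, best_value = schedule(jobs, free={"threads": 4}, temp_sizes={"tmp1": 10}, T=10, S=10)
print(best_x, best_value)
```

In this instance the two small jobs win over the single large one: together they use all four threads and also allow the temp file to be deleted sooner.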

Runtime

Divide workflow into batches:

snakemake --batch myrule=1/10
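The idea behind myrule=1/10 is that only the first of ten slices of the input files of myrule is considered in this run. The Python sketch below shows one plausible way to slice a list of inputs into batches; the function batch_inputs and the file names are hypothetical, and Snakemake's actual partitioning may differ.

```python
def batch_inputs(inputs, batch, batches):
    """Return the inputs that belong to batch number `batch` (1-based) out of `batches`."""
    items = sorted(inputs)
    n = len(items)
    # Integer arithmetic spreads the items as evenly as possible over all batches.
    start = (batch - 1) * n // batches
    end = batch * n // batches
    return items[start:end]


# Hypothetical input files of the batched rule.
inputs = [f"data/sample{i:02d}.txt" for i in range(20)]
print(batch_inputs(inputs, 1, 10))
```

Running the batches 1/10 through 10/10 one after another covers every input exactly once.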

Between-workflow caching

[Figure: datasets and shared data flowing into results]

DAG partitioning

Automation

Jupyter notebook integration

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    notebook:
        "notebooks/mynotebook.ipynb"

  1. Integrated interactive edit mode.
  2. Automatic generalization for reuse in other jobs.

Transparency

Reports

Readability

Code linting

$ snakemake --lint

Lints for snakefile /tmp/tmpm_kodywk/PyPSA-pypsa-eur-265e939/Snakefile:
    * Mixed rules and functions in same snakefile.:
      Small one-liner functions used only once should be defined as lambda
      expressions. Other functions should be collected in a common module, e.g.
      'rules/common.smk'. This makes the workflow steps more readable.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/modularization.html#includes
    * Path composition with '+' in line 8:
      This becomes quickly unreadable. Usually, it is better to endure some
      redundancy against having a more readable workflow. Hence, just repeat
      common prefixes. If path composition is unavoidable, use pathlib or
      (python >= 3.6) string formatting with f"...".


Lints for rule retrieve_databundle (line 102, /tmp/tmpm_kodywk/PyPSA-pypsa-eur-265e939/Snakefile):
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers
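The second lint above recommends pathlib or f-strings over path composition with '+'. A minimal Python sketch of both alternatives follows; the directory layout and function names are hypothetical.

```python
from pathlib import Path

RESULTS = Path("results")


def plot_path(sample):
    """Compose the path with pathlib instead of '+'."""
    return (RESULTS / "plots" / f"{sample}.pdf").as_posix()


def plot_path_fstring(sample):
    """Repeat the common prefix and use an f-string."""
    return f"results/plots/{sample}.pdf"


print(plot_path("sampleA"))
print(plot_path_fstring("sampleA"))
```

Both variants keep the full path visible in one place, which is the readability goal the linter is after.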

Code formatting

$ mamba install snakefmt

$ snakefmt workflow/Snakefile workflow/rules/*.smk
