Snakemake Updates
Recent features and improvements
June 2021
Johannes Köster
https://koesterlab.github.io
>300k downloads since 2015
Snakemake
>1100 citations
>6 citations per week in 2020
GIAB, Nextstrain, ...
[Figure with labels: dataset (six times), results]
Define workflows
in terms of rules
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"

rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"

rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"
Beyond reproducibility
Adaptability
A new module system
module some_module:
    snakefile:
        "workflow/modules/some_module/Snakefile"
declare modules to be used in your workflow
use rule * from some_module
declare rule usage from module
use rule map_reads from some_module with:
    params:
        sort="coordinate"
modify specific rules from the module
A new module system
configfile: "config/config.yaml"

module rna_seq:
    snakefile:
        "https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/raw/v2.0.1/workflow/Snakefile"
    config:
        config["rna-seq"]

module dna_seq:
    snakefile:
        "https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/raw/v2.0.1/Snakefile"
    config:
        config["dna-seq"]

use rule * from rna_seq as rna_seq_*

use rule * from dna_seq as dna_seq_*
easily combine multiple workflows into one
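The effect of the `as rna_seq_*` and `as dna_seq_*` clauses can be pictured with a few lines of Python. This is only a sketch of the naming idea: `rename_rules` is a hypothetical helper, and the rule names below are invented for illustration rather than taken from the referenced workflows.

```python
def rename_rules(rule_names, pattern):
    """Map each rule name to its new name by substituting it for '*'."""
    return {name: pattern.replace("*", name) for name in rule_names}


# Hypothetical rule names; both modules happen to define a rule called "all".
rna_rules = rename_rules(["all", "quantify"], "rna_seq_*")
dna_rules = rename_rules(["all", "call_variants"], "dna_seq_*")

print(rna_rules["all"], dna_rules["all"])  # rna_seq_all dna_seq_all

# The prefixes keep the two rule sets disjoint after import.
print(set(rna_rules.values()) & set(dna_rules.values()))  # set()
```

Prefixing is what allows two independently developed workflows to coexist in one Snakefile even when they reuse the same rule names.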
A new module system
rule some_integrated_analysis:
    input:
        calls="results/calls/all.vcf.gz",
        diffexp="results/diffexp/all.tsv"
    output:
        "results/integrated-analysis/all.svg"
    notebook:
        "workflow/notebooks/integrated-analysis.r.ipynb"
make extensions and modifications transparent
Deployment
snakedeploy deploy-workflow https://github.com/snakemake-workflows/rna-seq-star-deseq2 \
    . --tag v1.1.2

├── config
│   ├── config.yaml
│   ├── README.md
│   ├── samples.tsv
│   └── units.tsv
└── workflow
    └── Snakefile
configfile: "config/config.yaml"

# declare https://github.com/snakemake-workflows/rna-seq-star-deseq2 as a module
module rna_seq_star_deseq2:
    snakefile:
        "https://github.com/snakemake-workflows/rna-seq-star-deseq2/raw/v1.1.2/workflow/Snakefile"
    config:
        config

# use all rules from https://github.com/snakemake-workflows/rna-seq-star-deseq2
use rule * from rna_seq_star_deseq2
Snakemake workflow catalog
Portability
Automatic containerization
fast and ad-hoc software stack definition
using conda packages
rule analyze_stuff:
    input:
        "resources/raw-data.tsv"
    output:
        "results/matrix.tsv"
    conda:
        "envs/pandas.yaml"
    notebook:
        "scripts/analyze-stuff.py.ipynb"

rule plot_stuff:
    input:
        "results/matrix.tsv"
    output:
        "results/plots/myplot.pdf"
    conda:
        "envs/ggplot.yaml"
    notebook:
        "notebooks/plot-stuff.r.ipynb"
Automatic containerization
containerization automatically yields a
transparent yet concise Dockerfile
FROM condaforge/mambaforge:latest
LABEL io.github.snakemake.containerized="true"
LABEL io.github.snakemake.conda_env_hash="729e69b7e0a6c76ba7a5f69bd51474f68d37443999e0952f0e9d63bb0d9cfe92"

# Step 1: Retrieve conda environments

# Conda environment:
#   source: envs/ggplot.yaml
#   prefix: /conda-envs/dcef9d5a2891d184878bd1d9bde72a52
#   channels:
#     - conda-forge
#   dependencies:
#     - r-base 4.0
#     - r-ggplot2 3.3
RUN mkdir -p /conda-envs/dcef9d5a2891d184878bd1d9bde72a52
COPY envs/ggplot.yaml /conda-envs/dcef9d5a2891d184878bd1d9bde72a52/environment.yaml

# Conda environment:
#   source: envs/pandas.yaml
#   prefix: /conda-envs/250d5a01e8ff0d636f8f5d03dee073b7
#   channels:
#     - conda-forge
#   dependencies:
#     - python 3.9
#     - pandas 1.2
RUN mkdir -p /conda-envs/250d5a01e8ff0d636f8f5d03dee073b7
COPY envs/pandas.yaml /conda-envs/250d5a01e8ff0d636f8f5d03dee073b7/environment.yaml

# Step 2: Generate conda environments

RUN mamba env create --prefix /conda-envs/dcef9d5a2891d184878bd1d9bde72a52 --file /conda-envs/dcef9d5a2891d184878bd1d9bde72a52/environment.yaml && \
    mamba env create --prefix /conda-envs/250d5a01e8ff0d636f8f5d03dee073b7 --file /conda-envs/250d5a01e8ff0d636f8f5d03dee073b7/environment.yaml && \
    mamba clean --all -y
Automatic containerization
build, upload and use the resulting container image
containerized: "quay.io/some-username/my-workflow-image:1.0"

rule analyze_stuff:
    input:
        "resources/raw-data.tsv"
    output:
        "results/matrix.tsv"
    conda:
        "envs/pandas.yaml"
    notebook:
        "scripts/analyze-stuff.py.ipynb"

rule plot_stuff:
    input:
        "results/matrix.tsv"
    output:
        "results/plots/myplot.pdf"
    conda:
        "envs/ggplot.yaml"
    notebook:
        "notebooks/plot-stuff.r.ipynb"
Scalability
Scheduling
[Figure with labeled terms: job selection, job priority, job thread usage, job resource usage, free resources, job temp file consumption, temp file size, temp file lifetime fraction]
Runtime
Divide workflow into batches:
snakemake --batch myrule=1/10
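What `myrule=1/10` selects can be sketched as follows: the input files of `myrule` are split into ten batches and only the first is processed. The partitioning below (contiguous chunks of the sorted inputs, with the remainder going to the final batch) is an assumption made for illustration, and `get_batch` is a hypothetical helper, not Snakemake's API.

```python
import math


def get_batch(input_files, batch, batches):
    """Return the input files belonging to batch number `batch` (1-based)."""
    if not 1 <= batch <= batches:
        raise ValueError("batch must be between 1 and batches")
    items = sorted(input_files)
    if len(items) < batches:
        raise ValueError("more batches than input files")
    size = math.floor(len(items) / batches)
    start = (batch - 1) * size
    if batch == batches:
        return items[start:]          # final batch takes the remainder
    return items[start:start + size]


# 25 hypothetical input files divided into 10 batches.
files = [f"data/sample{i:02d}.txt" for i in range(25)]

print(get_batch(files, 1, 10))        # ['data/sample00.txt', 'data/sample01.txt']
print(len(get_batch(files, 10, 10)))  # 7
```

Running the batches one after another covers every input exactly once, which keeps each invocation small for very large workflows.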
Between workflow caching
[Figure with labels: dataset (six times), results, shared data]
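The idea of reusing shared intermediate results between workflows can be sketched with a content-addressed cache. This is a toy model under stated assumptions, not Snakemake's caching code: the key is assumed to depend on the step's code, its parameters and its input contents, and `cache_key`, `OutputCache` and the example command string are hypothetical.

```python
import hashlib


def cache_key(rule_code, params, input_contents):
    """Derive a key from everything assumed to determine the step's output."""
    digest = hashlib.sha256()
    digest.update(rule_code.encode())
    for name in sorted(params):
        digest.update(f"{name}={params[name]!r}".encode())
    for content in input_contents:
        digest.update(hashlib.sha256(content).digest())
    return digest.hexdigest()


class OutputCache:
    """In-memory stand-in for a cache location shared by several workflows."""

    def __init__(self):
        self._store = {}
        self.computations = 0

    def fetch_or_compute(self, key, compute):
        if key not in self._store:
            self._store[key] = compute()   # only runs on a cache miss
            self.computations += 1
        return self._store[key]


cache = OutputCache()
key = cache_key("build-index {input}", {"mode": "default"}, [b"ACGT"])

# Two workflows requesting the same step share a single computation.
first = cache.fetch_or_compute(key, lambda: "index-built")
second = cache.fetch_or_compute(key, lambda: "index-built-again")
print(first == second, cache.computations)  # True 1
```

Because the key changes whenever the code, parameters or inputs change, a stale result is never returned for a modified step.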
DAG partitioning
Automation
Jupyter notebook integration
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    notebook:
        "notebooks/mynotebook.ipynb"
- Integrated interactive edit mode.
- Automatic generalization for reuse in other jobs.
Transparency
Reports
Readability
Code linting
$ snakemake --lint
Lints for snakefile /tmp/tmpm_kodywk/PyPSA-pypsa-eur-265e939/Snakefile:
    * Mixed rules and functions in same snakefile.:
      Small one-liner functions used only once should be defined as lambda
      expressions. Other functions should be collected in a common module, e.g.
      'rules/common.smk'. This makes the workflow steps more readable.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/modularization.html#includes
    * Path composition with '+' in line 8:
      This becomes quickly unreadable. Usually, it is better to endure some
      redundancy against having a more readable workflow. Hence, just repeat
      common prefixes. If path composition is unavoidable, use pathlib or
      (python >= 3.6) string formatting with f"...".

Lints for rule retrieve_databundle (line 102, /tmp/tmpm_kodywk/PyPSA-pypsa-eur-265e939/Snakefile):
    * Specify a conda environment or container for each rule.:
      This way, the used software for each specific step is documented, and the
      workflow can be executed on any machine without prerequisites.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#integrated-package-management
      https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html#running-jobs-in-containers
Code formatting
$ mamba install snakefmt
$ snakefmt workflow/Snakefile workflow/rules/*.smk