Snakemake Tutorial
Johannes Köster
2024
https://koesterlab.github.io
dataset
results
Data analysis
"Let me do that by hand..."
dataset
results
dataset
dataset
dataset
dataset
dataset
"Let me do that by hand..."
Data analysis
dataset
results
dataset
dataset
dataset
dataset
dataset
automation
From raw data to final figures:
- document parameters, tools, versions
- execute without manual intervention
Reproducible data analysis
dataset
results
dataset
dataset
dataset
dataset
dataset
scalability
Handle parallelization:
- execute for tens to thousands of datasets
- efficiently use any computing platform
automation
Reproducible data analysis
dataset
results
dataset
dataset
dataset
dataset
dataset
Handle deployment:
be able to easily execute analyses on a different system/platform/infrastructure
portability
scalability
automation
Reproducible data analysis
Beyond reproducibility
Snakemake
>1 million downloads since 2015
>3000 citations
>11 citations per week in 2023
affiliated
Part 1
Basics
dataset
results
dataset
dataset
dataset
dataset
dataset
Define workflows
in terms of rules
Define workflows
in terms of rules
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
rule myfiltration:
input:
"result/{dataset}.txt"
output:
"result/{dataset}.filtered.txt"
shell:
"mycommand {input} > {output}"
rule aggregate:
input:
"results/dataset1.filtered.txt",
"results/dataset2.filtered.txt"
output:
"plots/myplot.pdf"
script:
"scripts/myplot.R"
Define workflows
in terms of rules
Define workflows
in terms of rules
rule sort:
input:
"path/to/dataset.txt"
output:
"dataset.sorted.txt"
shell:
"sort {input} > {output}"
rule name
refer to input and output from shell command
how to create output from input
Define workflows
in terms of rules
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
shell:
"sort {input} > {output}"
generalize rules with
named wildcards
Define workflows
in terms of rules
rule sort_and_annotate:
input:
"path/to/{dataset}.txt",
"path/to/annotation.txt"
output:
"{dataset}.sorted.txt"
shell:
"paste <(sort {input[0]}) {input[1]} > {output}"
multiple input or output files
refer by index
Define workflows
in terms of rules
rule sort_and_annotate:
input:
a="path/to/{dataset}.txt",
b="path/to/annotation.txt"
output:
"{dataset}.sorted.txt"
shell:
"paste <(sort {input.a}) {input.b} > {output}"
name input and output files
refer by name
Define workflows
in terms of rules
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
run:
with open(output[0], "w") as out:
for l in sorted(open(input[0])):
print(l, file=out)
use Python in rules
External scripts
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
script:
"scripts/myscript.py"
refer to scripts:
- Python
- R
- Julia
- Rust
- Bash
External scripts
import pandas as pd
data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")
Python scripts:
External scripts
data <- read.table(snakemake@input[[1]])
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])
R scripts:
Jupyter notebooks
rule plot_histogram:
input:
"path/to/{dataset}.txt"
output:
"plots/{dataset}.hist.pdf"
notebook:
"notebooks/plot-histogram.py.ipynb"
draft, update or reuse notebook
Dependencies are determined top-down
Solution:
- For a given target, find a rule that can be applied to create it (a job).
- For the input files of the job, go on recursively.
- If the command specifies no target, Snakemake tries to apply the first rule in the workflow.
Problem:
for a given set of targets, find a composition of rules to create them
Dependencies are determined top-down
rule all:
input:
"D1.sorted.txt",
"D2.sorted.txt",
"D3.sorted.txt"
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
shell:
"sort {input} > {output}"
Job 1: apply rule all
(a target rule that just collects results)
Job i: apply rule sort to create i-th input of job 1
Dependencies are determined top-down
DATASETS = ["D1", "D2", "D3"]
rule all:
input:
[f"{dataset}.sorted.txt" for dataset in DATASETS]
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
shell:
"sort {input} > {output}"
Job 1: apply rule all
(a target rule that just collects results)
Job i: apply rule sort to create i-th input of job 1
use arbitrary Python code in your workflow
Dependencies are determined top-down
DATASETS = ["D1", "D2", "D3"]
rule all:
input:
expand("{dataset}.sorted.txt", dataset=DATASETS)
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
shell:
"sort {input} > {output}"
Job 1: apply rule all
(a target rule that just collects results)
Job i: apply rule sort to create i-th input of job 1
use arbitrary Python code in your workflow
Directed acyclic graph (DAG) of jobs
Job execution
A job is executed if and only if any of these conditions applies:
- output file is target and does not exist
- output file needed by another executed job and does not exist
- input file newer than output file
- rule has been modified
- input file will be updated by other job
- execution is enforced
checked with breadth-first-search on DAG of jobs
Command line interface
Assumption: rules defined in a Snakefile in the same directory or a workflow sub-directory.
# execute the workflow with target D1.sorted.txt
snakemake D1.sorted.txt --cores 1
# execute the workflow without target: first rule defines target
snakemake --cores 2
# dry-run
snakemake -n
# visualize the DAG of jobs using the Graphviz dot command
snakemake --dag | dot -Tsvg > dag.svg
Part 2
Advanced
https://doi.org/10.12688/f1000research.29032.1
Parallelization
Disjoint paths in the DAG of jobs can be executed in parallel.
# execute the workflow with 8 cores
snakemake --cores 8
execute 8 jobs in parallel?
Defining resources
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
threads: 4
resources: mem_mb=100
shell:
"sort --parallel {threads} {input} > {output}"
refer to defined thread number
define arbitrary additional resources
define used threads
Defining resources
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
threads: 4
resources: mem_mb=100
shell:
"sort --parallel {threads} {input} > {output}"
# execute the workflow with 8 cores
snakemake --cores 8
can execute 2 sort jobs in parallel
Defining resources
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
threads: 4
resources: mem_mb=100
shell:
"sort --parallel {threads} {input} > {output}"
# execute the workflow with 2 cores
snakemake --cores 2
can execute 1 sort job in parallel (automatically using 2 threads)
Defining resources
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
threads: 4
resources: mem_mb=100
shell:
"sort --parallel {threads} {input} > {output}"
# execute the workflow with 10 cores
snakemake --cores 10
can execute 2 sort jobs in parallel
Defining resources
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
threads: 4
resources: mem_mb=100
shell:
"sort --parallel {threads} {input} > {output}"
# execute the workflow with 10 cores and 100MB memory
snakemake --cores 10 --resources mem_mb=100
can execute 1 sort job in parallel
Scheduling
Available jobs are scheduled to
- maximize parallelization
- prefer high priority jobs
- minimize temp file lifetime
while satisfying resource constraints.
job selection
job resource usage
free resources
job temp file consumption
temp file lifetime fraction
job priority
job thread usage
Scheduling
temp file size
temp file deletion
Scalable to any platform
workstation
compute server
cluster
grid computing
cloud computing
Config files
configfile: "config.yaml"
rule all:
input:
expand("{dataset}.sorted.txt", dataset=config["datasets"])
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
threads: 4
resources: mem_mb=100
shell:
"sort --parallel {threads} {input} > {output}"
define config file
refer to config values
Input functions
Workflows are executed in three phases
- initialization phase (parsing)
- DAG phase (DAG is built)
- scheduling phase (execution of DAG)
Input functions defer determination of input files to the DAG phase
(when wildcard values are known).
Input functions
configfile: "config.yaml"
rule all:
input:
expand("{dataset}.sorted.txt", dataset=config["datasets"])
def get_sort_input(wildcards):
return config["datasets"][wildcards.dataset]
rule sort:
input:
get_sort_input
output:
"{dataset}.sorted.txt"
threads: 4
resources: mem_mb=100
shell:
"sort --parallel {threads} {input} > {output}"
- initialization phase (parsing)
- DAG phase (DAG is built)
- scheduling phase (execution of DAG)
New in Snakemake 8: semantic helper functions
rule mytask:
input:
branch(
lookup(dpath="switches/mark_duplicates", within=config),
then="results/mapped/{sample}.bam",
otherwise="results/mapped/{sample}.rmdup.bam",
)
output:
"results/{sample}.txt"
script:
"scripts/somestep.py"
Logging
configfile: "config.yaml"
rule all:
input:
expand("{dataset}.sorted.txt", dataset=config["datasets"])
def get_sort_input(wildcards):
return config["datasets"][wildcards.dataset]
rule sort:
input:
get_sort_input
output:
"{dataset}.sorted.txt"
log:
"logs/sort/{dataset}.log"
threads: 4
resources: mem_mb=100
shell:
"sort --parallel {threads} {input} > {output} 2> {log}"
define log file
refer to log file from shell command
Cluster/cloud execution
Configuration profiles
snakemake --profile lsf --jobs 1000
$HOME/.config/snakemake/lsf
├── config.yaml
├── jobscript.sh
├── status.py
└── submit.py
Reproducible software installation
dataset
results
dataset
dataset
dataset
dataset
dataset
Full reproducibility:
install required software and all dependencies in exact versions
Software installation is heterogeneous
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
easy_install snakemake
./configure --prefix=/usr/local
make
make install
cp lib/amd64/jli/*.so lib
cp lib/amd64/*.so lib
cp * $PREFIX
cpan -i bioperl
cmake ../../my_project \
-DCMAKE_MODULE_PATH=~/devel/seqan/util/cmake \
-DSEQAN_INCLUDE_PATH=~/devel/seqan/include
make
make install
apt-get install bwa
yum install python-h5py
install.packages("matrixpls")
Package management with
package:
name: seqtk
version: 1.2
source:
fn: v1.2.tar.gz
url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz
requirements:
build:
- gcc
- zlib
run:
- zlib
about:
home: https://github.com/lh3/seqtk
license: MIT License
summary: Seqtk is a fast and lightweight tool for processing sequences
test:
commands:
- seqtk seq
Idea:
Normalize installation via recipes
#!/bin/bash
export C_INCLUDE_PATH=${PREFIX}/include
export LIBRARY_PATH=${PREFIX}/lib
make all
mkdir -p $PREFIX/bin
cp seqtk $PREFIX/bin
- source or binary
- recipe and build script
- package
Easy installation and management:
no admin rights needed
conda install pandas
conda update pandas
conda remove pandas
conda env create -f myenv.yaml -n myenv
Isolated environments:
channels:
- conda-forge
- nodefaults
dependencies:
- pandas ==0.20.3
- statsmodels ==0.8.0
- r-dplyr ==0.7.0
- r-base ==3.4.1
- python ==3.6.0
Package management with
- Over 8000 bioinformatics related packages (C, C++, Python, R, Perl, ...).
- Over 140 million downloads.
Over 650 contributors
CONDA-FORGE
Partner project for general purpose software:
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
conda:
"envs/mycommand.yaml"
shell:
"mycommand {input} > {output}"
Integration with Snakemake
channels:
- bioconda
- conda-forge
- nodefaults
dependencies:
- mycommand ==2.3.1
# automatic deployment of dependencies
snakemake --sdm conda
Integrated with popular workflow management systems
Conda Integration with WMS
Containers
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
container:
"docker://biocontainers/some-tool#2.3.1"
shell:
"some-tool {input} > {output}"
Containerization
containerized:
"docker://username/myworkflow:1.0.0"
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
conda:
"envs/some-tool.yaml"
shell:
"some-tool {input} > {output}"
snakemake --containerize > Dockerfile
Self-contained HTML reports
Wrappers
rule samtools_sort:
input:
"mapped/{sample}.bam"
output:
"mapped/{sample}.sorted.bam"
params:
"-m 4G"
threads: 8
wrapper:
"0.2.0/bio/samtools/sort"
# automatic deployment of dependencies
snakemake --sdm conda
Wrappers
Distribution of workflows
Git repository with
├── .gitignore
├── README.md
├── LICENSE.md
├── config
│ └── config.yaml
├── workflow
│ ├── envs
│ │ ├── env1.yaml
│ │ └── env2.yaml
│ ├── report
│ │ ├── fig1.rst
│ │ ├── fig2.rst
│ │ └── workflow.rst
│ ├── rules
│ │ ├── rules1.smk
│ │ └── rules2.smk
│ ├── scripts
│ │ ├── script1.py
│ │ └── script2.R
│ ├── notebooks
│ │ ├── notebook1.py.ipynb
│ │ └── notebook2.r.ipynb
│ └── Snakefile
├── resources
└── results
Using and combining workflows
configfile: "config/config.yaml"
module dna_seq:
snakefile:
"https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/raw/v2.0.1/Snakefile"
config:
config
use rule * from dna_seq
# easily extend the workflow
rule plot_vafs:
input:
"filtered/all.vcf.gz"
output:
"results/plots/vafs.svg"
notebook:
"notebooks/plot-vafs.py.ipynb"
Using and combining workflows
configfile: "config/config.yaml"
module dna_seq:
snakefile:
"https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/raw/v2.0.1/Snakefile"
config:
config["dna-seq"]
use rule * from dna_seq as dna_seq_*
# easily extend the workflow
rule plot_vafs:
input:
"filtered/all.vcf.gz"
output:
"results/plots/vafs.svg"
notebook:
"notebooks/plot-vafs.py.ipynb"
module rna_seq:
snakefile:
"https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/raw/v1.0.0/Snakefile"
config:
config["rna-seq"]
use rule * from rna_seq as rna_seq_*
Many additional features
- piping between jobs
- group jobs
- handling of temporary and protected files
- tracking of parameter and code changes
- remote file support (S3/Google, Dropbox, HTTPS, FTP, ...)
- dynamic conditional execution depending on output
Acknowledgements
Contributors:
Andreas Wilm
Anthony Underwood
Ryan Dale
David Alexander
Elias Kuthe
Elmar Pruesse
Hyeshik Chang
Jay Hesselberth
Jesper Foldager
John Huddleston
all users and supporters
Joona Lehtomäki
Justin Fear
Karel Brinda
Karl Gutwin
Kemal Eren
Kostis Anagnostopoulos
Kyle A. Beauchamp
Simon Ye
Tobias Marschall
Willem Ligtenberg
Development team:
Christopher Tomkins-Tinch
David Koppstein
Tim Booth
Manuel Holtgrewe
Christian Arnold
Wibowo Arindrarto
Rasmus Ågren
Kyle Meyer
Lance Parsons
Manuel Holtgrewe
Marcel Martin
Matthew Shirley
Mattias Franberg
Matt Shirley
Paul Moore
percyfal
Per Unneberg
Ryan C. Thompson
Ryan Dale
Sean Davis
Resources
Rolling paper:
https://doi.org/10.12688/f1000research.29032.1
Documentation and change log:
https://snakemake.readthedocs.io
Questions:
http://stackoverflow.com/questions/tagged/snakemake
Wrappers:
https://snakemake-wrappers.readthedocs.io
Gold standard workflows:
https://github.com/snakemake-workflows/docs
Workflow catalog:
https://snakemake.github.io/snakemake-workflow-catalog
Configuration profiles:
https://github.com/snakemake-profiles/doc
Code linting:
snakemake --lint
Code formatting:
https://github.com/snakemake/snakefmt
Command line help:
snakemake --help
Editors:
vscode, pycharm, theia, vim, emacs
Snakemake Tutorial
By Johannes Köster
Snakemake Tutorial
Tutorial slides that complement https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html
- 27,054