Johannes Köster
2015
Analyses usually entail the application of various tools, algorithms and scripts.
Workflow management
handles boilerplate:
Snakemake infers dependencies and execution order.
text based:
Python + domain specific syntax
Decompose workflow into rules.
Rules define how to obtain output files from input files.
~1000 downloads per week
rule sort:
input:
"path/to/dataset.txt"
output:
"dataset.sorted.txt"
shell:
"sort {input} > {output}"
rule name
refer to input and output from shell command
how to create output from input
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
shell:
"sort {input} > {output}"
generalize rules with
named wildcards
rule sort_and_annotate:
input:
"path/to/{dataset}.txt",
"path/to/annotation.txt"
output:
"{dataset}.sorted.txt"
shell:
"paste <(sort {input[0]}) {input[1]} > {output}"
multiple input or output files
refer by index
rule sort_and_annotate:
input:
a="path/to/{dataset}.txt",
b="path/to/annotation.txt"
output:
"{dataset}.sorted.txt"
shell:
"paste <(sort {input.a}) {input.b} > {output}"
name input and output files
refer by name
rule sort:
input:
a="path/to/{dataset}.txt"
output:
b="{dataset}.sorted.txt"
run:
with open(output.b, "w") as out:
for l in sorted(open(input.a)):
print(l, file=out)
use Python in rules
rule sort:
input:
a="path/to/{dataset}.txt"
output:
b="{dataset}.sorted.txt"
script:
"scripts/myscript.R"
refer to Python or R scripts
(in version 3.5)
DATASETS = ["D1", "D2", "D3"]
rule all:
input:
# returns ["D1.sorted.txt", "D2.sorted.txt", ...]
expand("{dataset}.sorted.txt", dataset=DATASETS)
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
shell:
"sort {input} > {output}"
Job 1: apply rule all
(a target rule that just collects results)
Job i: apply rule sort to create i-th input of job 1
use arbitrary Python code in your workflow
A job is executed if and only if
determined via breadth-first-search on DAG of jobs
Assumption: workflow defined in a Snakefile in the same directory.
# execute the workflow with target D1.sorted.txt
snakemake D1.sorted.txt
# execute the workflow without target: first rule defines target
snakemake
# dry-run
snakemake -n
# dry-run, print shell commands
snakemake -n -p
# dry-run, print execution reason for each job
snakemake -n -r
# visualize the DAG of jobs using the Graphviz dot command
snakemake --dag | dot -Tsvg > dag.svg
Disjoint paths in the DAG of jobs can be executed in parallel.
# execute the workflow with 8 cores
snakemake --cores 8
execute 8 jobs in parallel?
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
threads: 4
resources: mem_mb=100
shell:
"sort --parallel {threads} {input} > {output}"
refer to defined thread number
define arbitrary additional resources
define used threads
Assumption: workflow defined in a Snakefile in the same directory.
# execute the workflow with 8 cores
snakemake --cores 8
# prioritize the creation of a certain file
snakemake --prioritize D1.sorted.txt --cores 8
# execute the workflow with 8 cores and 100MB memory
snakemake --cores 8 --resources mem_mb=100
can execute 2 sort jobs in parallel
can execute only 1 sort job in parallel
Available jobs are scheduled to
s.t.
available jobs
priority
descendants
input size
resource usage
free resource (e.g. CPU cores)
configfile: "config.yaml"
rule all:
input:
expand("{dataset}.sorted.txt", dataset=config["datasets"])
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
shell:
"sort {input} > {output}"
define config file
refer to config values
configfile: "config.yaml"
rule all:
input:
expand("{dataset}.sorted.txt", dataset=config["datasets"])
rule sort:
input:
lambda wildcards: config["datasets"][wildcards.dataset]
output:
"{dataset}.sorted.txt"
log:
"logs/sort/{dataset}.log"
shell:
"sort {input} > {output} 2> {log}"
define log file
refer to log file from shell command
# execute the workflow on cluster with qsub submission command
# (and up to 100 parallel jobs)
snakemake --cluster qsub --jobs 100
# tell the cluster system about the used threads
snakemake --cluster "qsub -pe threaded {threads}" --jobs 100
# execute the workflow with synchronized qsub
snakemake --cluster-sync "qsub -sync yes" --jobs 100
# execute the workflow with DRMAA
snakemake --drmaa --jobs 100
Workflows are executed in three phases
Input functions defer determination of input files to the DAG phase
(when wildcard values are known).
configfile: "config.yaml"
rule all:
input:
expand("{dataset}.sorted.txt", dataset=config["datasets"])
rule sort:
input:
lambda wildcards: config["datasets"][wildcards.dataset]
output:
"{dataset}.sorted.txt"
threads: 4
resources: mem_mb=100
shell:
"sort --parallel {threads} {input} > {output}"
input functions take the determined wildcard values as only argument
Solution 1: Git repository with
├── config.yaml
├── requirements.txt
├── scripts
│ ├── script1.py
│ └── script2.R
└── Snakefile
# clone workflow into working directory
git clone https://bitbucket.org/user/myworkflow.git path/to/workdir
cd path/to/workdir
# edit config and workflow as needed
vim config.yaml
# install dependencies into isolated environment
conda create -n myworkflow --file requirements.txt
# activate environment
source activate myworkflow
# execute workflow
snakemake -n
Solution 2: Python/Conda package
# install workflow with all dependencies into isolated environment
conda create -n myworkflow myworkflow
# activate environment
source activate myworkflow
# copy Snakefile and config file into working directory
myworkflow init path/to/workdir
cd path/to/workdir
# edit config and workflow as needed
vim config.yaml
# execute workflow
snakemake -n
Solution 3: Hide workflow in package
# install workflow with all dependencies into isolated environment
conda create -n myworkflow myworkflow
# activate environment
source activate myworkflow
# copy only config file into working directory
myworkflow init path/to/workdir
cd path/to/workdir
# edit config as needed
vim config.yaml
# execute workflow with a wrapper that uses a Snakefile
# hidden in the package and delegates execution to Snakemake
myworkflow run -n
Sven Rahmann, Universität Duisburg-Essen
Christopher Schröder, Universität Duisburg-Essen
Marcel Martin, SciLifeLab Stockholm
Tobias Marschall, Max Planck Institute for Informatics
Sean Davis, NIH
David Koppstein, AbVitro
Ryan Dale, NIH
Chris Tomkins-Tinch, Broad Institute
Hyeshik Chang, Seoul National University
Karel Brinda, Université Paris-Est Marne-la-Vallée
Anthony Underwood, Public Health England
Elias Kuthe, TU Dortmund
Paul Moore, Atos SE
Mattias Frånberg, Karolinska Institute
Simon Ye, MIT
Willem Ligtenberg, Open Analytics
Per Unneberg, SciLifeLab Stockholm
Matthew Shirley, Johns Hopkins School of Medicine
Jeremy Leipzig, Childrens Hospital of Philadelphia
Kemal Eren
Kyle Meyer
all users and supporters
https://bitbucket.org/johanneskoester/snakemake
Köster, Johannes and Rahmann, Sven. "Snakemake - A scalable bioinformatics workflow engine". Bioinformatics 2012.
Köster, Johannes. "Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis", PhD thesis, TU Dortmund 2014.
https://bioconda.github.io