Alexander Schönhuth
2016
Snakemake by Johannes Köster
dataset
results
dataset
results
dataset
dataset
dataset
dataset
dataset
dataset
results
dataset
dataset
dataset
dataset
dataset
reproducibility
From raw data to final figures:
dataset
results
dataset
dataset
dataset
dataset
dataset
scalability
Handle parallelization:
execute for tens to thousands of datasets
Avoid redundancy:
dataset
results
dataset
dataset
dataset
dataset
dataset
scalability
reproducibility
Workflow management:
formalize, document and execute data analyses
Genome of the Netherlands:
GoNL consortium. Nature Genetics 2014.
Cancer:
Townsend et al. Cancer Cell 2016.
Schramm et al. Nature Genetics 2015.
Martin et al. Nature Genetics 2013.
Ebola:
Park et al. Cell 2015
iPSC:
Burrows et al. PLOS Genetics 2016.
Computational methods:
Ziller et al. Nature Methods 2015.
Schmied et al. Bioinformatics 2015.
Břinda et al. Bioinformatics 2015
Chang et al. Molecular Cell 2014.
Marschall et al. Bioinformatics 2012.
dataset
results
dataset
dataset
dataset
dataset
dataset
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
rule myfiltration:
input:
"result/{dataset}.txt"
output:
"result/{dataset}.filtered.txt"
shell:
"mycommand {input} > {output}"
rule aggregate:
input:
"results/dataset1.filtered.txt",
"results/dataset2.filtered.txt"
output:
"plots/myplot.pdf"
script:
"scripts/myplot.R"
rule mytask:
input:
"path/to/dataset.txt"
output:
"result/dataset.txt"
shell:
"mycommand {input} > {output}"
rule name
refer to input and output from shell command
how to create output from input
generalize rules with
named wildcards
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
shell:
"mycommand {input} > {output}"
refer to Python script
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.py"
refer to R script
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
rule aggregate:
input:
"results/dataset1.txt",
"results/dataset2.txt"
output:
"plots/myplot.pdf"
script:
"scripts/myplot.R"
rule mytask:
input:
"path/to/dataset2.txt"
output:
"result/dataset2.txt"
script:
"scripts/myscript.R"
rule aggregate:
input:
"results/dataset1.txt",
"results/dataset2.txt"
output:
"plots/myplot.pdf"
script:
"scripts/myplot.R"
rule mytask:
input:
"path/to/dataset1.txt"
output:
"result/dataset1.txt"
script:
"scripts/myscript.R"
rule bwa_map:
input:
"data/genome.fa",
"data/samples/A.fastq"
output:
"mapped_reads/A.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
snakemake mapped_reads/A.bam
# generates target file mapped_reads/A.bam
# using rules in ./Snakefile
snakemake -np mapped_reads/A.bam
# -n is for dry run; -p for printing shell commands
Snakefile
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
snakemake mapped_reads/A.bam mapped_reads/B.bam
# Values A and B will be propagated to {sample}
# generates target files
# mapped_reads/A.bam and mapped_reads/B.bam
snakemake -np mapped_reads/A.bam mapped_reads/B.bam
# -n is for dry run; -p for printing shell commands
Snakefile
rule bwa_map:
input:
"data/genome.fa",
"data/samples/A.fastq"
output:
"mapped_reads/A.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
snakemake sorted_reads/A.bam
# generates target file sorted_reads/A.bam
# Snakemake determines already applied rules in dependency graph
# If sorted_reads/A.bam already generated: nothing to be done
# If mapped_reads/A.bam already generated: only rule samtools_sort applied
# If only original data exists: both bwa_map and samtools_sort applied
# in the order of their dependency
Snakefile
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample}"
"-O bam {input} > {output}"
rule sort_and_annotate:
input:
"path/to/{dataset}.txt",
"path/to/annotation.txt"
output:
"{dataset}.sorted.txt"
shell:
"paste <(sort {input[0]}) {input[1]} > {output}"
multiple input or output files
refer by index
rule sort_and_annotate:
input:
a="path/to/{dataset}.txt",
b="path/to/annotation.txt"
output:
"{dataset}.sorted.txt"
shell:
"paste <(sort {input.a}) {input.b} > {output}"
name input and output files
refer by name
rule sort:
input:
a="path/to/{dataset}.txt"
output:
b="{dataset}.sorted.txt"
run:
with open(output.b, "w") as out:
for l in sorted(open(input.a)):
print(l, file=out)
use Python in rules
rule sort:
input:
a="path/to/{dataset}.txt"
output:
b="{dataset}.sorted.txt"
script:
"scripts/myscript.R"
refer to Python or R scripts
(in version 3.5)
DATASETS = ["D1", "D2", "D3"]
rule all:
input:
expand("{dataset}.sorted.txt", dataset=DATASETS)
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
shell:
"sort {input} > {output}"
Job 1: apply rule all
(a target rule that just collects results)
Job i: apply rule sort to create i-th input of job 1
use arbitrary Python code in your workflow
A job is executed if and only if
determined via breadth-first-search on DAG of jobs
Assumption: workflow defined in a Snakefile in the same directory.
# execute the workflow with target D1.sorted.txt
snakemake D1.sorted.txt
# execute the workflow without target: first rule defines target
snakemake
# dry-run
snakemake -n
# dry-run, print shell commands
snakemake -n -p
# dry-run, print execution reason for each job
snakemake -n -r
# visualize the DAG of jobs using the Graphviz dot command
snakemake --dag | dot -Tsvg > dag.svg
Disjoint paths in the DAG of jobs can be executed in parallel.
# execute the workflow with 8 cores
snakemake --cores 8
execute 8 jobs in parallel?
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
threads: 4
resources: mem_mb=100
shell:
"sort --parallel {threads} {input} > {output}"
refer to defined thread number
define arbitrary additional resources
define used threads
Assumption: workflow defined in a Snakefile in the same directory.
# execute the workflow with 8 cores
snakemake --cores 8
# prioritize the creation of a certain file
snakemake --prioritize D1.sorted.txt --cores 8
# execute the workflow with 8 cores and 100MB memory
snakemake --cores 8 --resources mem_mb=100
can execute 2 sort jobs in parallel
can execute only 1 sort job in parallel
Available jobs are scheduled to
s.t.
available jobs
priority
descendants
input size
resource usage
free resource (e.g. CPU cores)
configfile: "config.yaml"
rule all:
input:
expand("{dataset}.sorted.txt", dataset=config["datasets"])
rule sort:
input:
"path/to/{dataset}.txt"
output:
"{dataset}.sorted.txt"
threads: 4
resources: mem_mb=100
shell:
"sort --parallel {threads} {input} > {output}"
define config file
refer to config values
Workflows are executed in three phases
Input functions defer determination of input files to the DAG phase
(when wildcard values are known).
configfile: "config.yaml"
rule all:
input:
expand("{dataset}.sorted.txt", dataset=config["datasets"])
rule sort:
input:
lambda wildcards: config["datasets"][wildcards.dataset]
output:
"{dataset}.sorted.txt"
log:
"logs/sort/{dataset}.log"
threads: 4
resources: mem_mb=100
shell:
"sort --parallel {threads} {input} > {output} 2> {log}"
define log file
refer to log file from shell command
# execute the workflow on cluster with qsub submission command
# (and up to 100 parallel jobs)
snakemake --cluster qsub --jobs 100
# tell the cluster system about the used threads
snakemake --cluster "qsub -pe threaded {threads}" --jobs 100
# execute the workflow with synchronized qsub
snakemake --cluster-sync "qsub -sync yes" --jobs 100
# execute the workflow with DRMAA
snakemake --drmaa --jobs 100
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
easy_install snakemake
./configure --prefix=/usr/local
make
make install
cp lib/amd64/jli/*.so lib
cp lib/amd64/*.so lib
cp * $PREFIX
cpan -i bioperl
cmake ../../my_project \
-DCMAKE_MODULE_PATH=~/devel/seqan/util/cmake \
-DSEQAN_INCLUDE_PATH=~/devel/seqan/include
make
make install
apt-get install bwa
yum install python-h5py
install.packages("matrixpls")
Standardize Bioinformatics software distribution.
Started last fall:
>80 contributors
>1000 packages
Standardize Bioinformatics software distribution.
source or binary
package:
name: seqtk
version: 1.2
source:
fn: v1.2.tar.gz
url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz
requirements:
build:
- gcc
- zlib
run:
- zlib
about:
home: https://github.com/lh3/seqtk
license: MIT License
summary: Seqtk is a fast and lightweight tool for processing sequences
test:
commands:
- seqtk seq
package
dataset
results
dataset
dataset
dataset
dataset
dataset
Full reproducibility:
install required software and all dependencies in exact versions
Easy installation and management:
conda install --channel bioconda bwa=0.7.15
conda update bwa
conda remove bwa
Isolated environments:
channels:
- r
- bioconda
dependencies:
- picard ==2.3.0
- samtools ==1.3.0
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
environment:
"envs/mycommand.yaml"
shell:
"mycommand {input} > {output}"
channels:
- r
- bioconda
dependencies:
- mycommand ==2.3.1
Solution 1: Git repository with
├── config.yaml
├── environment.yml
├── scripts
│ ├── script1.py
│ └── script2.R
└── Snakefile
# clone workflow into working directory
git clone https://bitbucket.org/user/myworkflow.git path/to/workdir
cd path/to/workdir
# edit config and workflow as needed
vim config.yaml
# install dependencies into isolated environment
conda env create -n myworkflow --file environment.yml
# activate environment
source activate myworkflow
# execute workflow
snakemake -n
Solution 2: Python/Conda package
# install workflow with all dependencies into isolated environment
conda create -n myworkflow myworkflow
# activate environment
source activate myworkflow
# copy Snakefile and config file into working directory
myworkflow init path/to/workdir
cd path/to/workdir
# edit config and workflow as needed
vim config.yaml
# execute workflow
snakemake -n
Solution 3: Hide workflow in package
# install workflow with all dependencies into isolated environment
conda create -n myworkflow myworkflow
# activate environment
source activate myworkflow
# copy only config file into working directory
myworkflow init path/to/workdir
cd path/to/workdir
# edit config as needed
vim config.yaml
# execute workflow with a wrapper that uses a Snakefile
# hidden in the package and delegates execution to Snakemake
myworkflow run -n
of data analyses.
Snakemake ensures reproducibility and scalability via
Bioconda is a distribution of Bioinformatics software that
the installation.
Combined, they enable fully reproducible data analysis.