Reproducible and scalable data analysis with Snakemake
Johannes Köster
2017
(Diagram: many datasets feed into a data analysis that produces results.)
reproducibility
From raw data to final figures:
- document parameters, tools, versions
- execute without manual intervention
scalability
Handle parallelization:
execute for tens to thousands of datasets
Avoid redundancy:
- when adding datasets
- when resuming from failures
scalability
reproducibility
Workflow management:
formalize, document and execute data analyses
Large, constantly growing community
Reproducibility with Snakemake
Genome of the Netherlands:
GoNL consortium. Nature Genetics 2014.
Cancer:
Townsend et al. Cancer Cell 2016.
Schramm et al. Nature Genetics 2015.
Martin et al. Nature Genetics 2013.
Ebola:
Park et al. Cell 2015.
iPSC:
Burrows et al. PLOS Genetics 2016.
Computational methods:
Ziller et al. Nature Methods 2015.
Schmied et al. Bioinformatics 2015.
Břinda et al. Bioinformatics 2015.
Chang et al. Molecular Cell 2014.
Marschall et al. Bioinformatics 2012.
Define workflows
in terms of rules
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
rule myfiltration:
input:
"result/{dataset}.txt"
output:
"result/{dataset}.filtered.txt"
shell:
"mycommand {input} > {output}"
rule aggregate:
input:
"results/dataset1.filtered.txt",
"results/dataset2.filtered.txt"
output:
"plots/myplot.pdf"
script:
"scripts/myplot.R"
Define workflows
in terms of rules
rule mytask:
input:
"path/to/dataset.txt"
output:
"result/dataset.txt"
shell:
"mycommand {input} > {output}"
In this rule: mytask is the rule name; {input} and {output} refer to the rule's input and output files from within the shell command; the shell directive describes how to create the output from the input.
Define workflows
in terms of rules
generalize rules with
named wildcards
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
shell:
"mycommand {input} > {output}"
Define workflows
in terms of rules
refer to Python script
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.py"
Define workflows
in terms of rules
refer to R script
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
Define workflows
in terms of rules
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
rule aggregate:
input:
"results/dataset1.txt",
"results/dataset2.txt"
output:
"plots/myplot.pdf"
script:
"scripts/myplot.R"
rule mytask:
input:
"path/to/dataset2.txt"
output:
"result/dataset2.txt"
script:
"scripts/myscript.R"
rule mytask:
input:
"path/to/dataset1.txt"
output:
"result/dataset1.txt"
script:
"scripts/myscript.R"
Dependencies are determined automatically
Directed acyclic graph (DAG) of jobs
Command line interface
# execute the workflow with target D1.sorted.txt
snakemake D1.sorted.txt
# execute the workflow without target: first rule defines target
snakemake
# dry-run
snakemake -n
# dry-run, print shell commands
snakemake -n -p
# dry-run, print execution reason for each job
snakemake -n -r
# visualize the DAG of jobs using the Graphviz dot command
snakemake --dag | dot -Tsvg > dag.svg
Parallelization
Disjoint paths in the DAG of jobs can be executed in parallel.
# execute the workflow with 8 cores
snakemake --cores 8
Does this execute 8 jobs in parallel?
Parallelization
schedule according to given resources
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
threads: 4
resources:
mem_gb=2
shell:
"mycommand {input} > {output}"
Command line interface
# execute the workflow with 8 cores
# (each job needs 4 threads, so 2 jobs can run in parallel)
snakemake --cores 8
# execute the workflow with 8 cores and 3 GB of memory
# (each job needs 2 GB, so only 1 job can run at a time)
snakemake --cores 8 --resources mem_gb=3
Scheduling
Select which of the currently available jobs to run next:
maximize the total priority, number of descendants and input size of the selected jobs,
s.t. their combined resource usage does not exceed the free resources (e.g. CPU cores).
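Written out, this selection can be read as a knapsack-style integer program; a sketch assuming binary variables x_j over the available jobs J, with priority p_j, descendant count d_j, input size s_j, per-resource usage u_{jr}, free resources F_r, and unspecified weights alpha, beta, gamma:

\max_{x \in \{0,1\}^{|J|}} \sum_{j \in J} \left(\alpha\, p_j + \beta\, d_j + \gamma\, s_j\right) x_j
\quad \text{s.t.} \quad \sum_{j \in J} u_{jr}\, x_j \le F_r \quad \text{for every resource } r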
Many additional features
- scaling from workstation to cluster without workflow modification
- modularization
- handling of temporary and protected files (see the sketch after this list)
- record logging information
- HTML reports
- tracking of tool versions and code changes
- remote file support (S3/Google, Dropbox, HTTPS, FTP, ...)
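As an illustration of the temporary and protected file handling mentioned above, a sketch using Snakemake's temp() and protected() markers (rule names, file names and the indexing command are made up):

rule sortresult:
    input:
        "result/{dataset}.txt"
    output:
        # deleted automatically once no remaining job needs it
        temp("result/{dataset}.sorted.txt")
    shell:
        "sort {input} > {output}"

rule index:
    input:
        "result/{dataset}.sorted.txt"
    output:
        # write-protected after creation to guard against accidental modification
        protected("result/{dataset}.sorted.idx")
    shell:
        "myindexer {input} > {output}"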
Reproducible software installation
Full reproducibility:
install required software and all dependencies in exact versions
Software installation is heterogeneous
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
easy_install snakemake
./configure --prefix=/usr/local
make
make install
cp lib/amd64/jli/*.so lib
cp lib/amd64/*.so lib
cp * $PREFIX
cpan -i bioperl
cmake ../../my_project \
-DCMAKE_MODULE_PATH=~/devel/seqan/util/cmake \
-DSEQAN_INCLUDE_PATH=~/devel/seqan/include
make
make install
apt-get install bwa
yum install python-h5py
install.packages("matrixpls")
Conda package manager
A recipe builds a package from source or binary code:
package:
name: seqtk
version: 1.2
source:
fn: v1.2.tar.gz
url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz
requirements:
build:
- gcc
- zlib
run:
- zlib
about:
home: https://github.com/lh3/seqtk
license: MIT License
summary: Seqtk is a fast and lightweight tool for processing sequences
test:
commands:
- seqtk seq
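A recipe like this is usually accompanied by a small build script; a hedged sketch of what a build.sh for seqtk might look like ($PREFIX is provided by conda-build, the exact steps are assumptions):

# build.sh (illustrative sketch)
make
mkdir -p $PREFIX/bin
cp seqtk $PREFIX/bin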
Normalization of installation routines via recipes:
Easy installation and management:
conda install --channel bioconda bwa=0.7.15
conda update bwa
conda remove bwa
Isolated environments:
channels:
- r
- bioconda
dependencies:
- picard ==2.3.0
- samtools ==1.3.0
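Such an environment file can be instantiated and activated on the command line; this assumes the fragment above is saved as environment.yaml, and the environment name is made up:

# create an isolated environment from the file above
conda env create --name myenv --file environment.yaml
# activate it (Conda <= 4.3 syntax)
source activate myenv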
Bioconda
Already over 2,000 bioinformatics-related Conda packages
(C, C++, Python, R, Perl, ...)
Over 130 contributors
Partner project for general-purpose software: conda-forge
Integration with Snakemake
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
conda:
"envs/mycommand.yaml"
shell:
"mycommand {input} > {output}"
channels:
- r
- bioconda
dependencies:
- mycommand=2.3.1
# automatic deployment of dependencies
snakemake --use-conda
Conda Integration with WMS
Integrated with 3 popular workflow management systems
Sustainable publishing
# archive workflow (including Conda packages)
snakemake --archive myworkflow.tar.gz
Author:
- Upload to Zenodo and acquire DOI
- Cite DOI
Reviewer/reader:
- Download and unpack workflow archive
# execute workflow (Conda packages are deployed automatically)
snakemake --use-conda --cores 16
Conclusion
Snakemake ensures reproducibility and scalability via
- formalization
- documentation
- parallelization
of data analyses.
Bioconda is a distribution of bioinformatics software that
- standardizes
- simplifies
- automates
the installation.
Combined, they enable fully reproducible data analysis.
Resources
https://snakemake.bitbucket.org
https://bioconda.github.io
Köster, Johannes and Rahmann, Sven. "Snakemake - A scalable bioinformatics workflow engine". Bioinformatics 2012.
Köster, Johannes. "Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis". PhD thesis, TU Dortmund 2014.