Reproducible and scalable data analysis with Snakemake and Bioconda
Johannes Köster
2016
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2480824/CWI_Logo.png)
Data analysis
[Diagram: many datasets → data analysis → results]
reproducibility
From raw data to final figures:
- document parameters, tools, versions
- execute without manual intervention
scalability
Handle parallelization:
execute for tens to thousands of datasets
Avoid redundancy:
- when adding datasets
- when resuming from failures
scalability
reproducibility
Workflow management:
formalize, document and execute data analyses
Snakemake
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2765493/snakemake-paper.png)
Large, constantly growing community
Reproducibility with Snakemake
Genome of the Netherlands:
- GoNL consortium. Nature Genetics 2014.
Cancer:
- Townsend et al. Cancer Cell 2016.
- Schramm et al. Nature Genetics 2015.
- Martin et al. Nature Genetics 2013.
Ebola:
- Park et al. Cell 2015.
iPSC:
- Burrows et al. PLOS Genetics 2016.
Computational methods:
- Ziller et al. Nature Methods 2015.
- Schmied et al. Bioinformatics 2015.
- Břinda et al. Bioinformatics 2015.
- Chang et al. Molecular Cell 2014.
- Marschall et al. Bioinformatics 2012.
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646163/nature_genetics.gif)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646162/cancer_cell.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646179/nature_methods.gif)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646177/genome_biology.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646170/bioinformatics.gif)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646184/cell.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646193/molecular_cell.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646200/plos_genetics.png)
Define workflows
in terms of rules
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "results/{dataset}.txt"
    script:
        "scripts/myscript.R"

rule myfiltration:
    input:
        "results/{dataset}.txt"
    output:
        "results/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"

rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"
Define workflows
in terms of rules
rule mytask:
    input:
        "path/to/dataset.txt"
    output:
        "results/dataset.txt"
    shell:
        "mycommand {input} > {output}"

- mytask is the rule name
- the shell command defines how to create the output from the input
- {input} and {output} refer to the input and output files from within the shell command
Define workflows
in terms of rules
generalize rules with
named wildcards
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "results/{dataset}.txt"
    shell:
        "mycommand {input} > {output}"
Define workflows
in terms of rules
refer to Python script
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "results/{dataset}.txt"
    script:
        "scripts/myscript.py"
Define workflows
in terms of rules
refer to R script
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "results/{dataset}.txt"
    script:
        "scripts/myscript.R"
Define workflows
in terms of rules
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "results/{dataset}.txt"
    script:
        "scripts/myscript.R"

rule aggregate:
    input:
        "results/dataset1.txt",
        "results/dataset2.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Instantiated jobs, with the wildcard {dataset} resolved per dataset:

rule mytask:
    input:
        "path/to/dataset1.txt"
    output:
        "results/dataset1.txt"
    script:
        "scripts/myscript.R"

rule mytask:
    input:
        "path/to/dataset2.txt"
    output:
        "results/dataset2.txt"
    script:
        "scripts/myscript.R"
Dependencies are determined automatically
Directed acyclic graph (DAG) of jobs
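The inference idea can be illustrated with a minimal Python sketch (a hypothetical simplification, not Snakemake's actual implementation; rule names and paths are taken from the example rules above): match a requested file against rule output patterns, then fill the matched wildcard values into the rule's input pattern.

```python
import re

# Hypothetical sketch of wildcard-based dependency inference.
# Rules are listed most-specific first to keep the matching simple.
rules = {
    "myfiltration": {"input": "results/{dataset}.txt",
                     "output": "results/{dataset}.filtered.txt"},
    "mytask": {"input": "path/to/{dataset}.txt",
               "output": "results/{dataset}.txt"},
}

def find_producer(target):
    for name, rule in rules.items():
        # turn "{dataset}" into a named regex group (simplified matching)
        pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", rule["output"]) + "$"
        m = re.match(pattern, target)
        if m:
            # instantiate the rule's input with the matched wildcard values
            return name, rule["input"].format(**m.groupdict())
    return None  # no producing rule: the file must already exist on disk

print(find_producer("results/dataset1.filtered.txt"))  # → ('myfiltration', 'results/dataset1.txt')
print(find_producer("results/dataset1.txt"))           # → ('mytask', 'path/to/dataset1.txt')
```

Applying `find_producer` recursively until only on-disk files remain yields the DAG of jobs; real Snakemake additionally resolves ambiguous matches via rule order and wildcard constraints.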
Command line interface
Assumption: workflow defined in a Snakefile in the same directory.
# execute the workflow with target plots/myplot.pdf
snakemake plots/myplot.pdf
# execute the workflow without target: first rule defines target
snakemake
# dry-run
snakemake -n
# dry-run, print shell commands
snakemake -n -p
# dry-run, print execution reason for each job
snakemake -n -r
# visualize the DAG of jobs using the Graphviz dot command
snakemake --dag | dot -Tsvg > dag.svg
Parallelization
Disjoint paths in the DAG of jobs can be executed in parallel.
# execute the workflow with 8 cores
snakemake --cores 8
execute 8 jobs in parallel?
Parallelization
schedule according to given resources
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "results/{dataset}.txt"
    threads: 4
    resources:
        mem_gb=2
    shell:
        "mycommand {input} > {output}"
Command line interface
Assumption: workflow defined in a Snakefile in the same directory.
# execute the workflow with 8 cores
snakemake --cores 8
# execute the workflow with 8 cores and 3 GB of memory
snakemake --cores 8 --resources mem_gb=3
With --cores 8 alone, two 4-thread jobs can execute in parallel;
restricted to mem_gb=3, only one 2-GB job can execute at a time.
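The arithmetic behind these limits can be sketched in a few lines of Python (`max_parallel_jobs` is a hypothetical helper for illustration, not part of Snakemake):

```python
# Sketch: how many instances of a rule declaring `threads: 4` and
# `resources: mem_gb=2` fit within the given limits.
def max_parallel_jobs(threads_per_job, mem_per_job, cores, mem_gb=None):
    by_cores = cores // threads_per_job          # limited by CPU cores
    if mem_gb is None:
        return by_cores
    return min(by_cores, mem_gb // mem_per_job)  # also limited by memory

print(max_parallel_jobs(4, 2, cores=8))            # → 2
print(max_parallel_jobs(4, 2, cores=8, mem_gb=3))  # → 1
```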
Scheduling
Job selection is a knapsack-style optimization: among the available jobs, choose a subset that maximizes priority, number of descendants, and input size, subject to the constraint that the selected jobs' combined resource usage does not exceed the free resources (e.g. CPU cores).
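This selection can be sketched as a greedy heuristic in Python (a hypothetical illustration of the idea only; Snakemake's actual scheduler differs):

```python
# Greedy sketch: take available jobs in order of (priority, descendants,
# input size), as long as their thread usage fits into the free cores.
def schedule(jobs, free_cores):
    selected, remaining = [], free_cores
    ranked = sorted(
        jobs,
        key=lambda j: (j["priority"], j["descendants"], j["input_size"]),
        reverse=True,
    )
    for job in ranked:
        if job["threads"] <= remaining:
            selected.append(job["name"])
            remaining -= job["threads"]
    return selected

jobs = [
    {"name": "map_a", "priority": 1, "descendants": 3, "input_size": 10, "threads": 4},
    {"name": "map_b", "priority": 1, "descendants": 3, "input_size": 8,  "threads": 4},
    {"name": "plot",  "priority": 0, "descendants": 0, "input_size": 1,  "threads": 1},
]
print(schedule(jobs, free_cores=8))  # → ['map_a', 'map_b']
```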
Many additional features
- scaling from workstation to cluster without workflow modification
- modularization
- handling of temporary and protected files
- record logging information
- HTML reports
- tracking of tool versions and code changes
Reproducible software installation
dataset
results
dataset
dataset
dataset
dataset
dataset
Full reproducibility:
install required software and all dependencies in exact versions
Software installation is heterogeneous
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
easy_install snakemake
./configure --prefix=/usr/local
make
make install
cp lib/amd64/jli/*.so lib
cp lib/amd64/*.so lib
cp * $PREFIX
cpan -i bioperl
cmake ../../my_project \
-DCMAKE_MODULE_PATH=~/devel/seqan/util/cmake \
-DSEQAN_INCLUDE_PATH=~/devel/seqan/include
make
make install
apt-get install bwa
yum install python-h5py
install.packages("matrixpls")
Bioconda: standardize bioinformatics software distribution.
Started last fall:
>100 contributors
>1500 packages
A recipe defines how to build a (source or binary) package:

package:
  name: seqtk
  version: 1.2
source:
  fn: v1.2.tar.gz
  url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz
requirements:
  build:
    - gcc
    - zlib
  run:
    - zlib
about:
  home: https://github.com/lh3/seqtk
  license: MIT License
  summary: Seqtk is a fast and lightweight tool for processing sequences
test:
  commands:
    - seqtk seq
Easy installation and management:
conda install --channel bioconda bwa=0.7.15
conda update bwa
conda remove bwa
Isolated environments:
channels:
  - r
  - bioconda
dependencies:
  - picard ==2.3.0
  - samtools ==1.3.0
Based on the conda package manager
Integration with Snakemake

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "results/{dataset}.txt"
    conda:
        "envs/mycommand.yaml"
    shell:
        "mycommand {input} > {output}"

envs/mycommand.yaml:

channels:
  - r
  - bioconda
dependencies:
  - mycommand ==2.3.1

Environments are created and activated per job when Snakemake is invoked with --use-conda.
Conclusion
Snakemake ensures reproducibility and scalability via
- formalization
- documentation
- parallelization
of data analyses.
Bioconda is a distribution of bioinformatics software that
- standardizes
- simplifies
- automates
the installation.
Combined, they enable fully reproducible data analysis.
Snakemake and Bioconda
By Johannes Köster
Non-CS introduction to Snakemake and Bioconda at Leeselab.