Reproducible and scalable data analysis with Snakemake

Johannes Köster

2017

Data analysis

dataset

results

reproducibility

From raw data to final figures:

document parameters, tools, versions
execute without manual intervention

scalability

Handle parallelization:

execute for tens to thousands of datasets

Avoid redundancy:

when adding datasets
when resuming from failures

Workflow management:

formalize, document and execute data analyses

Large, constantly growing community

Reproducibility with Snakemake

Genome of the Netherlands:

GoNL consortium. Nature Genetics 2014.

Cancer:

Townsend et al. Cancer Cell 2016.

Schramm et al. Nature Genetics 2015.

Martin et al. Nature Genetics 2013.

Ebola:

Park et al. Cell 2015.

iPSC:

Burrows et al. PLOS Genetics 2016.

Computational methods:

Ziller et al. Nature Methods 2015.

Schmied et al. Bioinformatics 2015.

Břinda et al. Bioinformatics 2015.

Chang et al. Molecular Cell 2014.

Marschall et al. Bioinformatics 2012.

Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/dataset.txt"
    output:
        "result/dataset.txt"
    shell:
        "mycommand {input} > {output}"

rule mytask: rule name
shell: how to create output from input
{input}, {output}: refer to input and output from shell command

Define workflows

in terms of rules

generalize rules with named wildcards

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    shell:
        "mycommand {input} > {output}"
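To make the idea of named wildcards concrete, the following is a minimal Python sketch of how a requested file could be matched against an output pattern and how the resulting wildcard value could then be filled into the input pattern. It is not Snakemake's implementation: the helper name match_output and the file result/D1.txt are assumptions made for illustration, and the sketch assumes each wildcard name occurs only once per pattern.

```python
import re


def match_output(pattern, target):
    """Return a dict of wildcard values if target matches pattern, else None.

    Hypothetical helper for illustration; every wildcard matches ".+".
    """
    # re.split with one capture group alternates literal text and wildcard names
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    found = re.fullmatch(regex, target)
    return None if found is None else found.groupdict()


# Requesting result/D1.txt binds the wildcard "dataset" to "D1" ...
wildcards = match_output("result/{dataset}.txt", "result/D1.txt")
# ... and the same value is substituted into the input pattern.
input_file = "path/to/{dataset}.txt".format(**wildcards)

# A file that does not fit the pattern is not produced by this rule.
no_match = match_output("result/{dataset}.txt", "plots/myplot.pdf")
```

Here wildcards holds the value D1 for dataset, so the rule would read path/to/D1.txt to create result/D1.txt.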

Define workflows

in terms of rules

refer to Python script

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.py"

Define workflows

in terms of rules

refer to R script

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"

Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"

rule aggregate:
    input:
        "result/dataset1.txt",
        "result/dataset2.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

rule mytask:
    input:
        "path/to/dataset2.txt"
    output:
        "result/dataset2.txt"
    script:
        "scripts/myscript.R"

rule aggregate:
    input:
        "result/dataset1.txt",
        "result/dataset2.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

rule mytask:
    input:
        "path/to/dataset1.txt"
    output:
        "result/dataset1.txt"
    script:
        "scripts/myscript.R"

Dependencies are determined automatically

Directed acyclic graph (DAG) of jobs
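The automatic determination of dependencies can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Snakemake's code: RULES is a hypothetical table that mirrors the mytask and aggregate rules, match and resolve are made-up helper names, and files that no rule can produce are assumed to be existing source files.

```python
import re

# Hypothetical rule table mirroring the rules above (paths are illustrative).
RULES = {
    "mytask": {
        "input": ["path/to/{dataset}.txt"],
        "output": "result/{dataset}.txt",
    },
    "aggregate": {
        "input": ["result/dataset1.txt", "result/dataset2.txt"],
        "output": "plots/myplot.pdf",
    },
}


def match(pattern, target):
    """Return wildcard values if target matches the output pattern, else None."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    found = re.fullmatch(regex, target)
    return None if found is None else found.groupdict()


def resolve(target, jobs, edges):
    """Find the job creating target and, recursively, the jobs it depends on."""
    for name, rule in RULES.items():
        wildcards = match(rule["output"], target)
        if wildcards is None:
            continue
        job = (name, target)
        jobs.add(job)
        for pattern in rule["input"]:
            dependency = resolve(pattern.format(**wildcards), jobs, edges)
            if dependency is not None:
                edges.add((dependency, job))  # dependency must run before job
        return job
    return None  # no rule creates this file: assumed to be a source file


jobs, edges = set(), set()
target_job = resolve("plots/myplot.pdf", jobs, edges)
```

Requesting plots/myplot.pdf yields three jobs: one aggregate job and two mytask jobs (for dataset1 and dataset2), with an edge from each mytask job to the aggregate job.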

Command line interface

# execute the workflow with target D1.sorted.txt
snakemake D1.sorted.txt

# execute the workflow without target: first rule defines target
snakemake

# dry-run
snakemake -n

# dry-run, print shell commands
snakemake -n -p

# dry-run, print execution reason for each job
snakemake -n -r

# visualize the DAG of jobs using the Graphviz dot command
snakemake --dag | dot -Tsvg > dag.svg

Parallelization

Disjoint paths in the DAG of jobs can be executed in parallel.

# execute the workflow with 8 cores
snakemake --cores 8

execute 8 jobs in parallel?

Parallelization

schedule according to given resources

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    threads: 4
    resources:
        mem_gb=2
    shell:
        "mycommand {input} > {output}"

Command line interface

# execute the workflow with 8 cores:
# can execute 2 jobs of mytask in parallel (4 threads each)
snakemake --cores 8

# execute the workflow with 8 cores and 3 GB of memory:
# can execute only 1 job of mytask at a time (2 GB each)
snakemake --cores 8 --resources mem_gb=3
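The arithmetic behind these two cases can be checked with a short Python sketch. It assumes the rule mytask declared earlier (threads: 4, mem_gb=2) and that a resource without a limit on the command line does not constrain scheduling; the function name max_parallel_jobs is hypothetical.

```python
def max_parallel_jobs(job_usage, available):
    """Number of identical jobs that fit into the available resources.

    Resources without a given limit are treated as unconstrained
    (an assumption of this sketch).
    """
    return min(
        available[name] // amount
        for name, amount in job_usage.items()
        if name in available
    )


# Per-job usage of the rule mytask: 4 threads and 2 GB of memory.
mytask = {"cores": 4, "mem_gb": 2}

only_cores = max_parallel_jobs(mytask, {"cores": 8})
cores_and_memory = max_parallel_jobs(mytask, {"cores": 8, "mem_gb": 3})
```

With 8 cores alone, 8 // 4 = 2 jobs fit; adding the 3 GB memory limit gives min(8 // 4, 3 // 2) = 1 job.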

Scheduling

\max_{E \subseteq J} \sum_{j \in E} (p_j, d_j, i_j)^T

s.t.

\sum_{j \in E} r_{ij} \leq R_i \text{ for } i = 1, 2, \dots, n

J: available jobs
p_j: priority of job j
d_j: descendants of job j
i_j: input size of job j
r_{ij}: usage of resource i by job j
R_i: free resource i (e.g. CPU cores)
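For a handful of jobs, the optimization problem above can be solved by enumerating all subsets. The Python sketch below does exactly that and is meant only to make the formulation tangible: it is a brute-force illustration, not the scheduler Snakemake uses, it assumes that the summed vectors are compared lexicographically and that d_j is a count, and the job names and numbers are made up.

```python
from itertools import combinations


def select_jobs(jobs, free):
    """Pick the feasible subset E with the largest summed (p, d, i) vector.

    jobs: dict name -> {"score": (p, d, i), "usage": {resource: amount}}
    free: dict resource -> free amount R_i
    Vectors are compared lexicographically (assumption of this sketch).
    """
    best_score, best_subset = (0, 0, 0), ()
    names = sorted(jobs)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            # constraint: total usage of each resource must not exceed R_i
            feasible = all(
                sum(jobs[j]["usage"].get(res, 0) for j in subset) <= amount
                for res, amount in free.items()
            )
            if not feasible:
                continue
            score = tuple(
                sum(jobs[j]["score"][k] for j in subset) for k in range(3)
            )
            if score > best_score:  # Python tuples compare lexicographically
                best_score, best_subset = score, subset
    return set(best_subset), best_score


# Made-up example: three available jobs, 8 free cores and 3 GB free memory.
available_jobs = {
    "a": {"score": (1, 2, 100), "usage": {"cores": 4, "mem_gb": 2}},
    "b": {"score": (1, 0, 50), "usage": {"cores": 4, "mem_gb": 2}},
    "c": {"score": (0, 5, 10), "usage": {"cores": 2, "mem_gb": 1}},
}
selected, score = select_jobs(available_jobs, {"cores": 8, "mem_gb": 3})
```

Jobs a and b together would exceed the memory limit, so the best feasible choice is a together with c. Enumeration grows exponentially with the number of jobs and is only practical for tiny examples like this one.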

Many additional features

scaling from workstation to cluster without workflow modification
modularization
handling of temporary and protected files
record logging information
HTML reports
tracking of tool versions and code changes
remote file support (S3/Google, Dropbox, HTTPS, FTP, ...)

Reproducible software installation

Full reproducibility:

install required software and all dependencies in exact versions

Software installation is heterogeneous

source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")

easy_install snakemake

./configure --prefix=/usr/local
make
make install

cp lib/amd64/jli/*.so lib
cp lib/amd64/*.so lib
cp * $PREFIX

cpan -i bioperl

cmake ../../my_project \
    -DCMAKE_MODULE_PATH=~/devel/seqan/util/cmake \
    -DSEQAN_INCLUDE_PATH=~/devel/seqan/include
make
make install

apt-get install bwa

yum install python-h5py

install.packages("matrixpls")

Conda package manager

Normalization of installation routines via recipes:

source or binary

package:
  name: seqtk
  version: 1.2

source:
  fn: v1.2.tar.gz
  url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz

requirements:
  build:
    - gcc
    - zlib
  run:
    - zlib

about:
  home: https://github.com/lh3/seqtk
  license: MIT License
  summary: Seqtk is a fast and lightweight tool for processing sequences

test:
  commands:
    - seqtk seq

package

Easy installation and management:

conda install --channel bioconda bwa=0.7.15

conda update bwa

conda remove bwa

Isolated environments:

channels:
  - r
  - bioconda
dependencies:
  - picard ==2.3.0
  - samtools ==1.3.0

Conda package manager

Already over 2000 bioinformatics-related Conda packages

(C, C++, Python, R, Perl, ...)

Over 130 contributors

Partner project for general-purpose software: conda-forge

Integration with Snakemake

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/mycommand.yaml"
    shell:
        "mycommand {input} > {output}"

channels:
  - r
  - bioconda
dependencies:
  - mycommand=2.3.1

# automatic deployment of dependencies
snakemake --use-conda

Conda integration with workflow management systems (WMS)

Integrated with 3 popular workflow management systems

Sustainable publishing

Author:

# archive workflow (including Conda packages)
snakemake --archive myworkflow.tar.gz

Upload to Zenodo and acquire DOI
Cite DOI

Reviewer/reader:

Download and unpack workflow archive

# execute workflow (Conda packages are deployed automatically)
snakemake --use-conda --cores 16

Conclusion

Snakemake ensures reproducibility and scalability via

formalization
documentation
parallelization

of data analyses.

Bioconda is a distribution of bioinformatics software that

standardizes
simplifies
automates

the installation.

Combined, they enable fully reproducible data analysis.

Resources

https://snakemake.bitbucket.org

https://bioconda.github.io

Köster, Johannes and Rahmann, Sven. "Snakemake - A scalable bioinformatics workflow engine". Bioinformatics 2012.

Köster, Johannes. "Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis". PhD thesis, TU Dortmund 2014.