Snakemake: Reproducible and scalable data analysis

Alexander Schönhuth

2016

Snakemake by Johannes Köster

Data analysis

[Diagram: many datasets are processed into results]
reproducibility

From raw data to final figures:

  • document parameters, tools, versions
  • execute without manual intervention

scalability

Handle parallelization:

execute for tens to thousands of datasets

Avoid redundancy:

  • when adding datasets
  • when resuming from failures

Data analysis needs both scalability and reproducibility.

Workflow management:

formalize, document and execute data analyses

Snakemake

Large and constantly growing community

Reproducibility with Snakemake

Genome of the Netherlands:

GoNL consortium. Nature Genetics 2014.

 

Cancer:

Townsend et al. Cancer Cell 2016.

Schramm et al. Nature Genetics 2015.

Martin et al. Nature Genetics 2013.

 

Ebola:

Park et al. Cell 2015.

 

iPSC:

Burrows et al. PLOS Genetics 2016.

 

Computational methods:

Ziller et al. Nature Methods 2015.

Schmied et al. Bioinformatics 2015.

Břinda et al. Bioinformatics 2015.

Chang et al. Molecular Cell 2014.

Marschall et al. Bioinformatics 2012.


Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "results/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
    input:
        "results/{dataset}.txt"
    output:
        "results/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"


Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/dataset.txt"
    output:
        "result/dataset.txt"
    shell:
        "mycommand {input} > {output}"

rule name

refer to input and output from shell command

how to create output from input

Define workflows

in terms of rules

generalize rules with

named wildcards

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    shell:
        "mycommand {input} > {output}"

Define workflows

in terms of rules

refer to Python script

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.py"

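Inside such a script, Snakemake provides a snakemake object that carries the rule's input, output, wildcards, params and threads, so no command-line parsing is needed. A minimal sketch of what scripts/myscript.py could contain (the per-line processing is a hypothetical placeholder):

# scripts/myscript.py
# the global `snakemake` object is injected by Snakemake
with open(snakemake.input[0]) as infile, \
     open(snakemake.output[0], "w") as outfile:
    for line in infile:
        # hypothetical processing step
        outfile.write(line.upper())

In R scripts the same information is available through an S4 object, e.g. snakemake@input[[1]] and snakemake@output[[1]].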
Define workflows

in terms of rules

refer to R script

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"

Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "results/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule aggregate:
    input:
        "results/dataset1.txt",
        "results/dataset2.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

From these rules, Snakemake instantiates one job per concrete output file:
mytask for dataset1, mytask for dataset2, and aggregate for the final plot.

Dependencies are determined automatically

Directed acyclic graph (DAG) of jobs

Bio Example

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/A.fastq"
    output:
        "mapped_reads/A.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
    
  • Rule has name: bwa_map
  • Directives: input, output, shell
  • Referring to rule elements: {input}, {output}
snakemake mapped_reads/A.bam

# generates target file mapped_reads/A.bam 
# using rules in ./Snakefile

snakemake -np mapped_reads/A.bam 

# -n is for dry run; -p for printing shell commands

Snakefile

Wildcards

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
    
  • Wildcards (here: {sample}) are specified in the output
  • All output files have to contain the same wildcards
  • The snakemake command line is parsed and wildcard values are resolved
snakemake mapped_reads/A.bam mapped_reads/B.bam

# Values A and B will be propagated to {sample}
# generates target files 
# mapped_reads/A.bam and mapped_reads/B.bam 

snakemake -np mapped_reads/A.bam mapped_reads/B.bam

# -n is for dry run; -p for printing shell commands

Snakefile

Dependencies

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/A.fastq"
    output:
        "mapped_reads/A.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
    
snakemake sorted_reads/A.bam

# generates target file sorted_reads/A.bam 

# Snakemake determines via the dependency graph which rules have already been applied

# If sorted_reads/A.bam has already been generated: nothing to be done
# If mapped_reads/A.bam has already been generated: only rule samtools_sort is applied
# If only the original data exists: both bwa_map and samtools_sort are applied,
# in the order given by their dependencies

Snakefile

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"
    

Define workflows

in terms of rules

rule sort_and_annotate:
    input:
        "path/to/{dataset}.txt",
        "path/to/annotation.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "paste <(sort {input[0]}) {input[1]} > {output}"

multiple input or output files

refer by index

Define workflows

in terms of rules

rule sort_and_annotate:
    input:
        a="path/to/{dataset}.txt",
        b="path/to/annotation.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "paste <(sort {input.a}) {input.b} > {output}"

name input and output files

refer by name

Define workflows

in terms of rules

rule sort:
    input:
        a="path/to/{dataset}.txt"
    output:
        b="{dataset}.sorted.txt"
    run:
        with open(output.b, "w") as out:
            for line in sorted(open(input.a)):
                out.write(line)

use Python in rules

Define workflows

in terms of rules

rule sort:
    input:
        a="path/to/{dataset}.txt"
    output:
        b="{dataset}.sorted.txt"
    script:
        "scripts/myscript.R"

refer to Python or R scripts

(available since Snakemake 3.5)

Dependencies are determined top-down

  • For a given target, a rule that can be applied to create it is determined (a job).
  • For the input files of the rule, go on recursively.
  • If no target is specified, Snakemake tries to apply the first rule in the workflow.

Dependencies are determined top-down

DATASETS = ["D1", "D2", "D3"]


rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=DATASETS)


rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Job 1: apply rule all

(a target rule that just collects results)

Job i: apply rule sort to create i-th input of job 1

use arbitrary Python code in your workflow
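Because a Snakefile is evaluated as Python, the DATASETS list above does not have to be hard-coded. A sketch using Snakemake's glob_wildcards helper, assuming the input files live under path/to/:

# infer the dataset names from the files that are actually present
DATASETS = glob_wildcards("path/to/{dataset}.txt").dataset


rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=DATASETS)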

Directed acyclic graph (DAG) of jobs

Job execution

A job is executed if and only if at least one of the following holds:

  • the output file is a target and does not exist
  • the output file is needed by another executed job and does not exist
  • an input file is newer than the output file
  • an input file will be updated by another job
  • execution is enforced (see the flags below)

This is determined via breadth-first search on the DAG of jobs.
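Enforcing execution is done from the command line, for example:

# force re-creation of the given target, even if it is up to date
snakemake --force D1.sorted.txt

# force re-execution of all jobs of a given rule (downstream jobs follow)
snakemake --forcerun sort

# force re-execution of the complete workflow
snakemake --forceall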

Command line interface

Assumption: workflow defined in a Snakefile in the same directory.

# execute the workflow with target D1.sorted.txt
snakemake D1.sorted.txt

# execute the workflow without target: first rule defines target
snakemake

# dry-run
snakemake -n

# dry-run, print shell commands
snakemake -n -p

# dry-run, print execution reason for each job
snakemake -n -r

# visualize the DAG of jobs using the Graphviz dot command
snakemake --dag | dot -Tsvg > dag.svg

Part 2

Advanced

Parallelization

Disjoint paths in the DAG of jobs can be executed in parallel.

# execute the workflow with 8 cores
snakemake --cores 8

execute 8 jobs in parallel?

Defining resources

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output}"

refer to defined thread number

define arbitrary additional resources

define used threads

Command line interface

Assumption: workflow defined in a Snakefile in the same directory.

# execute the workflow with 8 cores
snakemake --cores 8

# prioritize the creation of a certain file
snakemake --prioritize D1.sorted.txt --cores 8

# execute the workflow with 8 cores and 100MB memory
snakemake --cores 8 --resources mem_mb=100

with --cores 8 and threads: 4, two sort jobs can run in parallel

with --resources mem_mb=100 and mem_mb=100 per job, only one sort job can run at a time

Scheduling

Available jobs are scheduled to

  • maximize parallelization
  • prefer high priority jobs
  • while satisfying resource constraints.

Scheduling

\max_{E \subseteq J} \sum_{j \in E} (p_j, d_j, i_j)^T
\quad \text{s.t.} \quad \sum_{j \in E} r_{ij} \leq R_i \quad \text{for } i = 1, \dots, n

where

  • J: available jobs
  • p_j: priority
  • d_j: descendants
  • i_j: input size
  • r_{ij}: resource usage
  • R_i: free resource (e.g. CPU cores)

Config files

configfile: "config.yaml"


rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=config["datasets"])


rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output}"

define config file

refer to config values
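The contents of config.yaml are not shown here; a minimal example that matches the expand() call above (and the per-dataset paths used on the following slides) could look like this:

# config.yaml (hypothetical example)
datasets:
  D1: path/to/D1.txt
  D2: path/to/D2.txt
  D3: path/to/D3.txt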

Input functions

Workflows are executed in three phases

  • initialization phase (parsing)
  • DAG phase (DAG is built)
  • scheduling phase (execution of DAG)

Input functions defer determination of input files to the DAG phase

(when wildcard values are known).
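Instead of the lambda shown on the next slide, the input function can also be a named Python function, which makes the deferral explicit. A sketch, assuming the config.yaml layout from above:

def sort_input(wildcards):
    # called during the DAG phase, once the {dataset} wildcard is known
    return config["datasets"][wildcards.dataset]


rule sort:
    input:
        sort_input
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"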

Logging

configfile: "config.yaml"


rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=config["datasets"])


rule sort:
    input:
        lambda wildcards: config["datasets"][wildcards.dataset]
    output:
        "{dataset}.sorted.txt"
    log:
        "logs/sort/{dataset}.log"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output} 2> {log}"

define log file

refer to log file from shell command

Cluster execution

# execute the workflow on cluster with qsub submission command
# (and up to 100 parallel jobs)
snakemake --cluster qsub --jobs 100

# tell the cluster system about the used threads
snakemake --cluster "qsub -pe threaded {threads}" --jobs 100

# execute the workflow with synchronized qsub
snakemake --cluster-sync "qsub -sync yes" --jobs 100

# execute the workflow with DRMAA
snakemake --drmaa --jobs 100

Many additional features

  • scaling from workstation to cluster without workflow modification
  • modularization
  • handling of temporary and protected files (see the sketch below)
  • record logging information
  • HTML reports
  • tracking of tool versions and code changes
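For instance, temporary and protected files are marked with the temp() and protected() functions; a sketch reusing the mapping rules from the Bio Example above:

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        # deleted automatically once all consuming jobs have finished
        temp("mapped_reads/{sample}.bam")
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"


rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        # write-protected after creation to prevent accidental overwrites
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"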

Part 3

BioConda

Software installation is heterogeneous

source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
easy_install snakemake
./configure --prefix=/usr/local
make
make install
cp lib/amd64/jli/*.so lib
cp lib/amd64/*.so lib
cp * $PREFIX
cpan -i bioperl
cmake ../../my_project \
    -DCMAKE_MODULE_PATH=~/devel/seqan/util/cmake \
    -DSEQAN_INCLUDE_PATH=~/devel/seqan/include
make
make install
apt-get install bwa
yum install python-h5py
install.packages("matrixpls")

Standardize Bioinformatics software distribution.

Started in fall 2015:

>80 contributors

>1000 packages

Standardize Bioinformatics software distribution.

source or binary

package:
  name: seqtk
  version: 1.2

source:
  fn: v1.2.tar.gz
  url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz

requirements:
  build:
    - gcc
    - zlib
  run:
    - zlib

about:
  home: https://github.com/lh3/seqtk
  license: MIT License
  summary: Seqtk is a fast and lightweight tool for processing sequences

test:
  commands:
    - seqtk seq

package
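Recipes like the one above are turned into installable conda packages with conda-build, either locally or by Bioconda's automated build system; for example (the recipe directory name is assumed):

# build the package from its recipe directory
conda build recipes/seqtk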

Reproducible software installation


Full reproducibility:

install required software and all dependencies in exact versions

Easy installation and management:

conda install --channel bioconda bwa=0.7.15

conda update bwa

conda remove bwa

Isolated environments:

channels:
  - r
  - bioconda
dependencies:
  - picard ==2.3.0
  - samtools ==1.3.0
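Such a file can be turned into an isolated environment with conda (the file and environment names here are assumed):

# create an isolated environment from the file above
conda env create -n myenv --file environment.yaml

# activate it
source activate myenv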

Based on the conda package manager

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/mycommand.yaml"
    shell:
        "mycommand {input} > {output}"

Integration with Snakemake

channels:
  - r
  - bioconda
dependencies:
  - mycommand ==2.3.1
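With the conda directive in place, Snakemake builds and activates the listed environment for each job when the workflow is run with the --use-conda flag:

# create the per-rule conda environments and execute jobs inside them
snakemake --use-conda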

Distribution of Snakemake workflows

Solution 1: Git repository with

├── config.yaml
├── environment.yml
├── scripts
│   ├── script1.py
│   └── script2.R
└── Snakefile
# clone workflow into working directory
git clone https://bitbucket.org/user/myworkflow.git path/to/workdir
cd path/to/workdir

# edit config and workflow as needed
vim config.yaml

# install dependencies into isolated environment
conda env create -n myworkflow --file environment.yml

# activate environment
source activate myworkflow

# execute workflow
snakemake -n

Distribution of Snakemake workflows

Solution 2: Python/Conda package

# install workflow with all dependencies into isolated environment
conda create -n myworkflow myworkflow

# activate environment
source activate myworkflow

# copy Snakefile and config file into working directory
myworkflow init path/to/workdir
cd path/to/workdir

# edit config and workflow as needed
vim config.yaml

# execute workflow
snakemake -n

Distribution of Snakemake workflows

Solution 3: Hide workflow in package

# install workflow with all dependencies into isolated environment
conda create -n myworkflow myworkflow

# activate environment
source activate myworkflow

# copy only config file into working directory
myworkflow init path/to/workdir
cd path/to/workdir

# edit config as needed
vim config.yaml

# execute workflow with a wrapper that uses a Snakefile
# hidden in the package and delegates execution to Snakemake
myworkflow run -n
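The myworkflow wrapper itself is not shown; a hypothetical sketch of what its run subcommand could do, using Snakemake's Python API (package layout and names are assumptions):

# myworkflow/cli.py (hypothetical)
import os
import sys

import snakemake


def run(argv):
    # the Snakefile ships inside the installed package,
    # only config.yaml lives in the working directory
    snakefile = os.path.join(os.path.dirname(__file__), "Snakefile")
    ok = snakemake.snakemake(snakefile, dryrun="-n" in argv)
    sys.exit(0 if ok else 1)


if __name__ == "__main__":
    run(sys.argv[1:])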

Conclusion

Snakemake ensures reproducibility and scalability via

  • formalization
  • documentation
  • parallelization

of data analyses.

Bioconda is a distribution of bioinformatics software that

  • standardizes
  • simplifies
  • automates

the installation.

Combined, they enable fully reproducible data analysis.
