Snakemake Tutorial

Johannes Köster

 

2017

Data analysis

dataset

results

Data analysis

dataset

results

dataset

dataset

dataset

dataset

dataset

Data analysis

dataset

results

dataset

dataset

dataset

dataset

dataset

reproducibility

From raw data to final figures:

  • document parameters, tools, versions
  • execute without manual intervention

Data analysis

dataset

results

dataset

dataset

dataset

dataset

dataset

scalability

Handle parallelization:

execute for tens to thousands of datasets

Avoid redundancy:

  • when adding datasets
  • when resuming from failures

Data analysis

dataset

results

dataset

dataset

dataset

dataset

dataset

scalability

reproducibility

Workflow management:

formalize, document and execute data analyses

Snakemake

Large, constantly growing community

Reproducibility with Snakemake

Genome of the Netherlands:

GoNL consortium. Nature Genetics 2014.

 

Cancer:

Townsend et al. Cancer Cell 2016.

Schramm et al. Nature Genetics 2015.

Martin et al. Nature Genetics 2013.

 

Ebola:

Park et al. Cell 2015

 

iPSC:

Burrows et al. PLOS Genetics 2016.

 

Computational methods:

Ziller et al. Nature Methods 2015.

Schmied et al. Bioinformatics 2015.

Břinda et al. Bioinformatics 2015

Chang et al. Molecular Cell 2014.

Marschall et al. Bioinformatics 2012.

Part 1

Basics

dataset

results

dataset

dataset

dataset

dataset

dataset

Define workflows

in terms of rules

Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
     input:
        "result/{dataset}.txt"
     output:
        "result/{dataset}.filtered.txt"
     shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Define workflows

in terms of rules

Define workflows

in terms of rules

rule sort:
    input:
        "path/to/dataset.txt"
    output:
        "dataset.sorted.txt"
    shell:
        "sort {input} > {output}"

rule name

refer to input and output from shell command

how to create output from input

Define workflows

in terms of rules

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

generalize rules with

named wildcards

Define workflows

in terms of rules

rule sort_and_annotate:
    input:
        "path/to/{dataset}.txt",
        "path/to/annotation.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "paste <(sort {input[0]}) {input[1]} > {output}"

multiple input or output files

refer by index

Define workflows

in terms of rules

rule sort_and_annotate:
    input:
        a="path/to/{dataset}.txt",
        b="path/to/annotation.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "paste <(sort {input.a}) {input.b} > {output}"

name input and output files

refer by name

Define workflows

in terms of rules

rule sort:
    input:
        a="path/to/{dataset}.txt"
    output:
        b="{dataset}.sorted.txt"
    run:
        with open(output.b, "w") as out:
            for l in sorted(open(input.a)):
                print(l, file=out)

use Python in rules

Dependencies are determined top-down

Solution:

  • For a given target, a rule that can be applied to create it is determined (a job).
  • For the input files of the rule, go on recursively.
  • If no target is specified, Snakemake tries to apply the first rule in the workflow.

Problem:

for a given set of targets, find a composition of rules to create them

Dependencies are determined top-down




rule all:
    input:
        "D1.sorted.txt",
        "D2.sorted.txt",
        "D3.sorted.txt"


rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Job 1: apply rule all

(a target rule that just collects results)

Job i: apply rule sort to create i-th input of job 1

Dependencies are determined top-down

DATASETS = ["D1", "D2", "D3"]


rule all:
    input:
        ["{dataset}.sorted.txt".format(dataset=dataset)
         for dataset in DATASETS]



rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Job 1: apply rule all

(a target rule that just collects results)

Job i: apply rule sort to create i-th input of job 1

use arbitrary Python code in your workflow

Dependencies are determined top-down

DATASETS = ["D1", "D2", "D3"]


rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=DATASETS)




rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Job 1: apply rule all

(a target rule that just collects results)

Job i: apply rule sort to create i-th input of job 1

use arbitrary Python code in your workflow

Directed acyclic graph (DAG) of jobs

Job execution

A job is executed if and only if

  • output file is target and does not exist
  • output file needed by another executed job and does not exist
  • input file newer than output file
  • input file will be updated by other job
  • execution is enforced

determined via breadth-first-search on DAG of jobs

Command line interface

Assumption: workflow defined in a Snakefile in the same directory.

# execute the workflow with target D1.sorted.txt
snakemake D1.sorted.txt

# execute the workflow without target: first rule defines target
snakemake

# dry-run
snakemake -n

# dry-run, print shell commands
snakemake -n -p

# dry-run, print execution reason for each job
snakemake -n -r

# visualize the DAG of jobs using the Graphviz dot command
snakemake --dag | dot -Tsvg > dag.svg

Part 2

Advanced

Parallelization

Disjoint paths in the DAG of jobs can be executed in parallel.

# execute the workflow with 8 cores
snakemake --cores 8

execute 8 jobs in parallel?

Defining resources

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output}"

refer to defined thread number

define arbitrary additional resources

define used threads

Defining resources

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output}"
# execute the workflow with 8 cores
snakemake --cores 8

can execute 2 sort jobs in parallel

Defining resources

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output}"
# execute the workflow with 2 cores
snakemake --cores 2

can execute 1 sort job in parallel (automatically using 2 threads)

Defining resources

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output}"
# execute the workflow with 10 cores
snakemake --cores 10

can execute 2 sort jobs in parallel

Defining resources

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output}"
# execute the workflow with 10 cores and 100MB memory
snakemake --cores 10 --resources mem_mb=100

can execute 1 sort job in parallel

Scheduling

Available jobs are scheduled to

  • maximize parallelization
  • prefer high priority jobs

while satisfying resource constraints.

Scheduling

\max_{E \subseteq J} \sum_{j \in E}\, (p_j, t_j, d_j, i_j)^T
maxEJjE&ThinSpace;(pj,tj,dj,ij)T\max_{E \subseteq J} \sum_{j \in E}\, (p_j, t_j, d_j, i_j)^T
\sum_{j \in E} r_{ij} \leq R_i \text{ for } i=1,2,...,n
jErijRi for i=1,2,...,n\sum_{j \in E} r_{ij} \leq R_i \text{ for } i=1,2,...,n

s.t.

available jobs

priority

descendants

input size

resource usage

free resource (e.g. CPU cores)

temp input size

Config files

configfile: "config.yaml"


rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=config["datasets"])


rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output}"

define config file

refer to config values

Input functions

Workflows are executed in three phases

  • initialization phase (parsing)
  • DAG phase (DAG is built)
  • scheduling phase (execution of DAG)

Input functions defer determination of input files to the DAG phase

(when wildcard values are known).

Input functions

configfile: "config.yaml"


rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=config["datasets"])


def get_sort_input(wildcards):
    return config["datasets"][wildcards.dataset]


rule sort:
    input:
        get_sort_input
    output:
        "{dataset}.sorted.txt"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output}"

input functions take the determined wildcard values as only argument

Logging

configfile: "config.yaml"


rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=config["datasets"])


def get_sort_input(wildcards):
    return config["datasets"][wildcards.dataset]


rule sort:
    input:
        get_sort_input
    output:
        "{dataset}.sorted.txt"
    log:
        "logs/sort/{dataset}.log"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output} 2> {log}"

define log file

refer to log file from shell command

Cluster execution

# execute the workflow on cluster with qsub submission command
# (and up to 100 parallel jobs)
snakemake --cluster qsub --jobs 100

# tell the cluster system about the used threads
snakemake --cluster "qsub -pe threaded {threads}" --jobs 100

# execute the workflow with synchronized qsub
snakemake --cluster-sync "qsub -sync yes" --jobs 100

# execute the workflow with DRMAA
snakemake --drmaa --jobs 100

Reproducible software installation

dataset

results

dataset

dataset

dataset

dataset

dataset

Full reproducibility:

install required software and all dependencies in exact versions

Software installation is heterogeneous

source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
easy_install snakemake
./configure --prefix=/usr/local
make
make install
cp lib/amd64/jli/*.so lib
cp lib/amd64/*.so lib
cp * $PREFIX
cpan -i bioperl
cmake ../../my_project \
    -DCMAKE_MODULE_PATH=~/devel/seqan/util/cmake \
    -DSEQAN_INCLUDE_PATH=~/devel/seqan/include
make
make install
apt-get install bwa
yum install python-h5py
install.packages("matrixpls")

Package management with

package:
  name: seqtk
  version: 1.2

source:
  fn: v1.2.tar.gz
  url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz

requirements:
  build:
    - gcc
    - zlib
  run:
    - zlib

about:
  home: https://github.com/lh3/seqtk
  license: MIT License
  summary: Seqtk is a fast and lightweight tool for processing sequences

test:
  commands:
    - seqtk seq

Idea:

Normalize installation via recipes

#!/bin/bash

export C_INCLUDE_PATH=${PREFIX}/include
export LIBRARY_PATH=${PREFIX}/lib

make all
mkdir -p $PREFIX/bin
cp seqtk $PREFIX/bin
  • source or binary
  • recipe and build script
  • package

Easy installation and management:

no admin rights needed

conda install pandas

conda update pandas

conda remove pandas

conda env create -f myenv.yaml -n myenv

Isolated environments:

channels:
  - conda-forge
  - defaults
dependencies:
  - pandas ==0.20.3
  - statsmodels ==0.8.0
  - r-dplyr ==0.7.0
  - r-base ==3.4.1
  - python ==3.6.0

Package management with

  • Already over 3000 bioinformatics related packages (C, C++, Python, R, Perl, ...).
  • Over 6 million downloads.

Over 200 contributors

CONDA-FORGE

Partner project for general purpose software:

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/mycommand.yaml"
    shell:
        "mycommand {input} > {output}"

Integration with Snakemake

channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - mycommand ==2.3.1
# automatic deployment of dependencies
snakemake --use-conda

Integrated with popular workflow management systems

Conda Integration with WMS

External scripts

rule sort:
    input:
        a="path/to/{dataset}.txt"
    output:
        b="{dataset}.sorted.txt"
    script:
        "scripts/myscript.R"

refer to Python or R scripts

(in version 3.5)

External scripts

data <- read.table(snakemake@input[["a"]])
data <- data[order(data$id),]
write.table(data, file = snakemake@output[["b"]])

R scripts:

External scripts

import pandas as pd

def sort(infile, outfile):
    data = pd.read_table(infile)
    data = data.sort_values("id")
    data.to_csv(outfile, sep="\t")


if __name__ == "__main__":
    sort(snakemake.input.a, snakemake.output.b)

Python scripts:

Wrappers

rule samtools_sort:
    input:
        "mapped/{sample}.bam"
    output:
        "mapped/{sample}.sorted.bam"
    params:
        "-m 4G"
    threads: 8
    wrapper:
        "0.2.0/bio/samtools/sort"
# automatic deployment of dependencies
snakemake --use-conda

Wrappers

Benchmarking

rule mytool:
    input:
        a="path/to/{dataset}.txt"
    output:
        b="mytool/{dataset}.txt"
    benchmark:
        "benchmarks/mytool/{dataset}.tsv"
    shell:
        "mytool {input} {output}"
# repeat benchmark three times
snakemake --benchmark-repeats 3

Distribution of workflows

Git repository with

├── .gitignore
├── README.md
├── LICENSE.md
├── config.yaml
├── envs
│   ├── env1.yaml
│   └── env2.yaml
├── rules
│   ├── rules1.smk
│   └── rules2.smk
├── scripts
│   ├── script1.py
│   └── script2.R
└── Snakefile
# clone workflow into working directory
git clone https://bitbucket.org/user/myworkflow.git path/to/workdir
cd path/to/workdir

# edit config and workflow as needed
vim config.yaml

# execute workflow
snakemake -n --use-conda

Sustainable publishing

# archive workflow (including Conda packages)
snakemake --archive myworkflow.tar.gz

Author:

  1.  
  2. Upload to Zenodo and acquire DOI
  3. Cite DOI

Reviewer/reader:

  1. Download and unpack workflow archive
  2.  
# execute workflow (Conda packages are deployed automatically)
snakemake --use-conda --cores 16

Many additional features

  • modularization
  • handling of temporary and protected files
  • HTML reports
  • tracking of parameter and code changes
  • remote file support (S3/Google, Dropbox, HTTPS, FTP, ...)
  • container support via Singularity

Acknowledgements

Contributors:

Andreas Wilm

Anthony Underwood

Ryan Dale

David Alexander

Elias Kuthe

Elmar Pruesse

Hyeshik Chang

Jay Hesselberth

Jesper Foldager

John Huddleston

all users and supporters

Joona Lehtomäki

Justin Fear

Karel Brinda

Karl Gutwin

Kemal Eren

Kostis Anagnostopoulos

Kyle A. Beauchamp

Simon Ye

Tobias Marschall

Willem Ligtenberg

Development team:

Christopher Tomkins-Tinch

David Koppstein

Tim Booth

Manuel Holtgrewe

Christian Arnold

Wibowo Arindrarto

Rasmus Ågren

 

Kyle Meyer

Lance Parsons

Manuel Holtgrewe

Marcel Martin

Matthew Shirley

Mattias Franberg

Matt Shirley

Paul Moore

percyfal

Per Unneberg

 

Ryan C. Thompson

Ryan Dale

Sean Davis

Resources

Documentation and change log:

https://snakemake.readthedocs.io

 

Questions:

http://stackoverflow.com/questions/tagged/snakemake

 

Gold standard workflows:

https://github.com/snakemake-workflows/docs

 

Configuration profiles:

https://github.com/snakemake-profiles/docs

https://snakemake.bitbucket.io

Köster, Johannes and Rahmann, Sven. "Snakemake - A scalable bioinformatics workflow engine". Bioinformatics 2012.

Köster, Johannes. "Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis", PhD thesis, TU Dortmund 2014.

Resources