Introduction to Snakemake
Johannes Köster
2015
Why workflow management?
Analyses usually entail the application of various tools, algorithms and scripts.
Workflow management
handles boilerplate:
- parallelization
- suspend/resume
- logging
- data provenance
Snakemake
Snakemake infers dependencies and execution order.
Text-based: Python + domain-specific syntax
Decompose workflow into rules.
Rules define how to obtain output files from input files.
Stats
~1000 downloads per week
Define workflows
in terms of rules
rule sort:
    input:
        "path/to/dataset.txt"
    output:
        "dataset.sorted.txt"
    shell:
        "sort {input} > {output}"
- rule name
- refer to input and output from the shell command
- how to create output from input
Define workflows
in terms of rules
rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"
Generalize rules with named wildcards.
Dependencies are determined automatically
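As a hypothetical illustration (the stats rule and its file names are invented for this sketch), a downstream rule can consume the sorted output; Snakemake matches requested files against rule output patterns to infer the dependency chain, so no order needs to be stated explicitly:

```
rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

# hypothetical downstream rule: its input matches sort's output
# pattern, so Snakemake schedules sort first automatically
rule stats:
    input:
        "{dataset}.sorted.txt"
    output:
        "{dataset}.stats.txt"
    shell:
        "wc -l {input} > {output}"
```

Requesting `dataset.stats.txt` is enough; Snakemake derives that `sort` must run before `stats`.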
Parallelization
Disjoint paths in the DAG of jobs can be executed in parallel.
Rules can be annotated with arbitrary resources that can be used to constrain the scheduling.
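A sketch of such an annotation (the resource name, limits, and file names are illustrative): a rule declares the resources it needs, and the scheduler only starts jobs whose combined declared resources fit the limits given on the command line.

```
rule tabulate:
    input:
        "{dataset}.sorted.txt"
    output:
        "{dataset}.tsv"
    threads: 4
    resources:
        mem_mb=4000  # illustrative per-job memory requirement
    shell:
        "cut -f1 {input} > {output}"

# invoked e.g. with:
#   snakemake --cores 8 --resources mem_mb=16000
# so jobs totalling at most 16 GB of declared mem_mb run concurrently
```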
Many additional features
- scaling from workstation to cluster without workflow modification
- modularization
- handling of temporary and protected files
- HTML5 reports
- rule parameters
- tracking of tool versions and code changes
- per file data provenance information
- a Python API for embedding Snakemake in other tools
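For example, temporary and protected files can be marked directly on rule outputs (a sketch reusing the sort example; the archive rule is invented): `temp()` outputs are deleted once no remaining job needs them, and `protected()` outputs are made write-protected after creation.

```
rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        temp("{dataset}.sorted.txt")  # deleted once all consumers have run
    shell:
        "sort {input} > {output}"

rule archive:
    input:
        "{dataset}.sorted.txt"
    output:
        protected("{dataset}.sorted.txt.gz")  # write-protected final result
    shell:
        "gzip -c {input} > {output}"
```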
What's new?
rule samtools_sort:
    input:
        "mapped/{sample}.bam"
    output:
        "mapped/{sample}.sorted.bam"
    params:
        "-m 4G"
    threads: 8
    wrapper:
        "0.0.8/bio/samtools_sort"
Wrappers
What's coming?
rule bwa:
    input:
        "genome.fasta",
        "{sample}.fastq"
    output:
        "{sample}.bam"
    requirements:
        "bwa ==0.7.12",
        "samtools ==1.1.0"
    shell:
        "bwa mem {input} | "
        "samtools view -Sb - > {output}"
Conda support
What's coming?
DAG partitioning
AbVitro visit 2016