Introduction to Snakemake

Johannes Köster

 

2015

Why workflow management?

Analyses usually entail the application of various tools, algorithms and scripts.

Workflow management handles boilerplate:

  • parallelization
  • suspend/resume
  • logging
  • data provenance

Snakemake

Snakemake infers dependencies and execution order.

Text-based: Python plus a domain-specific syntax.

Decompose workflow into rules.

Rules define how to obtain output files from input files.

Stats

~1000 downloads per week

Define workflows in terms of rules

rule sort:
    input:
        "path/to/dataset.txt"
    output:
        "dataset.sorted.txt"
    shell:
        "sort {input} > {output}"

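Annotations:

  • "sort", after the rule keyword, is the rule name
  • {input} and {output} refer to the input and output files from within the shell command
  • the shell command states how to create the output from the input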
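As a rough illustration of how the {input} and {output} placeholders end up in the command, the substitution can be imitated with Python's built-in str.format. This is only a sketch of the idea, not Snakemake's own implementation; the helper name render_shell is hypothetical.

```python
def render_shell(template, input, output):
    """Fill the {input} and {output} placeholders of a shell template.

    Hypothetical helper for illustration; it only mimics the substitution.
    """
    return template.format(input=input, output=output)


cmd = render_shell(
    "sort {input} > {output}",
    input="path/to/dataset.txt",
    output="dataset.sorted.txt",
)
print(cmd)
```

The printed string is the command a shell would run for the rule shown above.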

Define workflows in terms of rules

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Generalize rules with named wildcards such as {dataset}.
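One way to picture what a named wildcard does: the output pattern is matched against a requested file, and the captured value is then filled into the input pattern. The sketch below assumes that a wildcard matches any non-empty string and that each wildcard name occurs once per pattern; the function names are hypothetical and this is not Snakemake's actual code.

```python
import re


def pattern_to_regex(pattern):
    """Turn '{dataset}.sorted.txt' into a regex with one named group per wildcard."""
    # re.split with a capture group alternates: literal, wildcard name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)      # literal text
        else:
            regex += f"(?P<{part}>.+)"    # wildcard: any non-empty string
    return re.compile(regex)


def match_output(pattern, target):
    """Return the wildcard values if target fits the pattern, else None."""
    m = pattern_to_regex(pattern).fullmatch(target)
    return m.groupdict() if m else None


# Requesting "reads.sorted.txt" fixes the value of the wildcard ...
wildcards = match_output("{dataset}.sorted.txt", "reads.sorted.txt")
# ... which then determines the concrete input file.
input_file = "path/to/{dataset}.txt".format(**wildcards)
print(wildcards, input_file)
```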

Dependencies are determined automatically
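The idea behind automatic dependency resolution can be sketched in a few lines of Python: start from the requested file, find a rule whose output pattern matches it, and recurse into that rule's inputs. The rule table below is hypothetical (the "head" rule does not appear in the slides), wildcards are assumed to match any non-empty string, and files that no rule can produce are treated as existing source files. This is a simplified model, not Snakemake's implementation.

```python
import re

# Hypothetical rule table: name -> (input patterns, output pattern).
RULES = {
    "sort": (["path/to/{dataset}.txt"], "{dataset}.sorted.txt"),
    "head": (["{dataset}.sorted.txt"], "{dataset}.head.txt"),
}


def match(pattern, target):
    """Return wildcard values if target fits pattern, else None."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
        for i, p in enumerate(parts)
    )
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None


def plan(target, jobs=None):
    """Collect (rule, output, inputs) jobs so that dependencies come first."""
    if jobs is None:
        jobs = []
    for name, (inputs, output) in RULES.items():
        wildcards = match(output, target)
        if wildcards is None:
            continue
        concrete = [p.format(**wildcards) for p in inputs]
        for f in concrete:
            plan(f, jobs)                 # resolve inputs before this job
        jobs.append((name, target, concrete))
        break                             # first matching rule wins in this sketch
    return jobs                           # no match: target is a source file


jobs = plan("reads.head.txt")
for job in jobs:
    print(job)
```

Requesting only the final file is enough: the "sort" job is discovered because its output is an input of "head".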

Parallelization

Disjoint paths in the DAG of jobs can be executed in parallel.

Rules can be annotated with arbitrary resources that can be used to constrain the scheduling.
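To make the interplay of the DAG and resource annotations concrete, here is a minimal sketch using Python's standard graphlib module (Python 3.9+). The job names, the memory figures and the batch-wise execution model are assumptions made for illustration; a real scheduler would start new jobs as soon as running ones finish rather than in batches.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: job -> set of jobs it depends on.
deps = {
    "sort_a": set(),
    "sort_b": set(),
    "merge": {"sort_a", "sort_b"},
}
# Hypothetical resource annotation: memory needed per job, in GB.
mem_gb = {"sort_a": 4, "sort_b": 4, "merge": 2}


def schedule(deps, mem_gb, limit):
    """Group jobs into batches that can run in parallel within a memory limit."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    pending = []   # jobs whose dependencies are done but that have not run yet
    batches = []
    while sorter.is_active():
        pending.extend(sorted(sorter.get_ready()))
        batch, used = [], 0
        for job in list(pending):
            if used + mem_gb[job] <= limit:
                batch.append(job)
                used += mem_gb[job]
                pending.remove(job)
        if not batch:
            raise ValueError("a job needs more memory than the limit allows")
        batches.append(batch)
        sorter.done(*batch)   # mark the batch finished, unlocking dependents
    return batches


print(schedule(deps, mem_gb, limit=8))
print(schedule(deps, mem_gb, limit=6))
```

With 8 GB available, the two independent sort jobs share a batch; with 6 GB, the resource constraint forces them to run one after the other even though the DAG would allow them in parallel.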

Many additional features

  • scaling from workstation to cluster without workflow modification
  • modularization
  • handling of temporary and protected files
  • HTML5 reports
  • rule parameters
  • tracking of tool versions and code changes
  • per file data provenance information
  • a Python API for embedding Snakemake in other tools

What's new?

rule samtools_sort:
    input:
        "mapped/{sample}.bam"
    output:
        "mapped/{sample}.sorted.bam"
    params:
        "-m 4G"
    threads: 8
    wrapper:
        "0.0.8/bio/samtools_sort"

Wrappers

What's coming?

rule bwa:
    input:
        "genome.fasta",
        "{sample}.fastq"
    output:
        "{sample}.bam"
    requirements:
        "bwa ==0.7.12",
        "samtools ==1.1.0"
    shell:
        "bwa mem {input} | "
        "samtools view - > {output}"

Conda support

What's coming?

DAG partitioning


AbVitro visit 2016
