Introduction to Snakemake
Johannes Köster
2015
Why workflow management?
Analyses usually entail the application of various tools, algorithms and scripts.
Workflow management
handles boilerplate:
- parallelization
- suspend/resume
- logging
- data provenance
Snakemake
Snakemake infers dependencies and execution order.
Text-based: Python + domain-specific syntax
Decompose workflow into rules.
Rules define how to obtain output files from input files.
Stats
~1000 downloads per week
Define workflows
in terms of rules
rule sort:
    input:
        "path/to/dataset.txt"
    output:
        "dataset.sorted.txt"
    shell:
        "sort {input} > {output}"
- rule name
- refer to input and output from the shell command
- how to create output from input
Define workflows
in terms of rules
rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"
Generalize rules with named wildcards.
Dependencies are determined automatically
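As a hypothetical illustration (the stats rule and its file names are invented for this sketch), a downstream rule can consume the sorted output; Snakemake matches requested files against rule output patterns to infer the dependency chain, so no order needs to be stated explicitly:

```
rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

# hypothetical downstream rule: its input matches sort's output
# pattern, so Snakemake schedules sort first automatically
rule stats:
    input:
        "{dataset}.sorted.txt"
    output:
        "{dataset}.stats.txt"
    shell:
        "wc -l {input} > {output}"
```

Requesting `dataset.stats.txt` is enough; Snakemake derives that `sort` must run before `stats`.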
Parallelization
Disjoint paths in the DAG of jobs can be executed in parallel.
Rules can be annotated with arbitrary resources that can be used to constrain the scheduling.
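A sketch of such an annotation (the resource name, limits, and file names are illustrative): a rule declares the resources it needs, and the scheduler only starts jobs whose combined declared resources fit the limits given on the command line.

```
rule tabulate:
    input:
        "{dataset}.sorted.txt"
    output:
        "{dataset}.tsv"
    threads: 4
    resources:
        mem_mb=4000  # illustrative per-job memory requirement
    shell:
        "cut -f1 {input} > {output}"

# invoked e.g. with:
#   snakemake --cores 8 --resources mem_mb=16000
# so jobs totalling at most 16 GB of declared mem_mb run concurrently
```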
Many additional features
- scaling from workstation to cluster without workflow modification
- modularization
- handling of temporary and protected files
- HTML5 reports
- rule parameters
- tracking of tool versions and code changes
- per file data provenance information
- a Python API for embedding Snakemake in other tools
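For example, temporary and protected files can be marked directly on rule outputs (a sketch reusing the sort example; the archive rule is invented): `temp()` outputs are deleted once no remaining job needs them, and `protected()` outputs are made write-protected after creation.

```
rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        temp("{dataset}.sorted.txt")  # deleted once all consumers have run
    shell:
        "sort {input} > {output}"

rule archive:
    input:
        "{dataset}.sorted.txt"
    output:
        protected("{dataset}.sorted.txt.gz")  # write-protected final result
    shell:
        "gzip -c {input} > {output}"
```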
What's new?
rule samtools_sort:
    input:
        "mapped/{sample}.bam"
    output:
        "mapped/{sample}.sorted.bam"
    params:
        "-m 4G"
    threads: 8
    wrapper:
        "0.0.8/bio/samtools_sort"
Wrappers
What's coming?
rule bwa:
    input:
        "genome.fasta",
        "{sample}.fastq"
    output:
        "{sample}.bam"
    requirements:
        "bwa ==0.7.12",
        "samtools ==1.1.0"
    shell:
        "bwa mem {input} | "
        "samtools view -Sb - > {output}"
Conda support
What's coming?
DAG partitioning
AbVitro visit 2016