Johannes Köster
2020
https://koesterlab.github.io
Reproducible data analysis with Snakemake
https://snakemake.readthedocs.io
[Figure: a single dataset is analyzed to produce results]
"Let me do that by hand..."
[Figure: many datasets are analyzed to produce results]
Automation: from raw data to final figures.
Scalability: handle parallelization.
Portability: handle deployment, i.e. be able to easily execute analyses on a different system, platform, or infrastructure.
214k downloads since 2015
611 citations (+359 in 2018 and 2019)
~3 citations per week
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"

rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"

rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"
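The rules above never name each other; they are linked only through file names. As a rough illustration of that idea, the following Python sketch matches a requested target against each rule's output pattern, fills in the wildcard, and recurses into the inputs. The names `RULES`, `pattern_to_regex`, and `plan` are hypothetical, and the logic is a simplification for illustration, not Snakemake's actual implementation: it takes the first rule whose inputs can be satisfied and assumes each wildcard occurs only once per pattern.

```python
import re

# Toy rule table mirroring two of the rules above:
# (rule name, output pattern, input patterns).
# Illustration only, not Snakemake's internal data model.
RULES = [
    ("myfiltration", "result/{dataset}.filtered.txt", ["result/{dataset}.txt"]),
    ("mytask", "result/{dataset}.txt", ["path/to/{dataset}.txt"]),
]


def pattern_to_regex(pattern):
    """Turn 'result/{dataset}.txt' into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        # Even positions hold literal text, odd positions hold wildcard names.
        regex += re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
    return re.compile(regex + "$")


def plan(target, existing, rules=RULES):
    """Return the (rule, wildcards) jobs needed to create target, in execution order.

    Returns an empty list if target already exists, and None if it neither
    exists nor can be produced by any rule.
    """
    if target in existing:
        return []  # source file, nothing to run
    for name, output, inputs in rules:
        match = pattern_to_regex(output).match(target)
        if match is None:
            continue
        wildcards = match.groupdict()
        jobs = []
        for pattern in inputs:
            sub = plan(pattern.format(**wildcards), existing, rules)
            if sub is None:
                break  # an input is unavailable, so try the next rule
            jobs.extend(sub)
        else:
            return jobs + [(name, wildcards)]
    return None


existing_files = {"path/to/dataset1.txt"}
for rule_name, wildcards in plan("result/dataset1.filtered.txt", existing_files):
    print(rule_name, wildcards)
```

With only `path/to/dataset1.txt` on disk, requesting `result/dataset1.filtered.txt` yields the `mytask` job followed by the `myfiltration` job, both with `dataset` set to `dataset1`. A target whose raw input is missing yields `None`.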
rule mytask:  # rule name
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:
        "some-tool {input} > {output}"  # how to create output from input
define
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"  # reusable Python/R/Julia scripts
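Python scripts:

import pandas as pd

# The `snakemake` object is provided by Snakemake when this file is run
# through a rule's `script:` directive; it is not imported here.
data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")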
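Because such a script only touches the workflow through `snakemake.input` and `snakemake.output`, its logic can be tried outside a workflow by passing in a stand-in object. The sketch below is a minimal illustration under stated assumptions: `sort_table` and `fake` are hypothetical names, the standard-library `csv` module replaces pandas to keep the example dependency-free, the `id` column is assumed to hold integers, and the input table is assumed to be non-empty.

```python
import csv
import os
import tempfile
from types import SimpleNamespace


def sort_table(snakemake):
    """Sort a tab-separated table by its 'id' column and write it back out."""
    with open(snakemake.input[0], newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    rows.sort(key=lambda row: int(row["id"]))
    with open(snakemake.output[0], "w", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=list(rows[0].keys()),
            delimiter="\t",
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical stand-in for the object Snakemake provides at runtime.
workdir = tempfile.mkdtemp()
fake = SimpleNamespace(
    input=[os.path.join(workdir, "in.txt")],
    output=[os.path.join(workdir, "out.txt")],
)
with open(fake.input[0], "w") as handle:
    handle.write("id\tvalue\n3\tc\n1\ta\n2\tb\n")

sort_table(fake)
```

Unlike the pandas version, this sketch does not write an index column; it only demonstrates that the script depends on nothing but the first input path and the first output path.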
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    notebook:
        "notebooks/mynotebook.ipynb"
rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"  # reusable wrappers from central repository
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        # temp(): the file is deleted once all jobs that consume it have finished
        temp("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        # protected(): the file is write-protected after creation
        protected("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"
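rule mytask:
    input:
        "data/{sample}.txt"
    output:
        # pipe(): the output is a named pipe streamed to the consuming job,
        # which runs at the same time, instead of a file written to disk
        pipe("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"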
Paradigm:
Workflow definition shall be independent of computing platform and available resources
Rules:
define resource usage (threads, memory, ...)
Scheduler:
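To illustrate why declaring resource usage in the rules keeps the workflow definition independent of the platform, the toy Python sketch below picks, from a set of ready jobs, those that fit into whatever resources a given machine offers. Everything here is an assumption made for illustration: `select_jobs`, the job names, and the resource numbers are hypothetical, and the greedy strategy is a stand-in, not Snakemake's actual scheduler.

```python
def select_jobs(ready, available):
    """Greedily pick ready jobs whose declared resources fit into what is available.

    ready: list of (job name, {"threads": int, "mem_mb": int}) tuples
    available: dict with the same keys, describing the machine at hand
    """
    remaining = dict(available)
    selected = []
    # Consider the most thread-hungry jobs first (stable for equal thread counts).
    for name, needs in sorted(ready, key=lambda job: -job[1]["threads"]):
        if all(needs[key] <= remaining[key] for key in needs):
            for key in needs:
                remaining[key] -= needs[key]
            selected.append(name)
    return selected


# Hypothetical ready jobs with the resources their rules declare.
jobs = [
    ("mytask_A", {"threads": 4, "mem_mb": 4000}),
    ("mytask_B", {"threads": 4, "mem_mb": 4000}),
    ("myfiltration_A", {"threads": 1, "mem_mb": 1000}),
]

# The same job list is scheduled differently on different machines.
workstation = {"threads": 4, "mem_mb": 8000}
server = {"threads": 16, "mem_mb": 64000}

print(select_jobs(jobs, workstation))
print(select_jobs(jobs, server))
```

On the hypothetical workstation only `mytask_A` fits at once, while the server can start all three jobs together. The job descriptions stay identical in both cases; only the `available` dictionary changes.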
workstation
compute server
cluster
grid computing
cloud computing
# perform dry-run
snakemake -n

# execute workflow locally with 16 CPU cores
snakemake --cores 16

# execute on cluster
snakemake --cluster qsub --jobs 100

# execute in the cloud
snakemake --kubernetes --jobs 1000 --default-remote-provider GS --default-remote-prefix mybucket
[Figure: multiple datasets and their results, with shared data]
Full reproducibility: install required software and all dependencies in exact versions.
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

# envs/some-tool.yaml
channels:
  - conda-forge
dependencies:
  - some-tool =2.3.1
  - some-lib =1.1.2
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool:2.3.1"
    shell:
        "some-tool {input} > {output}"
container:
    "docker://continuumio/miniconda3:4.4.1"  # define OS

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"  # define tools/libs
    shell:
        "some-tool {input} > {output}"
With automation, scalability, and portability, Snakemake covers all three dimensions of fully reproducible data analysis.
Contributors:
Andreas Wilm
Anthony Underwood
Ryan Dale
David Alexander
Elias Kuthe
Elmar Pruesse
Hyeshik Chang
Jay Hesselberth
Jesper Foldager
John Huddleston
all users and supporters
Joona Lehtomäki
Justin Fear
Karel Brinda
Karl Gutwin
Kemal Eren
Kostis Anagnostopoulos
Kyle A. Beauchamp
Simon Ye
Tobias Marschall
Willem Ligtenberg
Development team:
Christopher Tomkins-Tinch
David Koppstein
Tim Booth
Manuel Holtgrewe
Christian Arnold
Wibowo Arindrarto
Rasmus Ågren
Kyle Meyer
Lance Parsons
Manuel Holtgrewe
Marcel Martin
Matthew Shirley
Mattias Franberg
Matt Shirley
Paul Moore
percyfal
Per Unneberg
Ryan C. Thompson
Ryan Dale
Sean Davis