Reproducible data analysis with Snakemake (25min)

Johannes Köster

2020

https://koesterlab.github.io

https://snakemake.readthedocs.io

Data analysis

[Figure: datasets → results]

"Let me do that by hand..."

Reproducible data analysis

automation

From raw data to final figures:

document parameters, tools, versions
execute without manual intervention

scalability

Handle parallelization:

execute for tens to thousands of datasets
efficiently use any computing platform

portability

Handle deployment:

easily execute analyses on a different system, platform, or infrastructure

Snakemake is popular

214k downloads since 2015
611 citations (+359 in 2018 and 2019)
~3 citations per week

Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:
        "some-tool {input} > {output}"

A rule has a rule name and defines

input
output
log files
parameters
resources

as well as how to create output from input.
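
The `{dataset}` and `{sample}` placeholders in the rules above are wildcards: a requested output file is matched against the output pattern, and the extracted values are filled into the input pattern. The following Python sketch illustrates that idea with a regular expression. It is a simplified illustration, not Snakemake's actual implementation; the helper names `pattern_to_regex` and `infer_input` are made up for this example, and each wildcard name may appear only once per pattern.

```python
import re


def pattern_to_regex(pattern):
    """Turn a pattern such as 'result/{sample}.txt' into a compiled regex.

    Simplification: every wildcard matches one or more arbitrary characters.
    """
    # The split alternates between literal text (even index)
    # and wildcard names (odd index).
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)
        else:
            regex += f"(?P<{part}>.+)"
    return re.compile(regex + "$")


def infer_input(target, output_pattern, input_pattern):
    """Return the input path needed to create `target`, or None if the rule does not apply."""
    match = pattern_to_regex(output_pattern).match(target)
    if match is None:
        return None
    return input_pattern.format(**match.groupdict())


print(infer_input("result/A.txt", "result/{sample}.txt", "data/{sample}.txt"))
print(infer_input("plots/A.pdf", "result/{sample}.txt", "data/{sample}.txt"))
```

The first call resolves the wildcard `sample` to `A` and yields `data/A.txt`; the second returns `None` because the target does not fit the output pattern, so this hypothetical rule would not be used to produce it.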

Boilerplate-free integration of scripts

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"

reusable Python/R/Julia scripts

Python scripts:

import pandas as pd

# `snakemake` is not imported: Snakemake injects it when running the script
data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")
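
In a script run through the `script` directive, `snakemake` is an object provided at run time, which is why the script needs no argument parsing. To try the same kind of logic outside a workflow, one option is to build a small stand-in object. The sketch below does this with `types.SimpleNamespace` and uses the standard library `csv` module instead of pandas so that it runs without extra dependencies. The file names, the `sort_by_id` helper, and the assumption that `id` holds integers are illustrative only.

```python
import csv
import os
import tempfile
from types import SimpleNamespace


def sort_by_id(snakemake):
    """Read a tab-separated table, sort its rows by the 'id' column, write the result."""
    with open(snakemake.input[0], newline="") as infile:
        reader = csv.DictReader(infile, delimiter="\t")
        fieldnames = reader.fieldnames
        rows = sorted(reader, key=lambda row: int(row["id"]))
    with open(snakemake.output[0], "w", newline="") as outfile:
        writer = csv.DictWriter(
            outfile, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)


# Stand-in for the object that Snakemake would inject (hypothetical paths).
workdir = tempfile.mkdtemp()
fake = SimpleNamespace(
    input=[os.path.join(workdir, "A.txt")],
    output=[os.path.join(workdir, "A.sorted.txt")],
)
with open(fake.input[0], "w", newline="") as f:
    f.write("id\tvalue\n3\tc\n1\ta\n2\tb\n")

sort_by_id(fake)
with open(fake.output[0]) as f:
    sorted_text = f.read()
print(sorted_text)
```

Because the function only touches `input[0]` and `output[0]`, the same body works unchanged whether the object comes from Snakemake or from a test harness like this one.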

Jupyter notebook integration

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    notebook:
        "notebooks/mynotebook.ipynb"

Integrated interactive edit mode.
Automatic generalization for reuse in other jobs.


Reusable wrappers

rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"

reusable wrappers from a central repository

Output handling

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        # temp: deleted once all rules that use it as input have completed
        temp("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"


Output handling

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        # protected: write-protected after creation to guard against accidental overwriting
        protected("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"


Output handling

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        # pipe: streamed to the consuming job through a named pipe instead of being stored on disk
        pipe("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

Scheduling

Paradigm:

Workflow definition shall be independent of computing platform and available resources

Rules:

define resource usage (threads, memory, ...)

Scheduler:

solves multidimensional knapsack problem
schedules independent jobs in parallel
passes resource requirements to any backend
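
To make the knapsack view of scheduling concrete, the following Python sketch picks, from a set of ready jobs, the subset that uses the most cores without exceeding the available cores and memory. It is a brute-force toy under stated assumptions: the job names, the resource numbers, and the objective of maximizing used cores are invented for illustration, and Snakemake's real scheduler is more sophisticated.

```python
from itertools import combinations

# Hypothetical ready jobs: name -> (threads, memory in MB)
jobs = {
    "map_A": (8, 16000),
    "map_B": (8, 16000),
    "sort_A": (4, 4000),
    "plot": (1, 1000),
}


def select_jobs(jobs, cores, mem_mb):
    """Return the subset of jobs that maximizes used cores within both limits."""
    best, best_cores = (), 0
    names = sorted(jobs)
    # Enumerate every non-empty subset; fine for a handful of jobs only.
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            used_cores = sum(jobs[n][0] for n in subset)
            used_mem = sum(jobs[n][1] for n in subset)
            fits = used_cores <= cores and used_mem <= mem_mb
            if fits and used_cores > best_cores:
                best, best_cores = subset, used_cores
    return set(best)


selected = select_jobs(jobs, cores=16, mem_mb=24000)
print(selected)
```

With 16 cores and 24000 MB, the two mapping jobs cannot run together because their combined memory exceeds the limit, so the sketch selects one mapping job plus the two small jobs. Each resource adds one constraint, which is what makes the problem multidimensional.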


Scalable to any platform

workstation

compute server

cluster

grid computing

cloud computing


Command-line interface

# perform dry-run
snakemake -n

# execute workflow locally with 16 CPU cores
snakemake --cores 16


# execute on cluster
snakemake --cluster qsub --jobs 100


# execute in the cloud
snakemake --kubernetes --jobs 1000 --default-remote-provider GS --default-remote-prefix mybucket

Between-workflow caching

[Figure: datasets → results, with shared data reused between workflows]
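
One way to think about caching between workflows is that a result can be reused whenever the step that produces it, its parameters, and its input data are all identical. The sketch below derives a cache key from these three ingredients with `hashlib`. It illustrates the general idea only; the `cache_key` function, the `min_len` parameter, and the choice of what gets hashed are assumptions, not a description of Snakemake's caching code.

```python
import hashlib
import json


def cache_key(command, params, input_contents):
    """Hash everything that determines a result into a single hex digest."""
    h = hashlib.sha256()
    h.update(command.encode())
    # sort_keys makes the digest independent of parameter order
    h.update(json.dumps(params, sort_keys=True).encode())
    for content in input_contents:
        h.update(hashlib.sha256(content).digest())
    return h.hexdigest()


command = "some-tool {input} > {output}"
key_a = cache_key(command, {"min_len": 10}, [b"ACGT"])
key_b = cache_key(command, {"min_len": 10}, [b"ACGT"])
key_c = cache_key(command, {"min_len": 20}, [b"ACGT"])
print(key_a == key_b, key_a == key_c)
```

Two workflows that run the same step on the same shared data obtain the same key and could share the stored result, while any change to a parameter or input yields a different key.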

portability

Full reproducibility:

install required software and all dependencies in exact versions

Conda integration

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

# envs/some-tool.yaml
channels:
  - conda-forge
dependencies:
  - some-tool =2.3.1
  - some-lib =1.1.2

Container integration

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool:2.3.1"
    shell:
        "some-tool {input} > {output}"


Containers + Conda

# define OS
container:
    "docker://continuumio/miniconda3:4.4.1"


rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    # define tools/libs
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

Self-contained HTML reports


More features

conditional execution
graph partitioning
resource-constrained scheduling
various ways to constrain or enforce job execution
data provenance and log file handling
CWL integration
...


Conclusion

With

a human-readable specification language
reusable modularization capabilities
seamless execution on all platforms without adaptation of the workflow definition
integrated package management and containerization

Snakemake covers all three dimensions of fully reproducible data analysis: automation, scalability, and portability.

Acknowledgements

Development team:

Christopher Tomkins-Tinch
David Koppstein
Tim Booth
Manuel Holtgrewe
Christian Arnold
Wibowo Arindrarto
Rasmus Ågren

Contributors:

Andreas Wilm
Anthony Underwood
David Alexander
Elias Kuthe
Elmar Pruesse
Hyeshik Chang
Jay Hesselberth
Jesper Foldager
John Huddleston
Joona Lehtomäki
Justin Fear
Karel Brinda
Karl Gutwin
Kemal Eren
Kostis Anagnostopoulos
Kyle A. Beauchamp
Kyle Meyer
Lance Parsons
Manuel Holtgrewe
Marcel Martin
Matthew Shirley
Mattias Franberg
Matt Shirley
Paul Moore
percyfal
Per Unneberg
Ryan C. Thompson
Ryan Dale
Sean Davis
Simon Ye
Tobias Marschall
Willem Ligtenberg

all users and supporters