Transparent, adaptable, and reproducible

data analysis

with Snakemake

Johannes Köster

University of Duisburg-Essen

dataset

results

Data analysis

"Let me do that by hand..."

dataset

results

dataset

"Let me do that by hand..."

Data analysis

check computational validity
apply same analysis to new data

check methodological validity
understand analysis

Data analysis

Reproducibility

Transparency

modify analysis
extend analysis

Adaptability

>567k downloads since 2015

>1500 citations

>7 citations per week in 2021

automation
scalability
portability

readability
documentation
traceability

Data analysis

Reproducibility

Transparency

readability
portability
scalability

Adaptability

Data analysis

automation
scalability
portability

readability
documentation
traceability

Reproducibility

Transparency

readability
portability
scalability

Adaptability

dataset

results

dataset

Define workflows

in terms of rules

Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
     input:
        "result/{dataset}.txt"
     output:
        "result/{dataset}.filtered.txt"
     shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Define workflows

in terms of rules

Define workflows

in terms of rules

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:
        "some-tool {input} > {output}"

rule name

how to create output from input

define

input
output
log files
parameters
resources

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
     input:
        "result/{dataset}.txt"
     output:
        "result/{dataset}.filtered.txt"
     shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Define workflows

in terms of rules

Dependencies are determined top-down

Solution:

For a given target, a rule that can be applied to create it is determined (a job).
For the input files of the rule, go on recursively.
If no target is specified, Snakemake tries to apply the first rule in the workflow.

Problem:

for a given set of targets, find a composition of rules to create them

Dependencies are determined top-down




rule all:
    input:
        "D1.sorted.txt",
        "D2.sorted.txt",
        "D3.sorted.txt"


rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Job 1: apply rule all

(a target rule that just collects results)

Job i: apply rule sort to create i-th input of job 1

Dependencies are determined top-down

DATASETS = ["D1", "D2", "D3"]


rule all:
    input:
        ["{dataset}.sorted.txt".format(dataset=ds)
         for ds in DATASETS]



rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Job 1: apply rule all

(a target rule that just collects results)

Job i: apply rule sort to create i-th input of job 1

use arbitrary Python code in your workflow

Dependencies are determined top-down

DATASETS = ["D1", "D2", "D3"]


rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=DATASETS)




rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Job 1: apply rule all

(a target rule that just collects results)

Job i: apply rule sort to create i-th input of job 1

use arbitrary Python code in your workflow

Directed acyclic graph (DAG) of jobs

Boilerplate-free integration of scripts

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"

reusable scripts:

Python
R
Julia
Rust

import pandas as pd

data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")

Python:

data <- read.table(snakemake@input[[1]])
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])

Boilerplate-free integration of scripts

import polar as pl

pl.read_csv(&snakemake.input[0])
  .sort()
  .to_csv(&snakemake.output[0])

Rust:

Jupyter notebook integration

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    notebook:
        "notebooks/mynotebook.ipynb"

Integrated interactive edit mode.
Automatic generalization for reuse in other jobs.

Reusable wrappers

rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"

reuseable wrappers from central repository

Reusable wrappers

Output handling

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        temp("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

Output handling

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        protected("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

Output handling

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        pipe("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

Using and combining workflows

configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        github("snakemake-workflows/dna-seq-gatk-variant-calling", path="workflow/Snakefile", tag="v1.17.0")
    config:
        config

use rule * from dna_seq

# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"

Using and combining workflows

configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        github("snakemake-workflows/dna-seq-gatk-variant-calling", path="workflow/Snakefile", tag="v1.17.0")
    config:
        config["dna-seq"]

use rule * from dna_seq as dna_seq_*

# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"

module rna_seq:
    snakefile:
        github("snakemake-workflows/rna-seq-kallisto-sleuth", path="workflow/Snakefile", tag="v1.0.0")
    config:
        config["rna-seq"]

use rule * from rna_seq as rna_seq_*

Data analysis

automation
scalability
portability

readability
documentation
traceability

Reproducibility

Transparency

readability
portability
scalability

Adaptability

Job execution

A job is executed if and only if

output file is target and does not exist
output file needed by another executed job and does not exist
input file newer than output file
input file will be updated by other job
any change in the rule (code, params, software stack)
execution is enforced

determined via breadth-first-search on DAG of jobs

Resources and priorities

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    priority: 1
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output}"

refer to defined thread number

define arbitrary additional resources

define used threads

\max U_t \cdot 2S \cdot \sum_{j \in J} x_j \cdot p_j + 2S \cdot \sum_{j \in J} x_j \cdot u_{t,j}\\ + S \cdot \sum_{f \in F} \gamma_f \cdot S_f + \sum_{f \in F} \delta_f \cdot S_f

\sum_{j \in J} x_j \cdot u_{r,j} \leq U_r \quad \forall r \in R

\delta_f \leq \frac{\sum_{j \in J} x_j \cdot z_{f,j}}{\sum_{j \in J} z_{f,j}} \quad\forall f \in F

\text{subject to:}

job selection

job resource usage

free resources

job temp file consumption

temp file lifetime fraction

job priority

job thread usage

Scheduling

temp file size

temp file deletion

\gamma_f \leq \delta_f \quad\forall f \in F

\gamma_f \in \{0,1\}

\delta_f \in [0,1]

x_j \in \{0,1\}

DAG partitioning

--groups a=g1 b=g1

--groups a=g1 b=g1
--group-components g1=2

--groups a=g1 b=g1
--group-components g1=5

Scalable to any platform

workstation

compute server

cluster

grid computing

cloud computing

Between workflow caching

dataset

results

dataset

shared data

Between workflow caching

Data analysis

automation
scalability
portability

readability
documentation
traceability

Reproducibility

Transparency

readability
portability
scalability

Adaptability

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

Conda integration

channels:
 - conda-forge
dependencies:
  - some-tool =2.3.1
  - some-lib =1.1.2

Container integration

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool#2.3.1"
    shell:
        "some-tool {input} > {output}"

Containerization

containerized:
    "docker://username/myworkflow:1.0.0"


rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

snakemake --containerize > Dockerfile

Data analysis

automation
scalability
portability

readability
documentation
traceability

Reproducibility

Transparency

readability
portability
scalability

Adaptability

Self-contained HTML reports

Snakemake workflow catalog

Conclusion

Besides reproducibility, transparency and adaptability are equally important.
Almost never, research can be done in a single big script or package.
Snakemake helps to avoid boilerplate for deployment, data provenance and scalability, and make analyses more structured.
Snakemake further allows reporting with minimal effort.

https://snakemake.github.io

Best practices

├── .gitignore
├── README.md
├── LICENSE.md
├── workflow
│   ├── resources
|   │   └── someresouce.bed
│   ├── rules
|   │   ├── some_task.smk
|   │   └── some_other_task.smk
│   ├── envs
|   │   ├── tool1.yaml
|   │   └── tool2.yaml
│   ├── scripts
|   │   ├── script1.py
|   │   └── script2.R
│   ├── notebooks
|   │   ├── notebook1.py.ipynb
|   │   └── notebook2.r.ipynb
│   ├── report
|   │   ├── plot1.rst
|   │   └── plot2.rst
|   └── Snakefile
├── config
│   ├── config.yaml
│   └── some-sheet.tsv
├── results
└── resources

Stick to a standardized structure

Goal:

Help readers to navigate to those parts they are interested in.

separate config, code, resources, and results
have separate folders for separate parts of the source
partition workflow into larger tasks which are performed by sets of rules, have a separate file per task

Code readability

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    log:
        "logs/mytask/{dataset}.log"
    conda:
      	"envs/myenv.yaml"
    shell:
        "some-tool {input} > {output} 2> {log}"

simple, short, descriptive names
reasonable directory structure
use shell for short commands, use script and notebook for complex stuff
put non-rule code (helper functions, config parsing) into separate module, e.g. workflow/rules/common.smk
conda/container for every rule
prefer small envs over large ones (maintainability, performance)
define log files
think about which rules' outputs shall be cached across workflows

Transparent, adaptable, and reproducible

data analysis

with Snakemake

Data analysis

Data analysis

Data analysis

Data analysis

Data analysis

Define workflows

in terms of rules

Define workflows

in terms of rules

Define workflows

in terms of rules

Define workflows

in terms of rules

Define workflows

in terms of rules

Dependencies are determined top-down

Dependencies are determined top-down

Dependencies are determined top-down

Dependencies are determined top-down

Directed acyclic graph (DAG) of jobs

Boilerplate-free integration of scripts

Boilerplate-free integration of scripts

Jupyter notebook integration

Reusable wrappers

Reusable wrappers

Output handling

Output handling

Output handling

Using and combining workflows

Using and combining workflows

Data analysis

Job execution

Resources and priorities

Scheduling

DAG partitioning

Scalable to any platform

Between workflow caching

Between workflow caching

Data analysis

Conda integration

Container integration

Containerization

Data analysis

Self-contained HTML reports

Snakemake workflow catalog

Conclusion

Best practices

Stick to a standardized structure

Code readability

Snakemake lecture

More from Johannes Köster