Transparent, adaptable, and reproducible

data analysis

with Snakemake

Johannes Köster

 

University of Duisburg-Essen

dataset

results

Data analysis

"Let me do that by hand..."

dataset

results

dataset

dataset

dataset

dataset

dataset

"Let me do that by hand..."

Data analysis

  • check computational validity
  • apply same analysis to new data
  • check methodological validity
  • understand analysis

Data analysis

Reproducibility

Transparency

  • modify analysis
  • extend analysis

Adaptability

>567k downloads since 2015

>1500 citations

>7 citations per week in 2021

  • automation
  • scalability
  • portability
  • readability
  • documentation
  • traceability

Data analysis

Reproducibility

Transparency

  • readability
  • portability
  • scalability

Adaptability

Data analysis

  • automation
  • scalability
  • portability
  • readability
  • documentation
  • traceability

Reproducibility

Transparency

  • readability
  • portability
  • scalability

Adaptability

dataset

results

dataset

dataset

dataset

dataset

dataset

Define workflows

in terms of rules

Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
     input:
        "result/{dataset}.txt"
     output:
        "result/{dataset}.filtered.txt"
     shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Define workflows

in terms of rules

Define workflows

in terms of rules

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:
        "some-tool {input} > {output}"

rule name

how to create output from input

define

  • input
  • output
  • log files
  • parameters
  • resources
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
     input:
        "result/{dataset}.txt"
     output:
        "result/{dataset}.filtered.txt"
     shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Define workflows

in terms of rules

Dependencies are determined top-down

Solution:

  • For a given target, a rule that can be applied to create it is determined (a job).
  • For the input files of the rule, go on recursively.
  • If no target is specified, Snakemake tries to apply the first rule in the workflow.

Problem:

for a given set of targets, find a composition of rules to create them

Dependencies are determined top-down




rule all:
    input:
        "D1.sorted.txt",
        "D2.sorted.txt",
        "D3.sorted.txt"


rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Job 1: apply rule all

(a target rule that just collects results)

Job i: apply rule sort to create i-th input of job 1

Dependencies are determined top-down

DATASETS = ["D1", "D2", "D3"]


rule all:
    input:
        ["{dataset}.sorted.txt".format(dataset=ds)
         for ds in DATASETS]



rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Job 1: apply rule all

(a target rule that just collects results)

Job i: apply rule sort to create i-th input of job 1

use arbitrary Python code in your workflow

Dependencies are determined top-down

DATASETS = ["D1", "D2", "D3"]


rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=DATASETS)




rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Job 1: apply rule all

(a target rule that just collects results)

Job i: apply rule sort to create i-th input of job 1

use arbitrary Python code in your workflow

Directed acyclic graph (DAG) of jobs

Boilerplate-free integration of scripts

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"

reusable scripts:

  • Python
  • R
  • Julia
  • Rust
import pandas as pd

data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")

Python:

data <- read.table(snakemake@input[[1]])
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])

Boilerplate-free integration of scripts

R:

import polar as pl

pl.read_csv(&snakemake.input[0])
  .sort()
  .to_csv(&snakemake.output[0])

Rust:

Jupyter notebook integration

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    notebook:
        "notebooks/mynotebook.ipynb"
  1. Integrated interactive edit mode.
  2. Automatic generalization for reuse in other jobs.

Reusable wrappers

rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"

reuseable wrappers from central repository

Reusable wrappers

Output handling

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        temp("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

Output handling

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        protected("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

Output handling

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        pipe("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

Input functions

Workflows are executed in three phases

  • initialization phase (parsing)
  • DAG phase (DAG is built)
  • scheduling phase (execution of DAG)

Input functions defer determination of input files to the DAG phase

(when wildcard values are known).

Input functions

configfile: "config.yaml"


rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=config["datasets"])


def get_sort_input(wildcards):
    return config["datasets"][wildcards.dataset]


rule sort:
    input:
        get_sort_input
    output:
        "{dataset}.sorted.txt"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output}"
  • initialization phase (parsing)
  • DAG phase (DAG is built)
  • scheduling phase (execution of DAG)

Using and combining workflows

configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        "https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/raw/v2.0.1/Snakefile"
    config:
        config

use rule * from dna_seq
# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"

Using and combining workflows

configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        "https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/raw/v2.0.1/Snakefile"
    config:
        config["dna-seq"]

use rule * from dna_seq as dna_seq_*
# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"
module rna_seq:
    snakefile:
        "https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/raw/v1.0.0/Snakefile"
    config:
        config["rna-seq"]

use rule * from rna_seq as rna_seq_*

Data analysis

  • automation
  • scalability
  • portability
  • readability
  • documentation
  • traceability

Reproducibility

Transparency

  • readability
  • portability
  • scalability

Adaptability

Job execution

A job is executed if and only if

  • output file is target and does not exist
  • output file needed by another executed job and does not exist
  • input file newer than output file
  • input file will be updated by other job
  • any change in the rule (code, params, software stack)
  • execution is enforced

determined via breadth-first-search on DAG of jobs

Resources and priorities

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    priority: 1
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output}"

refer to defined thread number

define arbitrary additional resources

define used threads

\max U_t \cdot 2S \cdot \sum_{j \in J} x_j \cdot p_j + 2S \cdot \sum_{j \in J} x_j \cdot u_{t,j}\\ + S \cdot \sum_{f \in F} \gamma_f \cdot S_f + \sum_{f \in F} \delta_f \cdot S_f
\sum_{j \in J} x_j \cdot u_{r,j} \leq U_r \quad \forall r \in R
\delta_f \leq \frac{\sum_{j \in J} x_j \cdot z_{f,j}}{\sum_{j \in J} z_{f,j}} \quad\forall f \in F
\text{subject to:}

job selection

job resource usage

free resources

job temp file consumption

temp file lifetime fraction

job priority

job thread usage

Scheduling

temp file size

temp file deletion

\gamma_f \leq \delta_f \quad\forall f \in F
\gamma_f \in \{0,1\}
\delta_f \in [0,1]
x_j \in \{0,1\}

DAG partitioning

--groups a=g1 b=g1
--groups a=g1 b=g1
--group-components g1=2
--groups a=g1 b=g1
--group-components g1=5

Scalable to any platform

workstation

compute server

cluster

grid computing

cloud computing

Between workflow caching

dataset

results

dataset

dataset

dataset

dataset

dataset

shared data

Between workflow caching

Data analysis

  • automation
  • scalability
  • portability
  • readability
  • documentation
  • traceability

Reproducibility

Transparency

  • readability
  • portability
  • scalability

Adaptability

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

Conda integration

channels:
 - conda-forge
dependencies:
  - some-tool =2.3.1
  - some-lib =1.1.2

Container integration

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool#2.3.1"
    shell:
        "some-tool {input} > {output}"

Containerization

containerized:
    "docker://username/myworkflow:1.0.0"


rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"
snakemake --containerize > Dockerfile

Data analysis

  • automation
  • scalability
  • portability
  • readability
  • documentation
  • traceability

Reproducibility

Transparency

  • readability
  • portability
  • scalability

Adaptability

Self-contained HTML reports

Snakemake workflow catalog

Conclusion

  • Besides reproducibility, transparency and adaptability are equally important.
  • Almost never, research can be done in a single big script or package.
  • Snakemake helps to avoid boilerplate for deployment, data provenance and scalability, and make analyses more structured.
  • Snakemake further allows reporting with minimal effort.

https://snakemake.github.io

Best practices

├── .gitignore
├── README.md
├── LICENSE.md
├── workflow
│   ├── resources
|   │   └── someresouce.bed
│   ├── rules
|   │   ├── some_task.smk
|   │   └── some_other_task.smk
│   ├── envs
|   │   ├── tool1.yaml
|   │   └── tool2.yaml
│   ├── scripts
|   │   ├── script1.py
|   │   └── script2.R
│   ├── notebooks
|   │   ├── notebook1.py.ipynb
|   │   └── notebook2.r.ipynb
│   ├── report
|   │   ├── plot1.rst
|   │   └── plot2.rst
|   └── Snakefile
├── config
│   ├── config.yaml
│   └── some-sheet.tsv
├── results
└── resources

Stick to a standardized structure

Goal:

Help readers to navigate to those parts they are interested in.

 

  • separate config, code, resources, and results
  • have separate folders for separate parts of the source
  • partition workflow into larger tasks which are performed by sets of rules, have a separate file per task

Code readability

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    log:
        "logs/mytask/{dataset}.log"
    conda:
      	"envs/myenv.yaml"
    shell:
        "some-tool {input} > {output} 2> {log}"
  • simple, short, descriptive names
  • reasonable directory structure
  • use shell for short commands, use script and notebook for complex stuff
  • put non-rule code (helper functions, config parsing) into separate module, e.g. workflow/rules/common.smk
  • conda/container for every rule
  • prefer small envs over large ones (maintainability, performance)
  • define log files
  • think about which rules' outputs shall be cached across workflows

Snakemake lecture

By Johannes Köster