Transparency, reproducibility and the democratization of an ecosystem - the benefits of Snakemake 8

Johannes Köster

 

2024

Data analysis

[Figure: a single dataset is processed into results]

"Let me do that by hand..."

[Figure: many datasets need to be processed into results]

"Let me do that by hand..."

Data analysis

Reproducibility

  • check computational validity
  • apply same analysis to new data

Transparency

  • check methodological validity
  • understand analysis

Adaptability

  • modify analysis
  • extend analysis

>1 million downloads since 2015

>2700 citations

>11 citations per week in 2023

Data analysis

Reproducibility

  • automation
  • scalability
  • portability

Transparency

  • readability
  • documentation
  • traceability

Adaptability

  • readability
  • portability
  • scalability

Define workflows in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:
        "some-tool {input} > {output}"

In the rule above:

  • mytask is the rule name
  • the shell command states how to create output from input

A rule can define

  • input
  • output
  • log files
  • parameters
  • resources

Automatic inference of DAG of jobs
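How a DAG of jobs can be derived from the three rules shown earlier is sketched below in plain Python. This is a simplified illustration, not Snakemake's implementation: RULES, match and build_dag are hypothetical names, all intermediate files are assumed to live under result/, and wildcards are restricted to characters other than "/" and "." so that every target matches exactly one rule.

```python
import re

# Simplified stand-ins for the three rules: (name, input patterns, output pattern).
RULES = [
    ("mytask", ["path/to/{dataset}.txt"], "result/{dataset}.txt"),
    ("myfiltration", ["result/{dataset}.txt"], "result/{dataset}.filtered.txt"),
    (
        "aggregate",
        ["result/dataset1.filtered.txt", "result/dataset2.filtered.txt"],
        "plots/myplot.pdf",
    ),
]


def match(pattern, path):
    """Return the wildcard values if path matches pattern, else None."""
    # re.split with a capturing group alternates: literal, wildcard name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>[^/.]+)"
        for i, part in enumerate(parts)
    )
    found = re.fullmatch(regex, path)
    return found.groupdict() if found else None


def build_dag(target, existing, edges=None):
    """Collect (rule, output, input) edges that are needed to create target."""
    edges = [] if edges is None else edges
    if target in existing:
        return edges  # source file that is already present
    for name, inputs, output in RULES:
        wildcards = match(output, target)
        if wildcards is None:
            continue
        for pattern in inputs:
            needed = pattern.format(**wildcards)
            edges.append((name, target, needed))
            build_dag(needed, existing, edges)
        return edges
    raise FileNotFoundError(f"no rule produces {target}")


edges = build_dag(
    "plots/myplot.pdf",
    {"path/to/dataset1.txt", "path/to/dataset2.txt"},
)
for rule, output, needed in edges:
    print(f"{rule}: {needed} -> {output}")
```

Requesting only the final plot is enough: the wildcard values for the upstream jobs follow from matching each required file against the output patterns.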

Boilerplate-free integration of scripts

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"

reusable scripts:

  • Python
  • R
  • Julia
  • Rust
  • Bash

Python:

import pandas as pd

data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")

R:

data <- read.table(snakemake@input[[1]], header = TRUE)
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])

Rust:

import polar as pl

pl.read_csv(&snakemake.input[0])
  .sort()
  .to_csv(&snakemake.output[0])
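Because a script only talks to the injected snakemake object, its logic can be tried outside of a workflow by handing it a stand-in. The sketch below does this with the standard library only. It is an illustration under assumptions: sort_by_id and fake are hypothetical names, the "id" column is assumed to hold integers, and only the input and output attributes of the real object are imitated.

```python
import csv
import os
import tempfile
from types import SimpleNamespace


def sort_by_id(snakemake):
    """Sort a tab-separated table by its id column, like the pandas script."""
    with open(snakemake.input[0], newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    # Assumption: the "id" column holds integers.
    rows.sort(key=lambda row: int(row["id"]))
    with open(snakemake.output[0], "w", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=list(rows[0]),
            delimiter="\t",
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical stand-in for the object that Snakemake injects into scripts.
workdir = tempfile.mkdtemp()
fake = SimpleNamespace(
    input=[os.path.join(workdir, "unsorted.txt")],
    output=[os.path.join(workdir, "sorted.txt")],
)
with open(fake.input[0], "w") as handle:
    handle.write("id\tvalue\n2\tb\n1\ta\n")

sort_by_id(fake)
```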

Reusable wrappers

rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"

reusable wrappers from a central repository

Scheduling

\max \; U_t \cdot 2S \cdot \sum_{j \in J} x_j \cdot p_j + 2S \cdot \sum_{j \in J} x_j \cdot u_{t,j} + S \cdot \sum_{f \in F} \gamma_f \cdot S_f + \sum_{f \in F} \delta_f \cdot S_f

\text{subject to:}

\sum_{j \in J} x_j \cdot u_{r,j} \leq U_r \quad \forall r \in R
\delta_f \leq \frac{\sum_{j \in J} x_j \cdot z_{f,j}}{\sum_{j \in J} z_{f,j}} \quad \forall f \in F
\gamma_f \leq \delta_f \quad \forall f \in F
x_j \in \{0,1\} \quad \forall j \in J
\gamma_f \in \{0,1\} \quad \forall f \in F
\delta_f \in [0,1] \quad \forall f \in F

  • x_j: job selection
  • p_j: job priority
  • u_{t,j}: job thread usage
  • u_{r,j}: job resource usage
  • U_r: free resources
  • z_{f,j}: job temp file consumption
  • \delta_f: temp file lifetime fraction
  • \gamma_f: temp file deletion
  • S_f: temp file size
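To get a feeling for the job selection part of this problem, the sketch below solves a reduced version by brute force. It is an illustration under assumptions, not Snakemake's scheduler: the temp file terms are dropped, the remaining objective is divided by the constant 2S, all subsets are enumerated instead of calling a solver, and select_jobs as well as the example jobs are hypothetical.

```python
from itertools import combinations


def select_jobs(jobs, free):
    """Pick the feasible subset of ready jobs with the best score.

    jobs: name -> {"priority": int, "usage": {resource: amount}}
    free: resource -> free amount (must contain "threads")
    """
    free_threads = free["threads"]
    best, best_score = (), -1
    names = sorted(jobs)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            # Constraint: summed usage of every resource stays within what is free.
            if any(
                sum(jobs[j]["usage"].get(res, 0) for j in subset) > cap
                for res, cap in free.items()
            ):
                continue
            # Reduced objective: U_t * sum of priorities + sum of thread usage.
            score = free_threads * sum(jobs[j]["priority"] for j in subset)
            score += sum(jobs[j]["usage"].get("threads", 0) for j in subset)
            if score > best_score:
                best, best_score = subset, score
    return set(best)


jobs = {
    "a": {"priority": 0, "usage": {"threads": 4, "mem_mb": 8000}},
    "b": {"priority": 0, "usage": {"threads": 4, "mem_mb": 8000}},
    "c": {"priority": 1, "usage": {"threads": 8, "mem_mb": 4000}},
}
free = {"threads": 8, "mem_mb": 16000}
selected = select_jobs(jobs, free)
```

Multiplying the priority sum by the number of free threads makes priority the dominant criterion, with thread usage acting as a tie breaker. Here the high-priority job c is chosen over running a and b in parallel.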

DAG partitioning

--groups a=g1 b=g1
--groups a=g1 b=g1 --group-components g1=2
--groups a=g1 b=g1 --group-components g1=5
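The effect of --group-components can be pictured as batching the connected components of a group into fewer submitted jobs. The sketch below illustrates this under assumptions: batch_components is a hypothetical helper, and the example supposes five datasets for which rules a and b form one connected component each.

```python
def batch_components(components, n):
    """Merge every n connected components of a group into one submitted job."""
    return [
        [job for component in components[i:i + n] for job in component]
        for i in range(0, len(components), n)
    ]


# Assumption: five datasets, each with one job of rule a followed by one of rule b.
components = [[f"a_{i}", f"b_{i}"] for i in range(5)]

per_component = batch_components(components, 1)  # like the default
pairs = batch_components(components, 2)          # like --group-components g1=2
all_in_one = batch_components(components, 5)     # like --group-components g1=5
```

With these assumptions the three settings lead to five, three and one submissions respectively.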

Scalable to any platform

workstation

compute server

cluster

grid computing

cloud computing

Conda integration

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

envs/some-tool.yaml:

channels:
  - conda-forge
dependencies:
  - some-tool =2.3.1
  - some-lib =1.1.2

Container integration

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool:2.3.1"
    shell:
        "some-tool {input} > {output}"

Self-contained HTML reports

Many more features

  • dynamic DAG rewiring
  • service jobs (providing sockets, loading databases, or ramdisks)
  • semantic helper functions for minimizing boilerplate code
  • fallible rules
  • caching of shared results across workflows
  • transparent handling of remote storage
  • interoperability (CWL tool wrappers, integration of Nextflow workflows)

Extensible architecture

Language readability

Lookup in config: so far

def get_threshold(wildcards):
    return config["some_tool"]["thresholds"].get(wildcards.dataset, 0.1)


rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    params:
        some_threshold=get_threshold
    shell:
        "some-tool {input} > {output}"

Lookup in config: now





rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    params:
        some_threshold=lookup(
            dpath="some_tool/thresholds/{dataset}",
            within=config,
            default=0.1
        )
    shell:
        "some-tool {input} > {output}"

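What the dpath lookup does can be approximated in a few lines of plain Python. This is a sketch under assumptions, not Snakemake's implementation: lookup_dpath is a hypothetical function, wildcards are passed as a dict, and a missing key simply yields the default.

```python
def lookup_dpath(dpath, within, wildcards, default=None):
    """Resolve a slash-separated path in a nested dict after filling in wildcards."""
    node = within
    for key in dpath.format(**wildcards).split("/"):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Hypothetical config with a per-dataset threshold for dataset "a" only.
config = {"some_tool": {"thresholds": {"a": 0.5}}}

known = lookup_dpath(
    "some_tool/thresholds/{dataset}", config, {"dataset": "a"}, default=0.1
)
fallback = lookup_dpath(
    "some_tool/thresholds/{dataset}", config, {"dataset": "b"}, default=0.1
)
```

This mirrors the helper function from the "so far" variant: the wildcard value selects the entry and the default covers datasets without one.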
Lookup in sheet: so far

def get_threshold(wildcards):
    return sheet.loc[wildcards.dataset, "threshold"]


rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    params:
        some_threshold=get_threshold
    shell:
        "some-tool {input} > {output}"

Lookup in sheet: now





rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    params:
        some_threshold=lookup(
            query="dataset == '{dataset}'",
            cols="threshold",
            within=sheet
        )
    shell:
        "some-tool {input} > {output}"

Branching: so far

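def get_mytask_input(wildcards):
    # Input functions must return concrete paths, so fill in the wildcard here.
    if config["prefilter"]["activate"]:
        return f"results/preprocessed/{wildcards.dataset}"
    return f"path/to/{wildcards.dataset}"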


rule mytask:
    input:
        get_mytask_input
    output:
        "result/{dataset}.txt"
    shell:
        "some-tool {input} > {output}"

Branching: now





rule mytask:
    input:
        branch(
            lookup(dpath="prefilter/activate", within=config),
            then="results/preprocessed/{dataset}",
            otherwise="path/to/{dataset}"
        )
    output:
        "result/{dataset}.txt"
    shell:
        "some-tool {input} > {output}"

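The idea behind branch can be sketched as a function that postpones the decision until the wildcards are known. This is an illustration under assumptions, not Snakemake's implementation: branch_sketch is a hypothetical name, wildcards are passed as a dict, and the sketch always returns a function even when the condition is a plain value.

```python
def branch_sketch(condition, then=None, otherwise=None):
    """Return a function that picks then or otherwise once wildcards are known."""

    def resolve(wildcards):
        # The condition may be a plain value or a function of the wildcards.
        outcome = condition(wildcards) if callable(condition) else condition
        chosen = then if outcome else otherwise
        # Fill wildcards into the chosen pattern if it is a string.
        return chosen.format(**wildcards) if isinstance(chosen, str) else chosen

    return resolve


# Hypothetical config that activates the prefilter step.
config = {"prefilter": {"activate": True}}

get_input = branch_sketch(
    config["prefilter"]["activate"],
    then="results/preprocessed/{dataset}",
    otherwise="path/to/{dataset}",
)
chosen_path = get_input({"dataset": "a"})
```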
Snakemake workflow catalog

Conclusion

Snakemake covers all aspects of fully reproducible, transparent, and adaptable data analysis, offering

  • maximum readability
  • ad-hoc integration with scripting and high performance languages
  • an extensible architecture
  • a plethora of advanced features

https://snakemake.github.io
