Transparency, reproducibility and the democratization of an ecosystem - the benefits of Snakemake 8

Johannes Köster

 

2023

dataset

results

Data analysis

"Let me do that by hand..."

dataset

results

dataset

dataset

dataset

dataset

dataset

"Let me do that by hand..."

Data analysis

>970k downloads since 2015

>2700 citations

>11 citations per week in 2023

dataset

results

dataset

dataset

dataset

dataset

dataset

Define workflows

in terms of rules

Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
     input:
        "result/{dataset}.txt"
     output:
        "result/{dataset}.filtered.txt"
     shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Define workflows

in terms of rules

Define workflows

in terms of rules

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:
        "some-tool {input} > {output}"

rule name

how to create output from input

define

  • input
  • output
  • log files
  • parameters
  • resources
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
     input:
        "result/{dataset}.txt"
     output:
        "result/{dataset}.filtered.txt"
     shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Define workflows

in terms of rules

Boilerplate-free integration of scripts

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"

reusable scripts:

  • Python
  • R
  • Julia
  • Rust
  • Bash
import pandas as pd

data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")

Python:

data <- read.table(snakemake@input[[1]])
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])

Boilerplate-free integration of scripts

R:

import polar as pl

pl.read_csv(&snakemake.input[0])
  .sort()
  .to_csv(&snakemake.output[0])

Rust:

Jupyter notebook integration

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    notebook:
        "notebooks/mynotebook.ipynb"
  1. Integrated interactive edit mode.
  2. Automatic generalization for reuse in other jobs.

Reusable wrappers

rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"

reuseable wrappers from central repository

Reusable wrappers

Output handling

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        temp("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

Output handling

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        protected("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

Output handling

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        pipe("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

New in Snakemake 8: semantic helper functions

rule mytask:
    input:
        "data/counts.tsv"
    output:
        "results/diffexp/{model}.tsv"
    params:
        fdr=lookup(dpath="diffexp/{model}/fdr", within=config)
    script:
        "scripts/diffexp.R"

New in Snakemake 8: semantic helper functions

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "results/{sample}.txt"
    params:
        sample_props=lookup(query="sample == {sample}", within=samples)
    script:
        "scripts/somestep.py"

New in Snakemake 8: semantic helper functions

rule mytask:
    input:
        branch(
            lookup(dpath="switches/mark_duplicates", within=config),
            then="results/mapped/{sample}.bam",
            otherwise="results/mapped/{sample}.rmdup.bam",
        )
    output:
        "results/{sample}.txt"
    script:
        "scripts/somestep.py"

Using and combining workflows

configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        github("snakemake-workflows/dna-seq-gatk-variant-calling", path="workflow/Snakefile", tag="v1.17.0")
    config:
        config

use rule * from dna_seq
# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"

Using and combining workflows

configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        github("snakemake-workflows/dna-seq-gatk-variant-calling", path="workflow/Snakefile", tag="v1.17.0")
    config:
        config["dna-seq"]

use rule * from dna_seq as dna_seq_*
# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"
module rna_seq:
    snakefile:
        github("snakemake-workflows/rna-seq-kallisto-sleuth", path="workflow/Snakefile", tag="v1.0.0")
    config:
        config["rna-seq"]

use rule * from rna_seq as rna_seq_*
\max U_t \cdot 2S \cdot \sum_{j \in J} x_j \cdot p_j + 2S \cdot \sum_{j \in J} x_j \cdot (u_{t,j}) + S \cdot \sum_{f \in F} \gamma_f \cdot S_f\\ + \sum_{f \in F} \delta_f \cdot S_f
\sum_{j \in J} x_j \cdot u_{r,j} \leq U_r \quad \forall r \in R
\delta_f \leq \frac{\sum_{j \in J} x_j \cdot z_{f,j}}{\sum_{j \in J} z_{f,j}} \quad\forall f \in F
\text{subject to:}

job selection

job resource usage

free resources

job temp file consumption

temp file lifetime fraction

job priority

job thread usage

Scheduling

temp file size

temp file deletion

\gamma_f \leq \delta_f \quad\forall f \in F
\gamma_f \in \{0,1\}
\delta_f \in [0,1]
x_f \in \{0,1\}

DAG partitioning

--groups a=g1 b=g1
--groups a=g1 b=g1
--group-components g1=2
--groups a=g1 b=g1
--group-components g1=5

Scalable to any platform

workstation

compute server

cluster

grid computing

cloud computing

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

Conda integration

channels:
 - conda-forge
dependencies:
  - some-tool =2.3.1
  - some-lib =1.1.2

Container integration

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool#2.3.1"
    shell:
        "some-tool {input} > {output}"

Self-contained HTML reports

New in Snakemake 8:

a plugin architecture

Datavzrd: low-code

interactive result views

Input:

  • set of tables
  • relations between tables
  • set of rendering definitions

Output:

  • portable interactive visual presentation
  • static printable version
name: My oscar report
default-view: oscars

datasets:
  oscars:
    path: "data/oscars.csv"
    links:
      link to oscar plot:
        column: age
        view: oscar-plot
      link to movie:
        column: movie
        table-row: movies/Title

  movies:
    path: "data/movies.csv"
    links:
      link to oscar entry:
        column: Title
        table-row: oscars/movie

views:
  oscars:
    dataset: oscars
    desc: |
      ### All winning oscars beginning in the year 1929.
      This table contains *all* winning oscars for best actress and best actor.
    page-size: 25
    render-table:
      columns:
        age:
          plot:
            ticks:
              scale: linear
              domain:
                - 20
                - 100
        name:
          link-to-url: "https://lmgtfy.app/?q=Is {name} in {movie}?"
        movie:
          link-to-url: "https://de.wikipedia.org/wiki/{value}"
        award:
          plot:
            heatmap:
              scale: ordinal
              domain:
                - Best actor
                - Best actress
              range:
                - "#add8e6"
                - "#ffb6c1"
        index(0):
          display-mode: hidden
        regex('birth_.+'):
          display-mode: detail

  movies:
    dataset: movies
    render-table:
      columns:
        Genre:
          ellipsis: 15
        imdbID:
          link-to-url: "https://www.imdb.com/title/{value}/"
        Title:
          link-to-url: "https://de.wikipedia.org/wiki/{value}"
        imdbRating:
          precision: 1
          plot:
            bars:
              scale: linear
              domain:
                - 1
                - 10
        Rated:
          plot-view-legend: true
          plot:
            heatmap:
              scale: ordinal
              color-scheme: accent

  oscar-plot:
    dataset: oscars
    desc: |
      ## My beautiful oscar scatter plot
      *So many great actors and actresses*
    render-plot:
      spec-path: ".examples/specs/oscars.vl.json"

  movies-plot:
    dataset: movies
    desc: |
      All movies with its *runtime* and *ratings* plotted over *time*.
    render-plot:
      spec-path: ".examples/specs/movies.vl.json"

+

Snakemake workflow catalog

Conclusion

Snakemake covers all aspects of fully reproducible, transparent, and adaptable data analysis.

Datavzrd can be used to rapidly obtain visual interactive interfaces to tabular results.

https://snakemake.github.io

https://github.com/datavzrd/datavzrd

Transparency, reproducibility and the democratization of an ecosystem - the benefits of Snakemake 8

By Johannes Köster

Transparency, reproducibility and the democratization of an ecosystem - the benefits of Snakemake 8

  • 90