Reproducible, transparent, and adaptable data analysis with Snakemake and Datavzrd

Johannes Köster

 

2023

dataset

results

Data analysis

"Let me do that by hand..."

dataset

results

dataset

dataset

dataset

dataset

dataset

"Let me do that by hand..."

Data analysis

  • check computational validity
  • apply same analysis to new data
  • check methodological validity
  • understand analysis

Data analysis

Reproducibility

Transparency

  • modify analysis
  • extend analysis

Adaptability

>700k downloads since 2015

>2000 citations

>10 citations per week in 2022

  • automation
  • scalability
  • portability
  • readability
  • documentation
  • traceability

Data analysis

Reproducibility

Transparency

  • readability
  • portability
  • scalability

Adaptability

Data analysis

  • automation
  • scalability
  • portability
  • readability
  • documentation
  • traceability

Reproducibility

Transparency

  • readability
  • portability
  • scalability

Adaptability

dataset

results

dataset

dataset

dataset

dataset

dataset

Define workflows

in terms of rules

Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
     input:
        "result/{dataset}.txt"
     output:
        "result/{dataset}.filtered.txt"
     shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Define workflows

in terms of rules

Define workflows

in terms of rules

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:
        "some-tool {input} > {output}"

rule name

how to create output from input

define

  • input
  • output
  • log files
  • parameters
  • resources
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
     input:
        "result/{dataset}.txt"
     output:
        "result/{dataset}.filtered.txt"
     shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Define workflows

in terms of rules

Boilerplate-free integration of scripts

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"

reusable scripts:

  • Python
  • R
  • Julia
  • Rust
  • Bash
import pandas as pd

data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")

Python:

data <- read.table(snakemake@input[[1]])
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])

Boilerplate-free integration of scripts

R:

import polar as pl

pl.read_csv(&snakemake.input[0])
  .sort()
  .to_csv(&snakemake.output[0])

Rust:

Jupyter notebook integration

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    notebook:
        "notebooks/mynotebook.ipynb"
  1. Integrated interactive edit mode.
  2. Automatic generalization for reuse in other jobs.

Reusable wrappers

rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"

reuseable wrappers from central repository

Reusable wrappers

Output handling

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        temp("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

Output handling

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        protected("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

Output handling

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        pipe("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

Using and combining workflows

configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        github("snakemake-workflows/dna-seq-gatk-variant-calling", path="workflow/Snakefile", tag="v1.17.0")
    config:
        config

use rule * from dna_seq
# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"

Using and combining workflows

configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        github("snakemake-workflows/dna-seq-gatk-variant-calling", path="workflow/Snakefile", tag="v1.17.0")
    config:
        config["dna-seq"]

use rule * from dna_seq as dna_seq_*
# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"
module rna_seq:
    snakefile:
        github("snakemake-workflows/rna-seq-kallisto-sleuth", path="workflow/Snakefile", tag="v1.0.0")
    config:
        config["rna-seq"]

use rule * from rna_seq as rna_seq_*

Data analysis

  • automation
  • scalability
  • portability
  • readability
  • documentation
  • traceability

Reproducibility

Transparency

  • readability
  • portability
  • scalability

Adaptability

\max U_t \cdot 2S \cdot \sum_{j \in J} x_j \cdot p_j + 2S \cdot \sum_{j \in J} x_j \cdot (u_{t,j}) + S \cdot \sum_{f \in F} \gamma_f \cdot S_f\\ + \sum_{f \in F} \delta_f \cdot S_f
\sum_{j \in J} x_j \cdot u_{r,j} \leq U_r \quad \forall r \in R
\delta_f \leq \frac{\sum_{j \in J} x_j \cdot z_{f,j}}{\sum_{j \in J} z_{f,j}} \quad\forall f \in F
\text{subject to:}

job selection

job resource usage

free resources

job temp file consumption

temp file lifetime fraction

job priority

job thread usage

Scheduling

temp file size

temp file deletion

\gamma_f \leq \delta_f \quad\forall f \in F
\gamma_f \in \{0,1\}
\delta_f \in [0,1]
x_f \in \{0,1\}

DAG partitioning

--groups a=g1 b=g1
--groups a=g1 b=g1
--group-components g1=2
--groups a=g1 b=g1
--group-components g1=5

Scalable to any platform

workstation

compute server

cluster

grid computing

cloud computing

Between workflow caching

dataset

results

dataset

dataset

dataset

dataset

dataset

shared data

Between workflow caching

Data analysis

  • automation
  • scalability
  • portability
  • readability
  • documentation
  • traceability

Reproducibility

Transparency

  • readability
  • portability
  • scalability

Adaptability

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

Conda integration

channels:
 - conda-forge
dependencies:
  - some-tool =2.3.1
  - some-lib =1.1.2

Container integration

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool#2.3.1"
    shell:
        "some-tool {input} > {output}"

Containerization

containerized:
    "docker://username/myworkflow:1.0.0"


rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"
snakemake --containerize > Dockerfile

Data analysis

  • automation
  • scalability
  • portability
  • readability
  • documentation
  • traceability

Reproducibility

Transparency

  • readability
  • portability
  • scalability

Adaptability

Self-contained HTML reports

Datavzrd: low-code

interactive result views

Input:

  • set of tables
  • relations between tables
  • set of rendering definitions

Output:

  • portable interactive visual presentation
  • static printable version
name: My oscar report
default-view: oscars

datasets:
  oscars:
    path: "data/oscars.csv"
    links:
      link to oscar plot:
        column: age
        view: oscar-plot
      link to movie:
        column: movie
        table-row: movies/Title

  movies:
    path: "data/movies.csv"
    links:
      link to oscar entry:
        column: Title
        table-row: oscars/movie

views:
  oscars:
    dataset: oscars
    desc: |
      ### All winning oscars beginning in the year 1929.
      This table contains *all* winning oscars for best actress and best actor.
    page-size: 25
    render-table:
      columns:
        age:
          plot:
            ticks:
              scale: linear
              domain:
                - 20
                - 100
        name:
          link-to-url: "https://lmgtfy.app/?q=Is {name} in {movie}?"
        movie:
          link-to-url: "https://de.wikipedia.org/wiki/{value}"
        award:
          plot:
            heatmap:
              scale: ordinal
              domain:
                - Best actor
                - Best actress
              range:
                - "#add8e6"
                - "#ffb6c1"
        index(0):
          display-mode: hidden
        regex('birth_.+'):
          display-mode: detail

  movies:
    dataset: movies
    render-table:
      columns:
        Genre:
          ellipsis: 15
        imdbID:
          link-to-url: "https://www.imdb.com/title/{value}/"
        Title:
          link-to-url: "https://de.wikipedia.org/wiki/{value}"
        imdbRating:
          precision: 1
          plot:
            bars:
              scale: linear
              domain:
                - 1
                - 10
        Rated:
          plot-view-legend: true
          plot:
            heatmap:
              scale: ordinal
              color-scheme: accent

  oscar-plot:
    dataset: oscars
    desc: |
      ## My beautiful oscar scatter plot
      *So many great actors and actresses*
    render-plot:
      spec-path: ".examples/specs/oscars.vl.json"

  movies-plot:
    dataset: movies
    desc: |
      All movies with its *runtime* and *ratings* plotted over *time*.
    render-plot:
      spec-path: ".examples/specs/movies.vl.json"

+

Snakemake workflow catalog

Conclusion

Snakemake covers all aspects of fully reproducible, transparent, and adaptable data analysis.

Datavzrd can be used to rapidly obtain visual interactive interfaces to tabular results.

https://snakemake.github.io

https://github.com/datavzrd/datavzrd

Snakemake 8.0

transforming towards a plugin architecture

  • executors
  • remote storage
  • scheduler
  • script/notebook support
  • DAG inference

 

backed by

  • stable interface packages with independent versioning

 

developed by

  • community
  • individuals
  • companies

Snakemake 8.0

language updates

  • decouple language version from Snakemake runtime
  • allow specification of language version in the workflow definition
  • allow specification of auxilliary workflow dependencies in the Snakefile
  • introduce semantic helper functions for branching, collecting, value lookups, evaluation of wildcards
    • more self contained rules
    • much less Python code
    • increased readability
import pandas as pd

samples = pd.read_csv("samples.tsv", sep="\t")


configfile: "config.yaml"


rule all:
    input:
        "results/switch~someswitch.column~sample.txt",


rule a:
    output:
        "a/{sample}.txt",
    shell:
        "echo a > {output}"


rule b:
    input:
        branch(evaluate("{sample} == '100'"), then="a/{sample}.txt"),
    output:
        "b/{sample}.txt",
    shell:
        "echo b > {output}"


rule c:
    input:
        branch(
            evaluate("{sample} == '1'"),
            then="a/{sample}.txt",
            otherwise="b/{sample}.txt",
        ),
    output:
        "c/{sample}.txt",
    shell:
        "cat {input} > {output}"


rule d:
    output:
        "test.txt",
    shell:
        "echo d > {output}"


rule e:
    input:
        collect("c/{item.sample}.txt", item=lookup(query="{col} <= 2", within=samples)),
        branch(lookup(dpath="switches/{switch}", within=config), then="test.txt"),
    output:
        "results/switch~{switch}.column~{col}.txt",
    shell:
        "cat {input} > {output}"

Sustainable data analysis with Snakemake and Datavzrd

By Johannes Köster

Sustainable data analysis with Snakemake and Datavzrd

Non-bio-related talk about Snakemake

  • 299