Generic solutions for specific problems: Snakemake, Datavzrd, Bioconda,
and Varlociraptor in 2024

Johannes Köster

GCB 2024

Snakemake

>1 million downloads since 2015

>3000 citations

>12 citations per week in 2023

affiliated

Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J. Sustainable data analysis with Snakemake. F1000Res 10, 33 (2021).

https://snakemake.github.io

dataset

results

dataset

https://snakemake.github.io

Define workflows

in terms of rules

https://snakemake.github.io

Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
     input:
        "result/{dataset}.txt"
     output:
        "result/{dataset}.filtered.txt"
     shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

https://snakemake.github.io

Define workflows

in terms of rules

Define workflows

in terms of rules

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    params:
        someparam=0.1
    conda:
        "envs/sometool.yaml"
    shell:
        "some-tool {params.someparam} {input} > {output}"

simple paths
helpers for branching
and aggregation
Python logic

shell commands
ad-hoc scripts and notebooks (Python, R, Julia, Rust, Bash)
wrappers

https://snakemake.github.io

deployment via

conda, containers

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
     input:
        "result/{dataset}.txt"
     output:
        "result/{dataset}.filtered.txt"
     shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Automatic inference of DAG of jobs

https://snakemake.github.io

Scalable to any platform

workstation

compute server

cluster

grid computing

cloud computing

https://snakemake.github.io

Self-contained HTML reports

https://snakemake.github.io

Snakemake 8: plugin architecture

https://snakemake.github.io

Simplicity and automation

Object oriented implementation of a few abstract methods

class Executor(RemoteExecutor):
    def run_job(self, job: JobExecutorInterface):
        ...

    async def check_active_jobs(
        self, active_jobs: List[SubmittedJobInfo]
    ) -> Generator[SubmittedJobInfo, None, None]:
        ...

    def cancel_jobs(self, active_jobs: List[SubmittedJobInfo]):
       ...

https://snakemake.github.io

Simplicity and automation

Dataclass for CLI arg registry

@dataclass
class ExecutorSettings(ExecutorSettingsBase):
    namespace: str = field(
        default="default", metadata={"help": "The namespace to use for submitted jobs."}
    )
    cpu_scalar: float = field(
        default=0.95,
        metadata={
            "help": "K8s reserves some proportion of available CPUs for its own use. "
            "So, where an underlying node may have 8 CPUs, only e.g. 7600 milliCPUs "
            "are allocatable to k8s pods (i.e. snakemake jobs). As 8 > 7.6, k8s can't "
            "find a node with enough CPU resource to run such jobs. This argument acts "
            "as a global scalar on each job's CPU request, so that e.g. a job whose "
            "rule definition asks for 8 CPUs will request 7600m CPUs from k8s, "
            "allowing it to utilise one entire node. N.B: the job itself would still "
            "see the original value, i.e. as the value substituted in {threads}."
        },
    )
    ...

https://snakemake.github.io

Simplicity and automation

Poetry template for skeleton code

# Install poetry plugin via
poetry self add poetry-snakemake-plugin

# Create a new poetry project via
poetry new snakemake-executor-plugin-myfancyexecutor

cd snakemake-executor-plugin-myfancyexecutor

# Scaffold the project as a snakemake executor plugin
poetry scaffold-snakemake-executor-plugin

https://snakemake.github.io

Simplicity and automation

Unified test suites for each plugin type

class TestWorkflows(snakemake.common.tests.TestWorkflowsLocalStorageBase):
    __test__ = True

    def get_executor(self) -> str:
        return "slurm"

    def get_executor_settings(self) -> Optional[ExecutorSettingsBase]:
        return ExecutorSettings()


class TestWorkflowsRequeue(TestWorkflows):
    def get_executor_settings(self) -> Optional[ExecutorSettingsBase]:
        return ExecutorSettings(requeue=True)

https://snakemake.github.io

Simplicity and automation

Automatic documentation

https://snakemake.github.io

Bioconda

Software installation is heterogeneous

source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")

easy_install snakemake

./configure --prefix=/usr/local
make
make install

cp lib/amd64/jli/*.so lib
cp lib/amd64/*.so lib
cp * $PREFIX

cpan -i bioperl

cmake ../../my_project \
    -DCMAKE_MODULE_PATH=~/devel/seqan/util/cmake \
    -DSEQAN_INCLUDE_PATH=~/devel/seqan/include
make
make install

apt-get install bwa

yum install python-h5py

install.packages("matrixpls")

https://bioconda.github.io

Package management with Conda

package:
  name: seqtk
  version: 1.2

source:
  fn: v1.2.tar.gz
  url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz

requirements:
  build:
    - gcc
    - zlib
  run:
    - zlib

about:
  home: https://github.com/lh3/seqtk
  license: MIT License
  summary: Seqtk is a fast and lightweight tool for processing sequences

test:
  commands:
    - seqtk seq

#!/bin/bash

export C_INCLUDE_PATH=${PREFIX}/include
export LIBRARY_PATH=${PREFIX}/lib

make all
mkdir -p $PREFIX/bin
cp seqtk $PREFIX/bin

source or binary
recipe and build script

package

install into isolated environments without admin permissions

https://bioconda.github.io

The Bioconda Project

B Grüning, R Dale, A Sjödin, BA Chapman, J Rowe, CH Tomkins-Tinch, R Valieris, ..., J Köster.
Bioconda: Sustainable and Comprehensive Software Distribution for the Life Sciences. Nature Methods 2018.

https://bioconda.github.io

a Conda channel for the life sciences

The Bioconda Project

https://bioconda.github.io

>52000

commits

>48000

pull requests

>10800

packages

ISMB Proceedings 2023

Software

Scripts

https://bioconda.github.io

Recently added:

ARM support (Apple Silicon and Linux)

Potentially up next:

WASM support

https://bioconda.github.io

Datavzrd

Tables are the central entity
in data analysis

https://datavzrd.github.io

Not always a single table

oncoprint + individual variant calls

differentially expressed genes + expression matrix

https://datavzrd.github.io

Not always just a table

oncoprint + individual variant calls +

differentially expressed genes + expression matrix +

https://datavzrd.github.io

State of the art

Individual tables (tsv, excel) and plots:

easy to publish
limited interactivity
no jumping between corresponding items

Web applications (custom, shiny, ...):

running server (or local installation)
implementation overhead
long-term maintenance is challenging

https://datavzrd.github.io

The problem

Input:

set of tables
relations between tables
set of rendering definitions

Output:

portable interactive visual presentation

https://datavzrd.github.io

datasets:
  oscars:
    path: "data/oscars.csv"
    links:
      link to oscar plot:
        column: age
        view: oscar-plot
      link to movie:
        column: movie
        table-row: movies/Title

  movies:
    path: "data/movies.csv"
    links:
      link to oscar entry:
        column: Title
        table-row: oscars/movie

views:
  oscars:
    dataset: oscars
    desc: |
      ### All winning oscars beginning in the year 1929.
      This table contains *all* winning oscars for best
      actress and best actor.
    page-size: 25
    render-table:
      columns:
        age:
          plot:
            ticks:
              scale: linear
              domain:
                - 20
                - 100
        name:
          link-to-url: "https://lmgtfy.app/?q=Is {name} in {movie}?"
        movie:
          link-to-url: "https://de.wikipedia.org/wiki/{value}"
        award:
          plot:
            heatmap:
              scale: ordinal
              domain:
                - Best actor
                - Best actress
              range:
                - "#add8e6"
                - "#ffb6c1"
        index(0):
          display-mode: hidden
        regex('birth_.+'):
          display-mode: detail

  movies:
    dataset: movies
    render-table:
      columns:
        Genre:
          ellipsis: 15
        imdbID:
          link-to-url: "https://www.imdb.com/title/{value}/"
        Title:
          link-to-url: "https://de.wikipedia.org/wiki/{value}"
        imdbRating:
          precision: 1
          plot:
            bars:
              scale: linear
              domain:
                - 1
                - 10
        Rated:
          plot-view-legend: true
          plot:
            heatmap:
              scale: ordinal
              color-scheme: accent

  oscar-plot:
    dataset: oscars
    desc: |
      ## My beautiful oscar scatter plot
      *So many great actors and actresses*
    render-plot:
      spec-path: ".examples/specs/oscars.vl.json"

  movies-plot:
    dataset: movies
    desc: |
      All movies with its *runtime* and *ratings* plotted over *time*.
    render-plot:
      spec-path: ".examples/specs/movies.vl.json"

The solution: Datavzrd

https://datavzrd.github.io

Felix Wiegand, P44

https://datavzrd.github.io

Portability

├── index.html
├── movies
│   ├── index_1.html
│   └── table.js
├── oscars
│   ├── index_1.html
│   └── table.js
├── movies-plot
│   └── index_1.html
├── oscar-plot
│   └── index_1.html
└── static
    ├── bootstrap.bundle.min.js
    ├── bootstrap.min.css
    ├── bootstrap-select.min.css
    ├── bootstrap-select.min.js
    ├── bootstrap-table-filter-control.min.js
    ├── bootstrap-table-fixed-columns.min.css
    ├── bootstrap-table-fixed-columns.min.js
    ├── bootstrap-table.min.css
    ├── bootstrap-table.min.js
    ├── datavzrd.css
    ├── jquery.min.js
    ├── jsonm.min.js
    ├── lz-string.min.js
    ├── showdown.min.js
    ├── vega-embed.min.js
    ├── vega-lite.min.js
    └── vega.min.js

static HTML, no server needed
load data via script tags
javascript compliant compression
\(\leq\) 30000 rows: in-memory filter/sort/paging
\(>\) 30000 rows: paging/search projected to filesystem structure

https://datavzrd.github.io

Datavzrd + Snakemake

rule datavzrd:
    input:
        config="resources/{sample}.datavzrd.yaml",
        table="data/{sample}.tsv",
    output:
        report(
            directory("results/datavzrd-report/{sample}"),
            htmlindex="index.html",
        ),
    wrapper:
        "v4.6.0/utils/datavzrd"

https://datavzrd.github.io

Datavzrd + Snakemake

https://datavzrd.github.io

Varlociraptor

Johannes Köster, Louis Dijkstra, Tobias Marschall, Alexander Schönhuth.
Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery. Genome Biology 21, 2020

https://varlociraptor.github.io

Traditional variant calling approach:

detection

genotyping/VAF prediction

Varlociraptor approach:

candidate variant calling

event calling

same tool

any tool

Varlociraptor

Johannes Köster, Louis Dijkstra, Tobias Marschall, Alexander Schönhuth.
Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery. Genome Biology 21, 2020

https://varlociraptor.github.io

Varlociraptor

germline: "normal:0.5 | normal:1.0"
somatic_normal: "normal:]0.0,0.5["
somatic_tumor: "normal:0.0 & tumor:]0.0,1.0]"
absent: "normal:0.0 & tumor:0.0

https://varlociraptor.github.io

Varlociraptor model building block

\(\xi_i \sim \text{Bernoulli}(\theta \tau)\)

\(\omega_i \sim Bernoulli(\pi_i)\)

\(Z_i \mid \xi_i, \omega_i=1,\beta,\delta \sim\)

\(\beta, \delta\)

allele frequency

sampling bias

allele uncertainty

biases/artifacts (strand, orientation, softclip, homopolymer, ...)

alignment uncertainty

https://varlociraptor.github.io

Variant calling grammar

species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0

samples:
  jane:
    sex: female

events:
  present: "jane:0.5 | jane:1.0"

https://varlociraptor.github.io

species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0

samples:
  jane:
    sex: female
  john:
    sex: male
  james:
    sex: male
    inheritance:
      mendelian:
        mother: jane
        father: john

events:
  john: "john:0.5 | john:1.0"
  jane: "jane:0.5 | jane:1.0"
  denovo_james: "(james:0.5 | james:1.0) & !$jane & !$john"

Variant calling grammar

https://varlociraptor.github.io

species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0

samples:
  jane:
    sex: female
    somatic-effective-mutation-rate: 1e-10
  tumor:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.1
    somatic-effective-mutation-rate: 1e-6

events:
  germline: "jane:0.5 | jane:1.0"
  somatic: "jane:]0.0,0.5["
  somatic_tumor_low: "jane:0.0 & tumor:]0.0,0.1["
  somatic_tumor_high: "jane:0.0 & tumor:[0.1,1.0]"

Variant calling grammar

https://varlociraptor.github.io

samples:
  jane:
    sex: female
    somatic-effective-mutation-rate: 1e-10
  tumor:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.1
    somatic-effective-mutation-rate: 1e-6
  relapse:
    inheritance:
      clonal:
        from: jane
    contamination:
      by: jane
      fraction: 0.2
    somatic-effective-mutation-rate: 1e-6

expressions:
  somatic_tumor: "jane:0.0 & tumor:]0.0,1.0]"

events:
  germline: "jane:0.5 | jane:1.0"
  somatic: "jane:]0.0,0.5["
  somatic_tumor_no_increase: "$somatic_tumor & l2fc(relapse,tumor) < 1"
  somatic_tumor_increase: "$somatic_tumor & l2fc(relapse,tumor) >= 1"
  somatic_relapse: "jane:0.0 & tumor:0.0 & relapse:]0.0,1.0]"

Variant calling grammar

https://varlociraptor.github.io

species:
  heterozygosity: 0.001
  ploidy:
    male:
      all: 2
      X: 1
      Y: 1
    female:
      all: 2
      X: 2
      Y: 0

samples:
  dna_illumina:
    sex: female
  dna_nanopore:
    inheritance:
      clonal:
        from: dna_illumina
  rna_illumina:
    universe: [0.0,1.0]

events:
  het: "dna_illumina:0.5 & dna_nanopore:0.5 & rna_illumina:]0.0,1.0]"
  hom: "dna_illumina:1.0 & dna_nanopore:1.0 & rna_illumina:1.0"
  rna_editing: "dna_illumina:0.0 & dna_nanopore:0.0 & rna_illumina:]0.0,1.0]"

Variant calling grammar

https://varlociraptor.github.io

Long read support

https://varlociraptor.github.io

Discovering nucleotide level and structural variants in cancer genome data from second and third generation sequencing technologies. Till Hartmann, PhD thesis 2024

Continuous testing

varlociraptor call variants --testcase-prefix testcase --testcase-locus CHROM:POS generic \
  --scenario scenario.yaml --obs tumor=tumor.bcf normal=normal.bcf

Automatic test case generation:

145

public testcases (simulated + real benchmarks)

private testcases
(clinical)

https://varlociraptor.github.io

Reporting with Datavzrd

https://varlociraptor.github.io

Beyond just variant calling

Haplotype quantification:

utilizing Varlociraptor's using posterior allele frequency distributions [Hamdiye Uzuner, talk today]

Methylation calling:

joint consideration of methylation and variation, any sequencing technology [Adrian Prinz, P40]

Optical mapping evidence:

extension of statistical model to consider optical mapping label positions to determine SV allele likelihoods [Laura Kühle]

Phased variant impact prediction:

leveraging matching between Varlociraptor observations and variants [Felix Wiegand]

Fusion calling:

using generic breakends [Felix Mölder]

Acknowledgements

Koesterlab

Laura Kühle

David Lähnemann

Felix Mölder

Adrian Prinz

Hamdiye Uzuner

Felix Wiegand

Can Özkan

Andrea Tonk

Till Hartmann (alumn.)

Snakemake Community

Christian Meesters

Michael B. Hall

Filipe G. Vieira

Morten E. Lund

Vanessa Sochat

Alexaner Kanitz

Kim-Phillip Jablonski

Brice Letcher

Michael B. Hall

Chris Tomkins-Tinch

Sven O. Twardziok

Manuel Holtgrewe

+ hundreds of contributors

Bioconda Community

Björn Grüning

Ryan Dale

Marcell Bargull

Brad Chapman

Chris Tomkins-Tinch

Andreas Sjödin

Devon Ryan

Elmar Pruesse

Robert A. Petit III

Christian Brueffer

Alicia Evans

+ hundreds of contributors

Others

Sven Rahmann

Alex Schönhuth

Shirley Liu

Myles Brown

Henry Long

Louis Dijkstra

Tobias Marschall

Marcel Martin

Sven Nahnsen

Alex Schramm

Ina Pretzell

Michael Wessolly

Thomas Herold

Martin Schuler

all users of Snakemake, Bioconda, Datavzrd, and Varlociraptor

Snakemake hackathon 2025

March 10-14 at

Work on:

plugins
wrappers
workflows
internals

Visit LHC:

And shape the future of the ecosystem!

Registration will open soon, email me to get a notification: johannes.koester@uni-due.de

free lodging sponsored by FAIROS-HEP!

Keynote GCB 2024

By Johannes Köster

Keynote GCB 2024

7 months ago
219

Generic solutions for specific problems: Snakemake, Datavzrd, Bioconda, and Varlociraptor in 2024

Snakemake

Snakemake

Define workflows

in terms of rules

Define workflows

in terms of rules

Define workflows

in terms of rules

Define workflows

in terms of rules

Automatic inference of DAG of jobs

Scalable to any platform

Self-contained HTML reports

Snakemake 8: plugin architecture

Simplicity and automation

Simplicity and automation

Simplicity and automation

Simplicity and automation

Simplicity and automation

Bioconda

Software installation is heterogeneous

Package management with Conda

The Bioconda Project

The Bioconda Project

ISMB Proceedings 2023

Datavzrd

Tables are the central entity in data analysis

Not always a single table

Not always just a table

State of the art

The problem

The solution: Datavzrd

Portability

Datavzrd + Snakemake

Datavzrd + Snakemake

Varlociraptor

Varlociraptor

Varlociraptor

Varlociraptor

Varlociraptor model building block

Variant calling grammar

Variant calling grammar

Variant calling grammar

Variant calling grammar

Variant calling grammar

Long read support

Continuous testing

Reporting with Datavzrd

Beyond just variant calling

Acknowledgements

Snakemake hackathon 2025

Keynote GCB 2024

More from Johannes Köster

Generic solutions for specific problems: Snakemake, Datavzrd, Bioconda,
and Varlociraptor in 2024

Tables are the central entity
in data analysis