Snakemake Tutorial

Johannes Köster

 

2024

https://koesterlab.github.io

[Figure: data analysis turns a dataset into results — "Let me do that by hand..."]

[Figure: with many datasets, "Let me do that by hand..." no longer works]

Reproducible data analysis

automation

From raw data to final figures:

  • document parameters, tools, versions
  • execute without manual intervention

Reproducible data analysis

automation, scalability

Handle parallelization:

  • execute for tens to thousands of datasets
  • efficiently use any computing platform

Beyond reproducibility

automation, scalability, portability

Handle deployment:

  • be able to easily execute analyses on a different system/platform/infrastructure

Snakemake

  • >370k downloads since 2015
  • >1300 citations
  • >7 citations per week in 2021

Part 1

Basics


Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Define workflows

in terms of rules

rule sort:
    input:
        "path/to/dataset.txt"
    output:
        "dataset.sorted.txt"
    shell:
        "sort {input} > {output}"

rule name

refer to input and output from shell command

how to create output from input

Define workflows

in terms of rules

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

generalize rules with

named wildcards

Define workflows

in terms of rules

rule sort_and_annotate:
    input:
        "path/to/{dataset}.txt",
        "path/to/annotation.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "paste <(sort {input[0]}) {input[1]} > {output}"

multiple input or output files

refer by index

Define workflows

in terms of rules

rule sort_and_annotate:
    input:
        a="path/to/{dataset}.txt",
        b="path/to/annotation.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "paste <(sort {input.a}) {input.b} > {output}"

name input and output files

refer by name

Define workflows

in terms of rules

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    run:
        with open(output[0], "w") as out:
            for line in sorted(open(input[0])):
                print(line, end="", file=out)

use Python in rules

External scripts

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    script:
        "scripts/myscript.py"

refer to scripts:

  • Python
  • R
  • Julia
  • Rust
  • Bash

External scripts

Python scripts:

import pandas as pd

data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")

External scripts

R scripts:

data <- read.table(snakemake@input[[1]])
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])

Jupyter notebooks

rule plot_histogram:
    input:
        "path/to/{dataset}.txt"
    output:
        "plots/{dataset}.hist.pdf"
    notebook:
        "notebooks/plot-histogram.py.ipynb"

draft, update or reuse notebook
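
Inside the notebook, rule information is available via a snakemake object, just as in external scripts; a minimal sketch of a cell (the plotted "value" column is illustrative):

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_table(snakemake.input[0])
data["value"].plot.hist()           # "value" is a hypothetical column
plt.savefig(snakemake.output[0])

The notebook can also be drafted or updated interactively with snakemake --edit-notebook <target>.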

Dependencies are determined top-down

Problem: for a given set of targets, find a composition of rules to create them.

Solution:

  • For a given target, find a rule that can be applied to create it (a job).
  • For the input files of the job, go on recursively.
  • If the command specifies no target, Snakemake tries to apply the first rule in the workflow.

Dependencies are determined top-down




rule all:
    input:
        "D1.sorted.txt",
        "D2.sorted.txt",
        "D3.sorted.txt"


rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Job 1: apply rule all

(a target rule that just collects results)

Job i: apply rule sort to create i-th input of job 1

Dependencies are determined top-down

DATASETS = ["D1", "D2", "D3"]


rule all:
    input:
        ["{dataset}.sorted.txt".format(dataset=ds)
         for ds in DATASETS]



rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Job 1: apply rule all

(a target rule that just collects results)

Job i: apply rule sort to create i-th input of job 1

use arbitrary Python code in your workflow

Dependencies are determined top-down

DATASETS = ["D1", "D2", "D3"]


rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=DATASETS)




rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Job 1: apply rule all

(a target rule that just collects results)

Job i: apply rule sort to create i-th input of job 1

use arbitrary Python code in your workflow

Directed acyclic graph (DAG) of jobs

Job execution

A job is executed if and only if any of these conditions applies:

  • output file is target and does not exist
  • output file needed by another executed job and does not exist
  • input file newer than output file
  • rule has been modified
  • input file will be updated by other job
  • execution is enforced

checked with breadth-first-search on DAG of jobs
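
Enforced execution, for example, is requested on the command line; the target name below is the one from the earlier sort example:

# force re-execution of the job producing a given target
snakemake D1.sorted.txt --cores 1 --force

# force re-execution of all jobs, regardless of whether outputs are up to date
snakemake --cores 1 --forceall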

Command line interface

Assumption: rules are defined in a file called Snakefile in the current directory or in a workflow subdirectory.

# execute the workflow with target D1.sorted.txt
snakemake D1.sorted.txt --cores 1

# execute the workflow without target: first rule defines target
snakemake --cores 2

# dry-run
snakemake -n

# dry-run, print shell commands
snakemake -n -p

# dry-run, print execution reason for each job
snakemake -n -r

# visualize the DAG of jobs using the Graphviz dot command
snakemake --dag | dot -Tsvg > dag.svg

Part 2

Advanced

https://doi.org/10.12688/f1000research.29032.1

Parallelization

Disjoint paths in the DAG of jobs can be executed in parallel.

# execute the workflow with 8 cores
snakemake --cores 8

execute 8 jobs in parallel?

Defining resources

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output}"

refer to defined thread number

define arbitrary additional resources

define used threads

Defining resources

With the same rule (threads: 4, resources: mem_mb=100):

# execute the workflow with 8 cores
snakemake --cores 8
# -> can execute 2 sort jobs in parallel

# execute the workflow with 2 cores
snakemake --cores 2
# -> can execute 1 sort job in parallel (automatically using 2 threads)

# execute the workflow with 10 cores
snakemake --cores 10
# -> can execute 2 sort jobs in parallel

# execute the workflow with 10 cores and 100MB memory
snakemake --cores 10 --resources mem_mb=100
# -> can execute 1 sort job in parallel

Scheduling

Available jobs are scheduled to

  • maximize parallelization
  • prefer high priority jobs
  • minimize temp file lifetime

while satisfying resource constraints.
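
Priorities and temp files are declared directly in rules; a minimal sketch, with a hypothetical rule and tool name:

rule convert:
    input:
        "raw/{dataset}.txt"
    output:
        # temp(): deleted as soon as all consuming jobs have finished
        temp("converted/{dataset}.txt")
    # preferred over jobs with the default priority of 0
    priority: 50
    shell:
        "convert-tool {input} > {output}"  # convert-tool is illustrative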

\max \; U_t \cdot 2S \cdot \sum_{j \in J} x_j \cdot p_j \;+\; 2S \cdot \sum_{j \in J} x_j \cdot u_{t,j} \;+\; S \cdot \sum_{f \in F} \gamma_f \cdot S_f \;+\; \sum_{f \in F} \delta_f \cdot S_f

\text{subject to:}

\sum_{j \in J} x_j \cdot u_{r,j} \leq U_r \quad \forall r \in R

\delta_f \leq \frac{\sum_{j \in J} x_j \cdot z_{f,j}}{\sum_{j \in J} z_{f,j}} \quad \forall f \in F

\gamma_f \leq \delta_f \quad \forall f \in F

x_j \in \{0,1\}, \quad \gamma_f \in \{0,1\}, \quad \delta_f \in [0,1]

where

  • x_j: job selection
  • p_j: job priority
  • u_{t,j}: job thread usage
  • u_{r,j}: job resource usage
  • U_r: free resources
  • z_{f,j}: job temp file consumption
  • S_f: temp file size
  • δ_f: temp file lifetime fraction
  • γ_f: temp file deletion

Scalable to any platform

workstation

compute server

cluster

grid computing

cloud computing

Config files

configfile: "config.yaml"


rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=config["datasets"])


rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output}"

define config file

refer to config values
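
The referenced config.yaml then only needs to list the datasets; a minimal sketch with illustrative names:

# config.yaml
datasets:
  - D1
  - D2
  - D3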

Input functions

Workflows are executed in three phases

  • initialization phase (parsing)
  • DAG phase (DAG is built)
  • scheduling phase (execution of DAG)

Input functions defer determination of input files to the DAG phase

(when wildcard values are known).

Input functions

configfile: "config.yaml"


rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=config["datasets"])


def get_sort_input(wildcards):
    return config["datasets"][wildcards.dataset]


rule sort:
    input:
        get_sort_input
    output:
        "{dataset}.sorted.txt"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output}"
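
Here the config maps each dataset name to its actual input path (so the path no longer has to follow a single pattern); a sketch of a matching config.yaml with illustrative paths:

# config.yaml
datasets:
  D1: path/to/D1.txt
  D2: data/elsewhere/D2.txt
  D3: /absolute/path/to/D3.txt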

New in Snakemake 8: semantic helper functions

rule mytask:
    input:
        branch(
            lookup(dpath="switches/mark_duplicates", within=config),
            then="results/mapped/{sample}.bam",
            otherwise="results/mapped/{sample}.rmdup.bam",
        )
    output:
        "results/{sample}.txt"
    script:
        "scripts/somestep.py"

Logging

configfile: "config.yaml"


rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=config["datasets"])


def get_sort_input(wildcards):
    return config["datasets"][wildcards.dataset]


rule sort:
    input:
        get_sort_input
    output:
        "{dataset}.sorted.txt"
    log:
        "logs/sort/{dataset}.log"
    threads: 4
    resources: mem_mb=100
    shell:
        "sort --parallel {threads} {input} > {output} 2> {log}"

define log file

refer to log file from shell command

Cluster/cloud execution

Configuration profiles

snakemake --profile lsf --jobs 1000

$HOME/.config/snakemake/lsf
├── config.yaml
├── jobscript.sh
├── status.py
└── submit.py
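
The profile's config.yaml simply stores command line options as YAML keys (classic cluster mode shown here); a minimal sketch with illustrative values:

# $HOME/.config/snakemake/lsf/config.yaml
jobs: 1000
use-conda: true
cluster: "submit.py"            # wraps bsub
cluster-status: "status.py"     # reports job state back to Snakemake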

Reproducible software installation


Full reproducibility:

install required software and all dependencies in exact versions

Software installation is heterogeneous

source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
easy_install snakemake
./configure --prefix=/usr/local
make
make install
cp lib/amd64/jli/*.so lib
cp lib/amd64/*.so lib
cp * $PREFIX
cpan -i bioperl
cmake ../../my_project \
    -DCMAKE_MODULE_PATH=~/devel/seqan/util/cmake \
    -DSEQAN_INCLUDE_PATH=~/devel/seqan/include
make
make install
apt-get install bwa
yum install python-h5py
install.packages("matrixpls")

Package management with Conda

package:
  name: seqtk
  version: 1.2

source:
  fn: v1.2.tar.gz
  url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz

requirements:
  build:
    - gcc
    - zlib
  run:
    - zlib

about:
  home: https://github.com/lh3/seqtk
  license: MIT License
  summary: Seqtk is a fast and lightweight tool for processing sequences

test:
  commands:
    - seqtk seq

Idea:

Normalize installation via recipes

#!/bin/bash

export C_INCLUDE_PATH=${PREFIX}/include
export LIBRARY_PATH=${PREFIX}/lib

make all
mkdir -p $PREFIX/bin
cp seqtk $PREFIX/bin

  • source or binary
  • recipe and build script
  • package

Easy installation and management:

no admin rights needed

conda install pandas

conda update pandas

conda remove pandas

conda env create -f myenv.yaml -n myenv

Isolated environments:

channels:
  - conda-forge
  - nodefaults
dependencies:
  - pandas ==0.20.3
  - statsmodels ==0.8.0
  - r-dplyr ==0.7.0
  - r-base ==3.4.1
  - python ==3.6.0

Package management with Bioconda

  • Over 8000 bioinformatics related packages (C, C++, Python, R, Perl, ...).
  • Over 140 million downloads.
  • Over 650 contributors.

Partner project for general purpose software: conda-forge

Integration with Snakemake

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/mycommand.yaml"
    shell:
        "mycommand {input} > {output}"

# envs/mycommand.yaml
channels:
  - bioconda
  - conda-forge
  - nodefaults
dependencies:
  - mycommand ==2.3.1

# automatic deployment of dependencies
snakemake --use-conda

Integrated with popular workflow management systems

Conda Integration with WMS

Containers

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool#2.3.1"
    shell:
        "some-tool {input} > {output}"

Containerization

containerized:
    "docker://username/myworkflow:1.0.0"


rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"
snakemake --containerize > Dockerfile

Self-contained HTML reports
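
Results are marked for inclusion with report() and the report is rendered from the command line; a minimal sketch with illustrative file names:

rule plot:
    input:
        "results/data.txt"
    output:
        report("plots/myplot.pdf", caption="report/myplot.rst")
    script:
        "scripts/myplot.R"

# render a self-contained HTML report after the workflow has run
snakemake --report report.html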

Wrappers

rule samtools_sort:
    input:
        "mapped/{sample}.bam"
    output:
        "mapped/{sample}.sorted.bam"
    params:
        "-m 4G"
    threads: 8
    wrapper:
        "0.2.0/bio/samtools/sort"
# automatic deployment of dependencies
snakemake --use-conda
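
Under the hood, a wrapper is a small script plus a conda environment stored in the wrapper repository; a rough sketch of such a script (the exact samtools invocation is illustrative, not copied from the real wrapper):

# wrapper.py (sketch)
from snakemake.shell import shell

shell(
    "samtools sort {snakemake.params} -@ {snakemake.threads} "
    "-o {snakemake.output[0]} {snakemake.input[0]}"
)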


Distribution of workflows

Git repository with

├── .gitignore
├── README.md
├── LICENSE.md
├── config
│   └── config.yaml
├── workflow
│   ├── envs
│   │   ├── env1.yaml
│   │   └── env2.yaml
│   ├── report
│   │   ├── fig1.rst
│   │   ├── fig2.rst
│   │   └── workflow.rst
│   ├── rules
│   │   ├── rules1.smk
│   │   └── rules2.smk
│   ├── scripts
│   │   ├── script1.py
│   │   └── script2.R
│   ├── notebooks
│   │   ├── notebook1.py.ipynb
│   │   └── notebook2.r.ipynb
│   └── Snakefile
├── resources
└── results
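
The top-level Snakefile typically just ties these parts together; a minimal sketch matching the layout above (the final target is illustrative):

# workflow/Snakefile
configfile: "config/config.yaml"

include: "rules/rules1.smk"
include: "rules/rules2.smk"

rule all:
    input:
        "results/final_output.txt"  # hypothetical final target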

Using and combining workflows

configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        "https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/raw/v2.0.1/Snakefile"
    config:
        config

use rule * from dna_seq

# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"

Using and combining workflows

configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        "https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/raw/v2.0.1/Snakefile"
    config:
        config["dna-seq"]

use rule * from dna_seq as dna_seq_*

# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"
module rna_seq:
    snakefile:
        "https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/raw/v1.0.0/Snakefile"
    config:
        config["rna-seq"]

use rule * from rna_seq as rna_seq_*

Many additional features

  • piping between jobs
  • group jobs
  • handling of temporary and protected files
  • tracking of parameter and code changes
  • remote file support (S3/Google, Dropbox, HTTPS, FTP, ...)
  • dynamic conditional execution depending on output
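
Several of these are plain rule annotations; a sketch combining a few of them, with illustrative rule and tool names:

rule bin:
    input:
        "results/{sample}.txt"
    output:
        # pipe(): streamed to the consumer instead of being written as a regular file
        pipe("results/{sample}.binned.txt")
    group: "binning"  # run in the same group job as the consumer
    shell:
        "bin-tool {input} > {output}"  # bin-tool is illustrative

rule compress:
    input:
        "results/{sample}.binned.txt"
    output:
        # protected(): write-protected once created
        protected("results/{sample}.binned.txt.gz")
    group: "binning"
    shell:
        "gzip -c {input} > {output}"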

Acknowledgements

Contributors:

Andreas Wilm

Anthony Underwood

Ryan Dale

David Alexander

Elias Kuthe

Elmar Pruesse

Hyeshik Chang

Jay Hesselberth

Jesper Foldager

John Huddleston

all users and supporters

Joona Lehtomäki

Justin Fear

Karel Brinda

Karl Gutwin

Kemal Eren

Kostis Anagnostopoulos

Kyle A. Beauchamp

Simon Ye

Tobias Marschall

Willem Ligtenberg

Development team:

Christopher Tomkins-Tinch

David Koppstein

Tim Booth

Manuel Holtgrewe

Christian Arnold

Wibowo Arindrarto

Rasmus Ågren

 

Kyle Meyer

Lance Parsons

Manuel Holtgrewe

Marcel Martin

Matthew Shirley

Mattias Franberg

Matt Shirley

Paul Moore

percyfal

Per Unneberg

 

Ryan C. Thompson

Ryan Dale

Sean Davis

Resources

Rolling paper:

https://doi.org/10.12688/f1000research.29032.1

Documentation and change log:

https://snakemake.readthedocs.io

Questions:

http://stackoverflow.com/questions/tagged/snakemake

Wrappers:

https://snakemake-wrappers.readthedocs.io

Gold standard workflows:

https://github.com/snakemake-workflows/docs

Workflow catalog:

https://snakemake.github.io/snakemake-workflow-catalog

Configuration profiles:

https://github.com/snakemake-profiles/doc

Code linting:

snakemake --lint

Code formatting:

https://github.com/snakemake/snakefmt

Command line help:

snakemake --help

Editors:

vscode, pycharm, theia, vim, emacs