Johannes Köster
University of Duisburg-Essen
[Figure: from dataset to results. As the number of datasets grows: "Let me do that by hand..."]
Reproducibility
Transparency
Adaptability
>567k downloads since 2015
>1500 citations
>7 citations per week in 2021
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"

rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"

rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:
        "some-tool {input} > {output}"
rule name
how to create output from input
define
Problem:
for a given set of targets, find a composition of rules to create them
Solution:
rule all:
    input:
        "D1.sorted.txt",
        "D2.sorted.txt",
        "D3.sorted.txt"

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"

Job 1: apply rule all (a target rule that just collects results)
Job i: apply rule sort to create the i-th input of job 1
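The job derivation described above can be illustrated with a toy resolver. This is only a sketch under stated assumptions, not Snakemake's implementation: the `RULES` table and the helpers `pattern_to_regex`, `plan`, and `plan_targets` are hypothetical names, and the sketch ignores ambiguity between rules, repeated wildcards, and duplicate jobs.

```python
import re

# Hypothetical rule table mirroring rule "sort" from the slide:
# each rule maps one output pattern to a list of input patterns.
RULES = {
    "sort": {"output": "{dataset}.sorted.txt", "input": ["path/to/{dataset}.txt"]},
}


def pattern_to_regex(pattern):
    """Turn 'a/{name}.txt' into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        # Odd positions hold wildcard names, even positions hold literal text.
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    return re.compile(regex)


def plan(target, existing_files, jobs):
    """Recursively collect the (rule, wildcards) jobs needed to create target."""
    if target in existing_files:
        return  # the file is already there, no job required
    for name, rule in RULES.items():
        match = pattern_to_regex(rule["output"]).fullmatch(target)
        if match:
            wildcards = match.groupdict()
            # First plan the inputs, then the job that consumes them.
            for input_pattern in rule["input"]:
                plan(input_pattern.format(**wildcards), existing_files, jobs)
            jobs.append((name, wildcards))
            return
    raise ValueError(f"no rule produces {target!r}")


def plan_targets(targets, existing_files):
    """Plan all targets in order, as a target rule like 'all' would request them."""
    jobs = []
    for target in targets:
        plan(target, existing_files, jobs)
    return jobs


existing = {"path/to/D1.txt", "path/to/D2.txt", "path/to/D3.txt"}
targets = ["D1.sorted.txt", "D2.sorted.txt", "D3.sorted.txt"]
for rule_name, wildcards in plan_targets(targets, existing):
    print(rule_name, wildcards)
```

Each target is matched against the output pattern of rule sort, the matched wildcard value is substituted into the input pattern, and one job per dataset results.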
DATASETS = ["D1", "D2", "D3"]

rule all:
    input:
        ["{dataset}.sorted.txt".format(dataset=ds)
         for ds in DATASETS]

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"
use arbitrary Python code in your workflow
DATASETS = ["D1", "D2", "D3"]

rule all:
    input:
        expand("{dataset}.sorted.txt", dataset=DATASETS)

rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    shell:
        "sort {input} > {output}"
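To make the relation between the list comprehension and `expand` concrete, here is a plain-Python stand-in. It is a minimal sketch, not Snakemake's own code: `simple_expand` is a hypothetical helper that covers only the case of filling a pattern with every combination of the given wildcard values.

```python
from itertools import product


def simple_expand(pattern, **wildcards):
    """Fill pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    # Accept a single string as a one-element list of values.
    value_lists = [[v] if isinstance(v, str) else list(v) for v in wildcards.values()]
    return [
        pattern.format(**dict(zip(names, combination)))
        for combination in product(*value_lists)
    ]


DATASETS = ["D1", "D2", "D3"]
print(simple_expand("{dataset}.sorted.txt", dataset=DATASETS))
```

With a single wildcard this yields the same list as the comprehension on the earlier slide; with several wildcards it yields their Cartesian product.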
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"

reusable scripts:

Python:
import pandas as pd
data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")

R:
data <- read.table(snakemake@input[[1]])
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])

Rust:
import polar as pl
pl.read_csv(&snakemake.input[0])
    .sort()
    .to_csv(&snakemake.output[0])
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    notebook:
        "notebooks/mynotebook.ipynb"
rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"

reusable wrappers from a central repository
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        temp("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        protected("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        pipe("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"
configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        github("snakemake-workflows/dna-seq-gatk-variant-calling", path="workflow/Snakefile", tag="v1.17.0")
    config:
        config

use rule * from dna_seq

# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"

configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        github("snakemake-workflows/dna-seq-gatk-variant-calling", path="workflow/Snakefile", tag="v1.17.0")
    config:
        config["dna-seq"]

use rule * from dna_seq as dna_seq_*

# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"

module rna_seq:
    snakefile:
        github("snakemake-workflows/rna-seq-kallisto-sleuth", path="workflow/Snakefile", tag="v1.0.0")
    config:
        config["rna-seq"]

use rule * from rna_seq as rna_seq_*
A job is executed if and only if
determined via breadth-first search on the DAG of jobs
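A breadth-first search over a job DAG can be sketched as follows. The concrete conditions under which a job is executed are not reproduced here, so the sketch makes two loudly labeled assumptions: some set of jobs has already been flagged (`triggered`), and every job downstream of a flagged job must run as well. The function `jobs_to_run` and the job names are hypothetical.

```python
from collections import deque


def jobs_to_run(dependents, triggered):
    """Breadth-first search from the triggered jobs along downstream edges.

    dependents maps a job to the jobs that consume its output.
    Assumption: a job downstream of a job that runs has to run as well.
    """
    to_run = set(triggered)
    queue = deque(triggered)
    while queue:
        job = queue.popleft()
        for child in dependents.get(job, []):
            if child not in to_run:
                to_run.add(child)
                queue.append(child)
    return to_run


# Hypothetical DAG: three sort jobs feed one collecting job.
dag = {"sort_D1": ["all"], "sort_D2": ["all"], "sort_D3": ["all"]}
print(sorted(jobs_to_run(dag, ["sort_D2"])))
```

Because each job enters the queue at most once, the traversal visits every job and edge at most once.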
rule sort:
    input:
        "path/to/{dataset}.txt"
    output:
        "{dataset}.sorted.txt"
    priority: 1
    threads: 4  # define used threads
    resources: mem_mb=100  # define arbitrary additional resources
    shell:
        # refer to defined thread number
        "sort --parallel {threads} {input} > {output}"
[Figure: scheduling formulation, with labeled terms: job selection, job priority, job thread usage, job resource usage, free resources, job temp file consumption, temp file size, temp file lifetime fraction, temp file deletion]
--groups a=g1 b=g1
--groups a=g1 b=g1 --group-components g1=2
--groups a=g1 b=g1 --group-components g1=5
workstation
compute server
cluster
grid computing
cloud computing
[Figure: datasets and results, with shared data]
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

# envs/some-tool.yaml
channels:
  - conda-forge
dependencies:
  - some-tool =2.3.1
  - some-lib =1.1.2
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool:2.3.1"
    shell:
        "some-tool {input} > {output}"
containerized:
    "docker://username/myworkflow:1.0.0"

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

snakemake --containerize > Dockerfile
https://snakemake.github.io
├── .gitignore
├── README.md
├── LICENSE.md
├── workflow
│   ├── resources
│   │   └── someresource.bed
│   ├── rules
│   │   ├── some_task.smk
│   │   └── some_other_task.smk
│   ├── envs
│   │   ├── tool1.yaml
│   │   └── tool2.yaml
│   ├── scripts
│   │   ├── script1.py
│   │   └── script2.R
│   ├── notebooks
│   │   ├── notebook1.py.ipynb
│   │   └── notebook2.r.ipynb
│   ├── report
│   │   ├── plot1.rst
│   │   └── plot2.rst
│   └── Snakefile
├── config
│   ├── config.yaml
│   └── some-sheet.tsv
├── results
└── resources
Goal:
Help readers navigate to the parts they are interested in.
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    log:
        "logs/mytask/{dataset}.log"
    conda:
        "envs/myenv.yaml"
    shell:
        "some-tool {input} > {output} 2> {log}"