Johannes Köster
2024
[Figure: many datasets to be turned into results by hand — "Let me do that by hand..."]
Reproducibility
Transparency
Adaptability
>1 million downloads since 2015
>2700 citations
>11 citations per week in 2023
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "results/{dataset}.txt"
    script:
        "scripts/myscript.R"

rule myfiltration:
    input:
        "results/{dataset}.txt"
    output:
        "results/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"

rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:
        "some-tool {input} > {output}"
Each rule has a name, declares its input and output files, and defines how to create the output from the input.
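The curly-brace placeholders such as {sample} are wildcards: when a concrete output file is requested, Snakemake infers their values by matching the file path against the rule's output pattern. A rough sketch of that matching (illustrative only, not Snakemake's actual implementation; `match_wildcards` is a hypothetical name):

```python
import re

def match_wildcards(pattern, path):
    # Turn each {name} placeholder into a named regex group and match
    # the requested path against the output pattern.
    regex = re.sub(r"\{([^}]+)\}", r"(?P<\1>.+)", pattern)
    m = re.fullmatch(regex, path)
    return m.groupdict() if m else None

print(match_wildcards("result/{sample}.txt", "result/A.txt"))  # {'sample': 'A'}
print(match_wildcards("result/{sample}.txt", "plots/x.pdf"))   # None
```

Once the wildcard values are known, Snakemake propagates them into the rule's input, so the dependency chain can be resolved backwards from the requested result.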
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"
Reusable scripts:

Python:
    import pandas as pd
    data = pd.read_table(snakemake.input[0])
    data = data.sort_values("id")
    data.to_csv(snakemake.output[0], sep="\t")

R:
    data <- read.table(snakemake@input[[1]])
    data <- data[order(data$id),]
    write.table(data, file = snakemake@output[[1]])

Rust (polars; sketched, the exact API varies by version):
    let mut data = CsvReader::from_path(&snakemake.input[0])?.finish()?;
    data = data.sort(["id"], false)?;
    CsvWriter::new(File::create(&snakemake.output[0])?).finish(&mut data)?;
rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"

Reusable wrappers from a central repository.
Reproducibility
Transparency
Adaptability
job selection
job resource usage
free resources
job temp file consumption
temp file lifetime fraction
job priority
job thread usage
temp file size
temp file deletion
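Snakemake's scheduler weighs these quantities when picking which jobs to run next. As a rough illustration, here is a greedy, resource-constrained selection by priority; this is a hypothetical simplification (Snakemake actually solves an ILP that also accounts for temp-file sizes, lifetimes, and deletion):

```python
def select_jobs(candidate_jobs, free_resources):
    # Greedy sketch: schedule the highest-priority jobs that still fit
    # into the free resources. Hypothetical, not Snakemake internals.
    scheduled = []
    for job in sorted(candidate_jobs, key=lambda j: j["priority"], reverse=True):
        needs = job["resources"]
        if all(needs.get(r, 0) <= free_resources[r] for r in free_resources):
            scheduled.append(job["name"])
            for r in free_resources:
                free_resources[r] -= needs.get(r, 0)
    return scheduled

jobs = [
    {"name": "a", "priority": 1, "resources": {"threads": 4}},
    {"name": "b", "priority": 3, "resources": {"threads": 6}},
    {"name": "c", "priority": 2, "resources": {"threads": 4}},
]
print(select_jobs(jobs, {"threads": 10}))  # ['b', 'c']
```

With 10 free threads, the two highest-priority jobs fit and the third must wait for resources to be released.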
--groups a=g1 b=g1
--groups a=g1 b=g1 --group-components g1=2
--groups a=g1 b=g1 --group-components g1=5
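The effect of --group-components can be pictured as simple batching: every n connected components of a group are merged into one cluster/cloud submission. A hypothetical sketch (the function and its list-of-components input are illustrative, not Snakemake internals):

```python
def group_components(components, components_per_group):
    # Merge every `components_per_group` connected components of a group
    # into a single batch (one submission each).
    return [components[i:i + components_per_group]
            for i in range(0, len(components), components_per_group)]

# five per-dataset a->b chains, two components per group job
batches = group_components(["ds1", "ds2", "ds3", "ds4", "ds5"], 2)
print(len(batches))  # 3
```

So with g1=2 the five components above would need three submissions, and with g1=5 only one.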
workstation
compute server
cluster
grid computing
cloud computing
Reproducibility
Transparency
Adaptability
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"
channels:
  - conda-forge
dependencies:
  - some-tool =2.3.1
  - some-lib =1.1.2
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool:2.3.1"
    shell:
        "some-tool {input} > {output}"
Reproducibility
Transparency
Adaptability
def get_threshold(wildcards):
    return config["some_tool"]["thresholds"].get(wildcards.dataset, 0.1)

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    params:
        some_threshold=get_threshold
    shell:
        "some-tool --threshold {params.some_threshold} {input} > {output}"
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    params:
        some_threshold=lookup(
            dpath="some_tool/thresholds/{dataset}",
            within=config,
            default=0.1
        )
    shell:
        "some-tool --threshold {params.some_threshold} {input} > {output}"
def get_threshold(wildcards):
    return sheet.loc[wildcards.dataset, "threshold"]

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    params:
        some_threshold=get_threshold
    shell:
        "some-tool --threshold {params.some_threshold} {input} > {output}"
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    params:
        some_threshold=lookup(
            query="dataset == '{dataset}'",
            cols="threshold",
            within=sheet
        )
    shell:
        "some-tool --threshold {params.some_threshold} {input} > {output}"
def get_mytask_input(wildcards):
    if config["prefilter"]["activate"]:
        return "results/preprocessed/{dataset}"
    return "path/to/{dataset}"

rule mytask:
    input:
        get_mytask_input
    output:
        "result/{dataset}.txt"
    shell:
        "some-tool {input} > {output}"
rule mytask:
    input:
        branch(
            lookup(dpath="prefilter/activate", within=config),
            then="results/preprocessed/{dataset}",
            otherwise="path/to/{dataset}"
        )
    output:
        "result/{dataset}.txt"
    shell:
        "some-tool {input} > {output}"
Snakemake covers all aspects of fully reproducible, transparent, and adaptable data analysis.
https://snakemake.github.io