Transparency, reproducibility and the democratization of an ecosystem - the benefits of Snakemake 8
Johannes Köster
2024
[Figure: data analysis from dataset to results, first for one dataset, then for many; caption: "Let me do that by hand..."]
Data analysis
Reproducibility
- check computational validity
- apply same analysis to new data
Transparency
- check methodological validity
- understand analysis
Adaptability
- modify analysis
- extend analysis
>1 million downloads since 2015
>2700 citations
>11 citations per week in 2023
Data analysis
Reproducibility
- automation
- scalability
- portability
Transparency
- readability
- documentation
- traceability
Adaptability
- readability
- portability
- scalability
Define workflows in terms of rules
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"

rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"

rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"
Define workflows in terms of rules
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:
        "some-tool {input} > {output}"
- "rule mytask:" gives the rule name
- "shell:" states how to create output from input
- a rule can define: input, output, log files, parameters, resources
Automatic inference of DAG of jobs
- the rules mytask, myfiltration, and aggregate (above) are chained by matching each rule's input files against the output patterns of the other rules
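The chaining of rules into a DAG can be illustrated with a small Python sketch. This is not Snakemake's implementation: the `RULES` table, `match_output`, and `build_jobs` are hypothetical names, all intermediate files are assumed to live under `result/`, and wildcards are assumed to contain no dots or slashes so that each file has exactly one producing rule.

```python
import re

# Hypothetical rule table mirroring the three example rules:
# name -> (input patterns, output pattern).
RULES = {
    "mytask": (["path/to/{dataset}.txt"], "result/{dataset}.txt"),
    "myfiltration": (["result/{dataset}.txt"], "result/{dataset}.filtered.txt"),
    "aggregate": (
        ["result/dataset1.filtered.txt", "result/dataset2.filtered.txt"],
        "plots/myplot.pdf",
    ),
}


def match_output(pattern, path):
    """Return the wildcard values if path matches the output pattern, else None."""
    # re.split with a capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        # Assumption: wildcard values contain neither "/" nor ".".
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>[^/.]+)"
        for i, part in enumerate(parts)
    )
    m = re.fullmatch(regex, path)
    return m.groupdict() if m else None


def build_jobs(target, jobs=None):
    """Collect the (rule, wildcards) jobs needed for target, dependencies first."""
    jobs = [] if jobs is None else jobs
    for name, (inputs, output) in RULES.items():
        wildcards = match_output(output, target)
        if wildcards is None:
            continue
        # Resolve each input with the inferred wildcards and recurse.
        for inp in inputs:
            build_jobs(inp.format(**wildcards), jobs)
        job = (name, tuple(sorted(wildcards.items())))
        if job not in jobs:
            jobs.append(job)
        return jobs
    # No rule produces the target: treat it as an existing source file.
    return jobs


for rule_name, wildcards in build_jobs("plots/myplot.pdf"):
    print(rule_name, dict(wildcards))
```

Requesting `plots/myplot.pdf` yields two mytask jobs and two myfiltration jobs (one per dataset) followed by a single aggregate job, which is the order a scheduler would need to respect.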
Boilerplate-free integration of scripts
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"
reusable scripts:
- Python
- R
- Julia
- Rust
- Bash
Python:
import pandas as pd
data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")
R:
data <- read.table(snakemake@input[[1]])
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])
Rust (schematic):
import polars as pl
pl.read_csv(&snakemake.input[0])
    .sort()
    .to_csv(&snakemake.output[0])
Reusable wrappers
rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"
reusable wrappers from central repository
Scheduling
[Figure labels: job selection, job priority, job thread usage, job resource usage, free resources, job temp file consumption, temp file lifetime fraction, temp file size, temp file deletion]
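The labels above name the quantities a scheduler has to balance when choosing which ready jobs to start. As a rough illustration only, the sketch below uses a simple greedy heuristic over priority and threads. It is a swapped-in simplification, not the formulation shown on the slide, and it ignores temp file size, lifetime, and deletion entirely. `Job` and `select_jobs` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int
    threads: int


def select_jobs(ready, free_threads):
    """Greedily pick ready jobs, highest priority first, while threads remain."""
    selected = []
    # Among equal priorities, prefer jobs that use more threads.
    for job in sorted(ready, key=lambda j: (-j.priority, -j.threads)):
        if job.threads <= free_threads:
            selected.append(job)
            free_threads -= job.threads
    return selected


ready = [Job("a", priority=1, threads=4), Job("b", priority=5, threads=6), Job("c", priority=1, threads=2)]
print([job.name for job in select_jobs(ready, free_threads=8)])
```

With 8 free threads, job b is started first because of its priority, job a no longer fits, and job c fills the remaining 2 threads.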
DAG partitioning
--groups a=g1 b=g1
--groups a=g1 b=g1 --group-components g1=2
--groups a=g1 b=g1 --group-components g1=5
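The effect of `--group-components` can be pictured as batching. The sketch below assumes that rules a and b form one connected a -> b chain per sample once they share group g1, and that the flag sets how many such chains go into one submitted job. `partition_components` and the ten-sample setup are hypothetical and only illustrate the arithmetic.

```python
def partition_components(components, per_submission=1):
    """Split a group's connected components into batches, one batch per submitted job."""
    return [
        components[i:i + per_submission]
        for i in range(0, len(components), per_submission)
    ]


# Hypothetical workflow: ten samples, each giving one connected a -> b chain in group g1.
chains = [[f"a[{s}]", f"b[{s}]"] for s in range(1, 11)]

for k in (1, 2, 5):
    batches = partition_components(chains, per_submission=k)
    print(f"g1={k}: {len(batches)} submitted jobs")
```

Under these assumptions, ten samples lead to 10, 5, and 2 submitted jobs for the three command lines above.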
Scalable to any platform
workstation
compute server
cluster
grid computing
cloud computing
Conda integration
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"
envs/some-tool.yaml:
channels:
  - conda-forge
dependencies:
  - some-tool =2.3.1
  - some-lib =1.1.2
Container integration
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool:2.3.1"
    shell:
        "some-tool {input} > {output}"
Self-contained HTML reports
Many more features
- dynamic DAG rewiring
- service jobs (providing sockets, loading databases, or ramdisks)
- semantic helper functions for minimizing boilerplate code
- fallible rules
- caching of shared results across workflows
- transparent handling of remote storage
- interoperability (CWL tool wrappers, integration of Nextflow workflows)
Extensible architecture
Language readability
Lookup in config: so far
def get_threshold(wildcards):
    return config["some_tool"]["thresholds"].get(wildcards.dataset, 0.1)

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    params:
        some_threshold=get_threshold
    shell:
        "some-tool {input} > {output}"
Lookup in config: now
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    params:
        some_threshold=lookup(
            dpath="some_tool/thresholds/{dataset}",
            within=config,
            default=0.1
        )
    shell:
        "some-tool {input} > {output}"
Lookup in sheet: so far
def get_threshold(wildcards):
    return sheet.loc[wildcards.dataset, "threshold"]

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    params:
        some_threshold=get_threshold
    shell:
        "some-tool {input} > {output}"
Lookup in sheet: now
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    params:
        some_threshold=lookup(
            query="dataset == '{dataset}'",
            cols="threshold",
            within=sheet
        )
    shell:
        "some-tool {input} > {output}"
Branching: so far
def get_mytask_input(wildcards):
    return (
        "results/preprocessed/{dataset}"
        if config["prefilter"]["activate"]
        else "path/to/{dataset}"
    )

rule mytask:
    input:
        get_mytask_input
    output:
        "result/{dataset}.txt"
    shell:
        "some-tool {input} > {output}"
Branching: now
rule mytask:
    input:
        branch(
            lookup(dpath="prefilter/activate", within=config),
            then="results/preprocessed/{dataset}",
            otherwise="path/to/{dataset}"
        )
    output:
        "result/{dataset}.txt"
    shell:
        "some-tool {input} > {output}"
Snakemake workflow catalog
Conclusion
Snakemake covers all aspects of fully reproducible, transparent, and adaptable data analysis, offering
- maximum readability
- ad-hoc integration with scripting and high performance languages
- an extensible architecture
- a plethora of advanced features
https://snakemake.github.io