Sustainable data analysis with Snakemake
2022
Data analysis
- Reproducibility: check computational validity; apply the same analysis to new data
- Transparency: check methodological validity; understand the analysis
- Adaptability: modify the analysis; extend the analysis
>428k downloads since 2015
>1400 citations
>7 citations per week in 2021
Data analysis needs:
- automation
- scalability
- portability
- readability
- documentation
- traceability
These properties enable reproducibility and transparency; readability, portability, and scalability in particular enable adaptability.
Define workflows in terms of rules

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:
        "some-tool {input} > {output}"
A rule has a name and describes how to create its output from its input. A rule can define (see the sketch below):
- input
- output
- log files
- parameters
- resources
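A single rule can declare several of these at once; a minimal sketch (the tool name, parameter, and resource values here are illustrative, not from the slides):

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    log:
        "logs/mytask/{sample}.log"
    params:
        quality=20
    threads: 4
    resources:
        mem_mb=4000
    shell:
        "some-tool --quality {params.quality} --threads {threads} {input} > {output} 2> {log}"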
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "results/{dataset}.txt"
    script:
        "scripts/myscript.R"

rule myfiltration:
    input:
        "results/{dataset}.txt"
    output:
        "results/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"

rule aggregate:
    input:
        "results/dataset1.filtered.txt",
        "results/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"
Boilerplate-free integration of scripts
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"
reusable scripts:
- Python
- R
- Julia
- Rust
Python:

import pandas as pd

data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")

R:

data <- read.table(snakemake@input[[1]])
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])
Rust:

// sketch using the polars crate; the exact polars API varies by version
use polars::prelude::*;

let mut df = CsvReader::from_path(&snakemake.input[0])?.finish()?;
df = df.sort(["id"], false)?;
CsvWriter::new(std::fs::File::create(&snakemake.output[0])?).finish(&mut df)?;
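Besides input and output, the injected snakemake object exposes wildcards, params, log, and threads; a small Python sketch (the added column is illustrative):

import pandas as pd

data = pd.read_table(snakemake.input[0])
# wildcard values are available by name, as declared in the rule's paths
data["sample"] = snakemake.wildcards.sample
data.to_csv(snakemake.output[0], sep="\t")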
Jupyter notebook integration
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    notebook:
        "notebooks/mynotebook.ipynb"
- Integrated interactive edit mode.
- Automatic generalization for reuse in other jobs.
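The interactive edit mode is entered from the command line by requesting a concrete output file:

snakemake --cores 1 --edit-notebook result/sample1.txt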
Reusable wrappers

rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"

Reusable wrappers from a central repository.
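Wrappers accept the usual rule directives such as params and threads; a sketch (whether an extra parameter is honored depends on the individual wrapper, so consult its documentation):

rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    params:
        extra="-m 4G"  # illustrative extra argument
    threads: 8
    wrapper:
        "0.22.0/bio/samtools/sort"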
Using and combining workflows

configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        "https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/raw/v2.0.1/Snakefile"
    config:
        config

use rule * from dna_seq

# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"
Multiple workflows can be combined by namespacing their rules:
configfile: "config/config.yaml"

module dna_seq:
    snakefile:
        "https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/raw/v2.0.1/Snakefile"
    config:
        config["dna-seq"]

use rule * from dna_seq as dna_seq_*

# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"

module rna_seq:
    snakefile:
        "https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/raw/v1.0.0/Snakefile"
    config:
        config["rna-seq"]

use rule * from rna_seq as rna_seq_*
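Imported rules can also be overridden selectively; a sketch (the rule name trim_reads and its parameter are hypothetical, not taken from the referenced workflow):

# override single directives of a rule imported from a module
use rule trim_reads from dna_seq as dna_seq_trim_reads with:
    params:
        extra="--quality-cutoff 25"  # hypothetical parameter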
Scheduling

Job selection takes into account:
- job priority
- job resource usage vs. free resources
- job thread usage
- temp file consumption, size, lifetime fraction, and deletion

DAG partitioning:
--groups a=g1 b=g1
--groups a=g1 b=g1 --group-components g1=2
--groups a=g1 b=g1 --group-components g1=5
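Groups can also be declared inside the workflow instead of on the command line; a sketch (rule body and tool are illustrative):

rule b:
    input:
        "results/{sample}.txt"
    output:
        "results/{sample}.processed.txt"
    group: "g1"
    shell:
        "some-tool {input} > {output}"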
Scalable to any platform
- workstation
- compute server
- cluster
- grid computing
- cloud computing
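Moving between platforms requires no workflow changes, only different invocations; for example (Snakemake <=7 cluster syntax shown):

# workstation
snakemake --cores 8

# cluster, submitting up to 100 parallel jobs via SLURM
snakemake --jobs 100 --cluster "sbatch --mem {resources.mem_mb}"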
Between workflow caching

[Figure: several workflows analyze the same datasets; results of equivalent intermediate steps are stored in and reused from a shared data location.]
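Caching is opt-in per rule and enabled at runtime; a sketch (paths and URL are illustrative):

rule download_reference:
    output:
        "resources/genome.fa"
    cache: True
    shell:
        "curl -L https://example.org/genome.fa > {output}"

# point all workflows at the same cache location, then enable caching
export SNAKEMAKE_OUTPUT_CACHE=/shared/snakemake-cache
snakemake --cores 8 --cache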
Conda integration

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

envs/some-tool.yaml:

channels:
  - conda-forge
dependencies:
  - some-tool =2.3.1
  - some-lib =1.1.2
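The declared environment is created and activated automatically when the workflow is run with:

snakemake --cores 8 --use-conda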
Container integration

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool:2.3.1"
    shell:
        "some-tool {input} > {output}"
Containerization

containerized:
    "docker://username/myworkflow:1.0.0"

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

snakemake --containerize > Dockerfile
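The generated Dockerfile bundles all Conda environments of the workflow; building and pushing it under the name from the containerized directive makes the workflow fully portable (standard Docker commands):

docker build -t username/myworkflow:1.0.0 .
docker push username/myworkflow:1.0.0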
Self-contained HTML reports
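A report is generated after a successful run with:

snakemake --report report.html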
Snakemake workflow catalog
Features
- human-readable language
- ad-hoc script integration
- Jupyter notebook integration
- high scalability
- caching of shared resources
- Conda and container integration
- modularization (wrappers, workflows)
- data-dependent conditional execution (see the checkpoint sketch below)
- streaming/piping between jobs
- service jobs (providing a shared memory device or a database)
- helpers for scatter-gather
- helpers for parameter exploration
- integrated benchmarking
- remote file support (S3, FTP, Zenodo, ...)
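As an example of data-dependent conditional execution, a checkpoint triggers re-evaluation of the DAG once its output is known; a sketch following the pattern from the Snakemake documentation (tool names are illustrative):

import os

# a checkpoint's output files are only known after it has run
checkpoint cluster:
    input:
        "data/{sample}.txt"
    output:
        directory("clustered/{sample}")
    shell:
        "cluster-tool {input} {output}"

# this input function is re-evaluated once the checkpoint has finished
def clustered_files(wildcards):
    outdir = checkpoints.cluster.get(**wildcards).output[0]
    return expand(
        "clustered/{sample}/{i}.txt",
        sample=wildcards.sample,
        i=glob_wildcards(os.path.join(outdir, "{i}.txt")).i,
    )

rule aggregate_clusters:
    input:
        clustered_files
    output:
        "results/{sample}.aggregated.txt"
    shell:
        "cat {input} > {output}"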
Snakemake highlights (15min)
By Johannes Köster