Johannes Köster
REAML Summer School 2022
dataset
results
"Let me do that by hand..."
dataset
results
dataset
dataset
dataset
dataset
dataset
"Let me do that by hand..."
Reproducibility
Transparency
Adaptability
>513k downloads since 2015
>1500 citations
>7 citations per week in 2021
Reproducibility
Transparency
Adaptability
Reproducibility
Transparency
Adaptability
dataset
results
dataset
dataset
dataset
dataset
dataset
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
rule myfiltration:
input:
"result/{dataset}.txt"
output:
"result/{dataset}.filtered.txt"
shell:
"mycommand {input} > {output}"
rule aggregate:
input:
"results/dataset1.filtered.txt",
"results/dataset2.filtered.txt"
output:
"plots/myplot.pdf"
script:
"scripts/myplot.R"
rule mytask:
input:
"data/{sample}.txt"
output:
"result/{sample}.txt"
shell:
"some-tool {input} > {output}"
rule name
how to create output from input
define
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
rule myfiltration:
input:
"result/{dataset}.txt"
output:
"result/{dataset}.filtered.txt"
shell:
"mycommand {input} > {output}"
rule aggregate:
input:
"results/dataset1.filtered.txt",
"results/dataset2.filtered.txt"
output:
"plots/myplot.pdf"
script:
"scripts/myplot.R"
rule mytask:
input:
"data/{sample}.txt"
output:
"result/{sample}.txt"
script:
"scripts/myscript.py"
reusable scripts:
import pandas as pd
data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")
Python:
data <- read.table(snakemake@input[[1]])
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])
R:
import polar as pl
pl.read_csv(&snakemake.input[0])
.sort()
.to_csv(&snakemake.output[0])
Rust:
rule mytask:
input:
"data/{sample}.txt"
output:
"result/{sample}.txt"
notebook:
"notebooks/mynotebook.ipynb"
rule map_reads:
input:
"{sample}.bam"
output:
"{sample}.sorted.bam"
wrapper:
"0.22.0/bio/samtools/sort"
reuseable wrappers from central repository
rule mytask:
input:
"data/{sample}.txt"
output:
temp("result/{sample}.txt")
shell:
"some-tool {input} > {output}"
rule mytask:
input:
"data/{sample}.txt"
output:
protected("result/{sample}.txt")
shell:
"some-tool {input} > {output}"
rule mytask:
input:
"data/{sample}.txt"
output:
pipe("result/{sample}.txt")
shell:
"some-tool {input} > {output}"
configfile: "config/config.yaml"
module dna_seq:
snakefile:
"https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/raw/v2.0.1/Snakefile"
config:
config
use rule * from dna_seq
# easily extend the workflow
rule plot_vafs:
input:
"filtered/all.vcf.gz"
output:
"results/plots/vafs.svg"
notebook:
"notebooks/plot-vafs.py.ipynb"
configfile: "config/config.yaml"
module dna_seq:
snakefile:
"https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/raw/v2.0.1/Snakefile"
config:
config["dna-seq"]
use rule * from dna_seq as dna_seq_*
# easily extend the workflow
rule plot_vafs:
input:
"filtered/all.vcf.gz"
output:
"results/plots/vafs.svg"
notebook:
"notebooks/plot-vafs.py.ipynb"
module rna_seq:
snakefile:
"https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/raw/v1.0.0/Snakefile"
config:
config["rna-seq"]
use rule * from rna_seq as rna_seq_*
Reproducibility
Transparency
Adaptability
job selection
job resource usage
free resources
job temp file consumption
temp file lifetime fraction
job priority
job thread usage
temp file size
temp file deletion
--groups a=g1 b=g1
--groups a=g1 b=g1 --group-components g1=2
--groups a=g1 b=g1 --group-components g1=5
workstation
compute server
cluster
grid computing
cloud computing
dataset
results
dataset
dataset
dataset
dataset
dataset
shared data
Reproducibility
Transparency
Adaptability
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
conda:
"envs/some-tool.yaml"
shell:
"some-tool {input} > {output}"
channels:
- conda-forge
dependencies:
- some-tool =2.3.1
- some-lib =1.1.2
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
container:
"docker://biocontainers/some-tool#2.3.1"
shell:
"some-tool {input} > {output}"
containerized:
"docker://username/myworkflow:1.0.0"
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
conda:
"envs/some-tool.yaml"
shell:
"some-tool {input} > {output}"
snakemake --containerize > Dockerfile
Reproducibility
Transparency
Adaptability
https://snakemake.github.io