Johannes Köster
2023
[Figure: datasets are processed into results; "Let me do that by hand..."]
Reproducibility
Transparency
Adaptability
>700k downloads since 2015
>2000 citations
>10 citations per week in 2022
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"
rule mytask:  # rule name
    # define input and output files
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    # how to create output from input
    shell:
        "some-tool {input} > {output}"
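To illustrate how a wildcard such as {sample} ties output to input, here is a rough sketch that matches a requested file against the output pattern and fills the input pattern with the inferred value. The helper names pattern_to_regex and infer_job are hypothetical, and the sketch assumes each wildcard occurs only once per pattern; it is not Snakemake's implementation.

```python
import re


def pattern_to_regex(pattern):
    """Turn 'result/{sample}.txt' into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        # Odd positions hold wildcard names, even positions hold literal text.
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    return re.compile(regex + "$")


def infer_job(target, output_pattern, input_pattern):
    """Return (wildcards, input file) if the rule can produce target, else None."""
    match = pattern_to_regex(output_pattern).match(target)
    if match is None:
        return None
    wildcards = match.groupdict()
    return wildcards, input_pattern.format(**wildcards)


job = infer_job("result/A.txt", "result/{sample}.txt", "data/{sample}.txt")
print(job)
```

Requesting result/A.txt binds sample to A, so the job needs data/A.txt; a target that does not match the output pattern yields no job.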
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"
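reusable scripts:

Python:
import pandas as pd

# the snakemake object is provided by Snakemake when the script runs
data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")

R:
# read the header so that the id column can be addressed by name
data <- read.table(snakemake@input[[1]], header = TRUE)
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])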
Rust:
// schematic pseudocode, not the exact polars API
import polar as pl
pl.read_csv(&snakemake.input[0])
    .sort()
    .to_csv(&snakemake.output[0])
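Scripts like these can be tried outside a workflow by passing in a stand-in for the snakemake object. The sketch below uses only the standard library instead of pandas; sort_by_id and the sample table are hypothetical, and the id column is assumed to hold integers.

```python
import csv
import os
import tempfile
from types import SimpleNamespace


def sort_by_id(snakemake):
    """Sort a tab-separated table by its integer 'id' column."""
    with open(snakemake.input[0], newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        fieldnames = reader.fieldnames
        rows = sorted(reader, key=lambda row: int(row["id"]))
    with open(snakemake.output[0], "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.tsv")
    dst = os.path.join(tmp, "out.tsv")
    with open(src, "w", newline="") as f:
        f.write("id\tvalue\n3\tc\n1\ta\n2\tb\n")
    # SimpleNamespace mimics only the two attributes the script uses.
    sort_by_id(SimpleNamespace(input=[src], output=[dst]))
    with open(dst, newline="") as f:
        sorted_ids = [row["id"] for row in csv.DictReader(f, delimiter="\t")]

print(sorted_ids)
```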
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    notebook:
        "notebooks/mynotebook.ipynb"
reusable wrappers from central repository:

rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"
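# temp(): the output is deleted once all jobs consuming it have finished
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        temp("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"


# protected(): the output is write-protected after it has been created
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        protected("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"


# pipe(): the output is streamed to the consuming job through a named pipe
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        pipe("result/{sample}.txt")
    shell:
        "some-tool {input} > {output}"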
configfile: "config/config.yaml"


module dna_seq:
    snakefile:
        github("snakemake-workflows/dna-seq-gatk-variant-calling", path="workflow/Snakefile", tag="v1.17.0")
    config:
        config


use rule * from dna_seq


# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"
configfile: "config/config.yaml"


module dna_seq:
    snakefile:
        github("snakemake-workflows/dna-seq-gatk-variant-calling", path="workflow/Snakefile", tag="v1.17.0")
    config:
        config["dna-seq"]


use rule * from dna_seq as dna_seq_*


# easily extend the workflow
rule plot_vafs:
    input:
        "filtered/all.vcf.gz"
    output:
        "results/plots/vafs.svg"
    notebook:
        "notebooks/plot-vafs.py.ipynb"


module rna_seq:
    snakefile:
        github("snakemake-workflows/rna-seq-kallisto-sleuth", path="workflow/Snakefile", tag="v1.0.0")
    config:
        config["rna-seq"]


use rule * from rna_seq as rna_seq_*
[Figure: job scheduling. Annotated terms: job selection, job resource usage, free resources, job temp file consumption, temp file lifetime fraction, job priority, job thread usage, temp file size, temp file deletion]
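A toy version of the job selection step: pick the set of ready jobs that fits into the free resources and maximizes priority, then thread usage. This sketch searches all subsets by brute force, which only works for a handful of jobs and is a stand-in for a proper optimizer; it also ignores the temp file terms. The job records and resource limits are hypothetical.

```python
from itertools import combinations

# Hypothetical ready jobs: (name, priority, threads, mem_mb)
jobs = [
    ("a", 1, 4, 4000),
    ("b", 2, 2, 2000),
    ("c", 1, 2, 6000),
    ("d", 3, 1, 1000),
]
free = {"threads": 4, "mem_mb": 8000}


def select_jobs(jobs, free):
    """Return names of the feasible subset with the best (priority sum, threads)."""
    best, best_score = (), (0, 0)
    for k in range(1, len(jobs) + 1):
        for subset in combinations(jobs, k):
            threads = sum(job[2] for job in subset)
            mem = sum(job[3] for job in subset)
            if threads > free["threads"] or mem > free["mem_mb"]:
                continue  # does not fit into the free resources
            score = (sum(job[1] for job in subset), threads)
            if score > best_score:
                best, best_score = subset, score
    return [job[0] for job in best]


selected = select_jobs(jobs, free)
print(selected)
```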
--groups a=g1 b=g1
--groups a=g1 b=g1 --group-components g1=2
--groups a=g1 b=g1 --group-components g1=5
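With --groups a=g1 b=g1, each connected chain of jobs from rules a and b is submitted as one job; --group-components g1=N additionally packs up to N such chains into a single submission. A small sketch of that batching, assuming five hypothetical samples and a hypothetical helper batch_components:

```python
def batch_components(components, per_job):
    """Pack connected components of a group into submissions of at most per_job each."""
    return [components[i:i + per_job] for i in range(0, len(components), per_job)]


# Five hypothetical samples, each giving one connected a -> b chain in group g1.
components = [f"sample{i}" for i in range(1, 6)]

for per_job in (1, 2, 5):
    batches = batch_components(components, per_job)
    print(f"g1={per_job}: {len(batches)} submitted jobs")
```

Larger values trade fewer, longer submissions against less parallelism across cluster nodes.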
workstation
compute server
cluster
grid computing
cloud computing
[Figure: datasets to results, with shared data]
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

# envs/some-tool.yaml
channels:
  - conda-forge
dependencies:
  - some-tool =2.3.1
  - some-lib =1.1.2
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    container:
        "docker://biocontainers/some-tool:2.3.1"
    shell:
        "some-tool {input} > {output}"
containerized:
    "docker://username/myworkflow:1.0.0"


rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

# shell: generate a Dockerfile that provides all conda environments of the workflow
snakemake --containerize > Dockerfile
Input:

name: My oscar report
default-view: oscars
datasets:
  oscars:
    path: "data/oscars.csv"
    links:
      link to oscar plot:
        column: age
        view: oscar-plot
      link to movie:
        column: movie
        table-row: movies/Title
  movies:
    path: "data/movies.csv"
    links:
      link to oscar entry:
        column: Title
        table-row: oscars/movie
views:
  oscars:
    dataset: oscars
    desc: |
      ### All winning oscars beginning in the year 1929.
      This table contains *all* winning oscars for best actress and best actor.
    page-size: 25
    render-table:
      columns:
        age:
          plot:
            ticks:
              scale: linear
              domain:
                - 20
                - 100
        name:
          link-to-url: "https://lmgtfy.app/?q=Is {name} in {movie}?"
        movie:
          link-to-url: "https://de.wikipedia.org/wiki/{value}"
        award:
          plot:
            heatmap:
              scale: ordinal
              domain:
                - Best actor
                - Best actress
              range:
                - "#add8e6"
                - "#ffb6c1"
        index(0):
          display-mode: hidden
        regex('birth_.+'):
          display-mode: detail
  movies:
    dataset: movies
    render-table:
      columns:
        Genre:
          ellipsis: 15
        imdbID:
          link-to-url: "https://www.imdb.com/title/{value}/"
        Title:
          link-to-url: "https://de.wikipedia.org/wiki/{value}"
        imdbRating:
          precision: 1
          plot:
            bars:
              scale: linear
              domain:
                - 1
                - 10
        Rated:
          plot-view-legend: true
          plot:
            heatmap:
              scale: ordinal
              color-scheme: accent
  oscar-plot:
    dataset: oscars
    desc: |
      ## My beautiful oscar scatter plot
      *So many great actors and actresses*
    render-plot:
      spec-path: ".examples/specs/oscars.vl.json"
  movies-plot:
    dataset: movies
    desc: |
      All movies with its *runtime* and *ratings* plotted over *time*.
    render-plot:
      spec-path: ".examples/specs/movies.vl.json"

Output: [figure]
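The link-to-url entries in the configuration above are templates: {value} stands for the cell itself, and other names such as {name} or {movie} refer to further columns of the same row. A rough Python sketch of that substitution; render_link is a hypothetical helper, the row values are made up for illustration, and the URL quoting is an assumption of this sketch rather than documented Datavzrd behavior.

```python
from urllib.parse import quote


def render_link(template, row, column):
    """Fill {value} with the cell of `column` and {other} with other cells of the row."""
    fields = {key: quote(str(val)) for key, val in row.items()}
    fields["value"] = quote(str(row[column]))
    return template.format(**fields)


# Hypothetical rows, not taken from the example datasets.
movie_row = {"Title": "Some Movie", "imdbID": "tt0000000"}
oscar_row = {"name": "Jane Doe", "movie": "Some Movie"}

imdb_url = render_link("https://www.imdb.com/title/{value}/", movie_row, "imdbID")
search_url = render_link("https://lmgtfy.app/?q=Is {name} in {movie}?", oscar_row, "name")
print(imdb_url)
print(search_url)
```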
Snakemake covers all aspects of fully reproducible, transparent, and adaptable data analysis.
Datavzrd can be used to rapidly obtain visual interactive interfaces to tabular results.
https://snakemake.github.io
https://github.com/datavzrd/datavzrd
transforming towards a plugin architecture
language updates
import pandas as pd

samples = pd.read_csv("samples.tsv", sep="\t")


configfile: "config.yaml"


rule all:
    input:
        "results/switch~someswitch.column~sample.txt",


rule a:
    output:
        "a/{sample}.txt",
    shell:
        "echo a > {output}"


rule b:
    input:
        branch(evaluate("{sample} == '100'"), then="a/{sample}.txt"),
    output:
        "b/{sample}.txt",
    shell:
        "echo b > {output}"


rule c:
    input:
        branch(
            evaluate("{sample} == '1'"),
            then="a/{sample}.txt",
            otherwise="b/{sample}.txt",
        ),
    output:
        "c/{sample}.txt",
    shell:
        "cat {input} > {output}"


rule d:
    output:
        "test.txt",
    shell:
        "echo d > {output}"


rule e:
    input:
        collect("c/{item.sample}.txt", item=lookup(query="{col} <= 2", within=samples)),
        branch(lookup(dpath="switches/{switch}", within=config), then="test.txt"),
    output:
        "results/switch~{switch}.column~{col}.txt",
    shell:
        "cat {input} > {output}"