Johannes Köster
Joint Meeting of the German
Research Training Groups
Dagstuhl 2023
[Figure: dataset -> results, "Let me do that by hand..."; then many datasets, "Let me do that by hand..."]
workflow management
Reproducibility
Transparency
Adaptability
scientific workflows
scientific software
package management
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
easy_install snakemake
./configure --prefix=/usr/local
make
make install
cp lib/amd64/jli/*.so lib
cp lib/amd64/*.so lib
cp * $PREFIX
cpan -i bioperl
cmake ../../my_project \
    -DCMAKE_MODULE_PATH=~/devel/seqan/util/cmake \
    -DSEQAN_INCLUDE_PATH=~/devel/seqan/include
make
make install
apt-get install bwa
yum install python-h5py
install.packages("matrixpls")
package:
  name: seqtk
  version: 1.2
source:
  fn: v1.2.tar.gz
  url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz
requirements:
  build:
    - gcc
    - zlib
  run:
    - zlib
about:
  home: https://github.com/lh3/seqtk
  license: MIT License
  summary: Seqtk is a fast and lightweight tool for processing sequences
test:
  commands:
    - seqtk seq
#!/bin/bash
export C_INCLUDE_PATH=${PREFIX}/include
export LIBRARY_PATH=${PREFIX}/lib
make all
mkdir -p $PREFIX/bin
cp seqtk $PREFIX/bin
Easy installation and maintenance:
no admin rights needed
mamba env create -f myenv.yaml -n myenv
Isolated environments:
channels:
  - conda-forge
  - nodefaults
dependencies:
  - pandas ==0.20.3
  - statsmodels ==0.8.0
  - r-dplyr ==0.7.0
  - r-base ==3.4.1
  - python ==3.6.0
workflow management
Snakemake:
>700k downloads since 2015
>2000 citations
>10 citations per week in 2022
[Figure: many datasets -> results]
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"
rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"
rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    conda:
        "software-envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"
rule name
how to create output from input
declare software environment
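The {dataset} and {sample} placeholders in the rules above are wildcards: when a target file is requested, its name is matched against a rule's output pattern, and the values found are substituted into the input pattern. The following Python sketch illustrates that idea with a plain regular expression. It is a simplified illustration, not Snakemake's actual implementation, and the helper names match_output and fill are hypothetical.

```python
import re


def match_output(pattern, target):
    """Return the wildcard values if target matches pattern, else None.

    Simplification: each wildcard name may appear only once per pattern.
    """
    regex = ""
    pos = 0
    # Turn "result/{dataset}.txt" into a regex with one named group per wildcard.
    for m in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:m.start()])
        regex += f"(?P<{m.group(1)}>.+)"
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


def fill(pattern, wildcards):
    """Substitute wildcard values into an input pattern."""
    return pattern.format(**wildcards)


# Which dataset is needed to produce this requested file?
wildcards = match_output("result/{dataset}.filtered.txt", "result/dataset1.filtered.txt")
print(wildcards)                                # {'dataset': 'dataset1'}
print(fill("result/{dataset}.txt", wildcards))  # result/dataset1.txt
```

Requesting plots/myplot.pdf therefore only requires naming the filtered files; the chain of rules needed to produce them follows from the file names.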
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"
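reusable scripts:
Python:
import pandas as pd
data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
# index=False avoids writing the row index as an extra column
data.to_csv(snakemake.output[0], sep="\t", index=False)
R:
# header = TRUE is needed so that the "id" column can be addressed by name
data <- read.table(snakemake@input[[1]], header = TRUE)
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])
Rust:
// schematic pseudocode, not the literal polars API
import polars as pl
pl.read_csv(&snakemake.input[0])
    .sort()
    .to_csv(&snakemake.output[0])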
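Because the snakemake object only exists when a script runs inside a workflow, it can help to keep the actual logic in a plain function and pass the paths in. The sketch below uses only the standard library (csv instead of pandas) and a hypothetical stand-in object built with SimpleNamespace; the function name sort_table and the sample table are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path
from types import SimpleNamespace


def sort_table(in_path, out_path, key="id"):
    """Sort a tab-separated table by one column (values compare as strings)."""
    with open(in_path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        fieldnames = reader.fieldnames
        rows = sorted(reader, key=lambda row: row[key])
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in.txt"
    dst = Path(tmp) / "out.txt"
    src.write_text("id\tvalue\nb\t2\na\t1\n")
    # Hypothetical stand-in for the object Snakemake provides to scripts.
    snakemake = SimpleNamespace(input=[str(src)], output=[str(dst)])
    sort_table(snakemake.input[0], snakemake.output[0])
    result = dst.read_text()
    print(result)
```

Inside a workflow, the last two lines of the with block would be the whole script body, with snakemake supplied by the workflow instead of the stand-in.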
scale to any platform without adapting workflow
"The Reproducibility Project: Cancer Biology has so far managed to replicate the main findings in only 5 of 17 highly cited articles, and a replication of 21 social-sciences articles in Science and Nature had a success rate of between 57 and 67%."
Amaral & Neves, Nature 2021
Problem:
Example Nature:
Problem:
How to connect results (i.e. figures and tables) with the code, parameters, and software used to produce them?
Of 47 open-access publications, ...
Many resources are available; we just have to use them!
def make_complex(*args):
    x, y = args
    return dict(**locals())
vs
def make_complex(x, y):
    return {'x': x, 'y': y}
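Running both variants side by side shows why the explicit version is preferable: the implicit one also leaks the args tuple into the result, because locals() contains every local name. The two functions are renamed here (a hypothetical choice) so that they can coexist in one file.

```python
def make_complex_implicit(*args):
    x, y = args
    # locals() holds 'args' as well as 'x' and 'y'
    return dict(**locals())


def make_complex_explicit(x, y):
    return {"x": x, "y": y}


print(make_complex_implicit(1, 2))  # {'args': (1, 2), 'x': 1, 'y': 2}
print(make_complex_explicit(1, 2))  # {'x': 1, 'y': 2}
```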
Instead:
Automatic code analysis tools:
e.g. Lizard (https://github.com/terryyin/lizard)
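To give a feeling for what such tools measure, here is a minimal sketch that counts branching constructs per function with Python's ast module. It is a crude approximation of cyclomatic complexity written for illustration only; it is not how Lizard works internally, and the function name and the set of counted node types are assumptions.

```python
import ast

# Node types treated as one additional path through the code (an assumption).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)


def complexity_per_function(source):
    """Map each function name to 1 + the number of branching nodes inside it.

    Nested functions are also counted towards their enclosing function.
    """
    scores = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            branches = sum(isinstance(child, BRANCH_NODES) for child in ast.walk(node))
            scores[node.name] = 1 + branches
    return scores


example = '''
def simple(x):
    return x + 1

def branchy(values):
    total = 0
    for v in values:
        if v > 0 and v % 2 == 0:
            total += v
        elif v < 0:
            total -= v
    return total
'''

print(complexity_per_function(example))  # {'simple': 1, 'branchy': 5}
```

A high score for a function is a hint that it may be worth splitting up or simplifying before others have to read it.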
data = dict()
with open("somefile.tsv") as f:
    for line in f:
        line = line.split("\t")
        key, fields = line[0], line[1:]
        data[key] = [float(field) for field in fields]
vs
import pandas as pd
data = pd.read_table("sometable.tsv", index_col=0)
Goal:
efficient, error-free, readable code
Classical compiled languages (C, C++, ...):
Wikipedia
Scripting languages (Python, Ruby, ...):
https://attractivechaos.github.io/plb/
Result:
GitHub Actions
jobs:
  Formatting:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          # Full git history is needed to get a proper
          # list of changed files within `super-linter`
          fetch-depth: 0
      - name: Formatting
        uses: github/super-linter@v4
        env:
          VALIDATE_ALL_CODEBASE: false
          DEFAULT_BRANCH: master
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          VALIDATE_SNAKEMAKE_SNAKEFMT: true
  Linting:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Lint workflow
        uses: snakemake/snakemake-github-action@v1.24.0
        with:
          directory: .
          snakefile: workflow/Snakefile
          args: "--lint"
  Testing:
    runs-on: ubuntu-latest
    needs:
      - Linting
      - Formatting
    steps:
      - uses: actions/checkout@v2
      - name: Test workflow (local FASTQs)
        uses: snakemake/snakemake-github-action@v1
        with:
          directory: .test
          snakefile: workflow/Snakefile
          args: "--configfile .test/config-simple/config.yaml --use-conda"
reproducibility
+ transparency
+ adaptability
= sustained value for authors and community
a long way to go
but the tools are there
It has never been easier!