Reproducibility and beyond:
towards sustainable data science
Johannes Köster
Joint Meeting of the German
Research Training Groups
Dagstuhl 2023
dataset
results
Data analysis
"Let me do that by hand..."
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/3236000/1.gif)
dataset
results
dataset
dataset
dataset
dataset
dataset
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/3236000/1.gif)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/3235999/2.gif)
"Let me do that by hand..."
Data analysis
workflow management
- check computational validity
- apply same to new data
- check methodological validity
- understand what was done
Data analysis
Reproducibility
Transparency
- modify
- extend
Adaptability
scientific workflows
scientific software
scientific workflows
scientific workflows
package management
Software installation is heterogeneous
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
easy_install snakemake
./configure --prefix=/usr/local
make
make install
cp lib/amd64/jli/*.so lib
cp lib/amd64/*.so lib
cp * $PREFIX
cpan -i bioperl
cmake ../../my_project \
-DCMAKE_MODULE_PATH=~/devel/seqan/util/cmake \
-DSEQAN_INCLUDE_PATH=~/devel/seqan/include
make
make install
apt-get install bwa
yum install python-h5py
install.packages("matrixpls")
Package management with Conda/Mamba
package:
name: seqtk
version: 1.2
source:
fn: v1.2.tar.gz
url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz
requirements:
build:
- gcc
- zlib
run:
- zlib
about:
home: https://github.com/lh3/seqtk
license: MIT License
summary: Seqtk is a fast and lightweight tool for processing sequences
test:
commands:
- seqtk seq
#!/bin/bash
export C_INCLUDE_PATH=${PREFIX}/include
export LIBRARY_PATH=${PREFIX}/lib
make all
mkdir -p $PREFIX/bin
cp seqtk $PREFIX/bin
- source or binary
- recipe and build script
- package
Easy installation and maintenance:
no admin rights needed
mamba env create -f myenv.yaml -n myenv
Isolated environments:
channels:
- conda-forge
- nodefaults
dependencies:
- pandas ==0.20.3
- statsmodels ==0.8.0
- r-dplyr ==0.7.0
- r-base ==3.4.1
- python ==3.6.0
Package management with Conda/Mamba
- >8000 bioinformatics related packages (C, C++, Python, R, Perl, ...)
- >140 million downloads
- >650 contributors
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/7667604/paper.png)
Alternatives
- Nix
- Spack
- Containers
scientific workflows
workflow management
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646163/nature_genetics.gif)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646162/cancer_cell.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646179/nature_methods.gif)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646177/genome_biology.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646170/bioinformatics.gif)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646184/cell.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646193/molecular_cell.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/2646200/plos_genetics.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/5293194/ng50_7.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/5293229/nar_46_12.gif)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/5293228/isme_12_8.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/5293223/nm_3_2.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/5293204/nar_46_14.gif)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/5293277/msb_13_5.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/5293330/mbe_34_5.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/5293331/pnas_114_5.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/5293364/nbt-v34-n5.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/5293426/molcel_66_3.gif)
Snakemake:
>700k downloads since 2015
>2000 citations
>10 citations per week in 2022
Workflow management
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/10520720/pasted-from-clipboard.png)
dataset
results
dataset
dataset
dataset
dataset
dataset
Define workflows
in terms of rules
Define workflows
in terms of rules
rule mytask:
input:
"path/to/{dataset}.txt"
output:
"result/{dataset}.txt"
script:
"scripts/myscript.R"
rule myfiltration:
input:
"result/{dataset}.txt"
output:
"result/{dataset}.filtered.txt"
shell:
"mycommand {input} > {output}"
rule aggregate:
input:
"results/dataset1.filtered.txt",
"results/dataset2.filtered.txt"
output:
"plots/myplot.pdf"
script:
"scripts/myplot.R"
Define workflows
in terms of rules
Define workflows
in terms of rules
rule mytask:
input:
"data/{sample}.txt"
output:
"result/{sample}.txt"
conda:
"software-envs/some-tool.yaml"
shell:
"some-tool {input} > {output}"
rule name
how to create output from input
declare software environment
Boilerplate-free integration of scripts
rule mytask:
input:
"data/{sample}.txt"
output:
"result/{sample}.txt"
script:
"scripts/myscript.py"
reusable scripts:
- Python
- R
- Julia
- Rust
import pandas as pd
data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")
Python:
data <- read.table(snakemake@input[[1]])
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])
Boilerplate-free integration of scripts
R:
import polar as pl
pl.read_csv(&snakemake.input[0])
.sort()
.to_csv(&snakemake.output[0])
Rust:
Directed acyclic graph (DAG) of jobs
- MILP based scheduler
- graph partitioning
+
scale to any platform without adapting workflow
Alternatives
- Nextflow
- Galaxy
- Airflow
- KNIME
- Hadoop
- ...
Publication reproducibility
"The Reproducibility Project: Cancer Biology has so far managed to replicate the main findings in only 5 of 17 highly cited articles, and a replication of 21 social-sciences articles in Science and Nature had a success rate of between 57 and 67%."
Amaral & Neves, Nature 2021
Problem:
- rarely a topic in journal guidelines
- how to review?
- delegate to community (review or reproduction itself)?
- what could be incentives for reproducing the work of others?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/3236172/nature-guidelines.png)
Example Nature:
Publication reproducibility
Publication transparency
Problem:
How to connect results (i.e. figures and tables) with code, parameters, used software?
Snakemake reports
scientific software
The crisis of scientific software
- incomplete documentation
- inefficient programming
- incomplete testing
- little to no maintenance
Example
ISMB 2016 "Wall of Shame"
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/3236446/ISMB-wall-of-shame.png)
Of 47 open-access publications, ...
Reasons
- temporary employments
- time pressure
- code itself, software quality not a widely recognized scientific output
- PIs occupied with other duties
But
Many resources are available, we just have to use them!
Style guides
def make_complex(*args):
x, y = args
return dict(**locals())
def make_complex(x, y):
return {'x': x, 'y': y}
vs
Automatic code analysis tools:
- linter: e.g. Ruff (https://github.com/charliermarsh/ruff)
- auto-formatter: e.g. Black (https://github.com/psf/black)
Code duplication and complexity
Instead:
- inheritance
- delegation
- data classes
- singletons
Automatic code analysis tools:
e.g. Lizard (https://github.com/terryyin/lizard)
Reinventing wheels
data = dict()
with open("somefile.tsv") as f:
for line in f:
line = line.split("\t")
key, fields = line[0], line[1:]
data[key] = [float(field) for field in fields]
import pandas as pd
data = pd.read_table("sometable.tsv", index_col=0)
vs
The choice of the programming language
Goal:
efficient, error-free, readable code
Classical compiled languages (C, C++, ...):
- very efficient
- complex and error prone (memory management, thread safety, ...)
Wikipedia
The choice of the programming language
Scripting languages (Python, Ruby, ...):
- slow
- no type safety
- only runtime errors
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/3240945/lang-bench.png)
https://attractivechaos.github.io/plb/
The choice of the programming language
Rust
- simplified, C++-like syntax
- automatic type inference
- elements from scripting languages
- very strict compiler with guaranteed memory and thread safety
- elected "most loved programming language" on stack overflow since 8 years
Result:
- Development time shifts from debugging to compiling.
- massively reduced maintenance effort
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/3237028/rust-bio.png)
Continuous integration
Github Actions
jobs:
Formatting:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
with:
# Full git history is needed to get a proper
# list of changed files within `super-linter`
fetch-depth: 0
- name: Formatting
uses: github/super-linter@v4
env:
VALIDATE_ALL_CODEBASE: false
DEFAULT_BRANCH: master
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
VALIDATE_SNAKEMAKE_SNAKEFMT: true
Linting:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Lint workflow
uses: snakemake/snakemake-github-action@v1.24.0
with:
directory: .
snakefile: workflow/Snakefile
args: "--lint"
Testing:
runs-on: ubuntu-latest
needs:
- Linting
- Formatting
steps:
- uses: actions/checkout@v2
- name: Test workflow (local FASTQs)
uses: snakemake/snakemake-github-action@v1
with:
directory: .test
snakefile: workflow/Snakefile
args: "--configfile .test/config-simple/config.yaml --use-conda"
![](https://s3.amazonaws.com/media-p.slid.es/uploads/362168/images/10524962/pasted-from-clipboard.png)
Conclusion
reproducibility
+ transparency
+ adaptability
= sustained value for authors and community
a long way to go
but the tools are there
It has never been easier!
deck
By Johannes Köster
deck
Keynote at Dagstuhl 2023
- 536