Reproducibility and beyond:
towards sustainable data science
Johannes Köster
Joint Meeting of the German
Research Training Groups
Dagstuhl 2023
dataset
results
Data analysis
"Let me do that by hand..."
[Figure: many datasets, one set of results]
workflow management
Data analysis
Reproducibility
- check computational validity
- apply the same analysis to new data
Transparency
- check methodological validity
- understand what was done
Adaptability
- modify
- extend
scientific workflows
scientific software
package management
Software installation is heterogeneous
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
easy_install snakemake
./configure --prefix=/usr/local
make
make install
cp lib/amd64/jli/*.so lib
cp lib/amd64/*.so lib
cp * $PREFIX
cpan -i bioperl
cmake ../../my_project \
-DCMAKE_MODULE_PATH=~/devel/seqan/util/cmake \
-DSEQAN_INCLUDE_PATH=~/devel/seqan/include
make
make install
apt-get install bwa
yum install python-h5py
install.packages("matrixpls")
Package management with Conda/Mamba
package:
  name: seqtk
  version: 1.2
source:
  fn: v1.2.tar.gz
  url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz
requirements:
  build:
    - gcc
    - zlib
  run:
    - zlib
about:
  home: https://github.com/lh3/seqtk
  license: MIT License
  summary: Seqtk is a fast and lightweight tool for processing sequences
test:
  commands:
    - seqtk seq
#!/bin/bash
export C_INCLUDE_PATH=${PREFIX}/include
export LIBRARY_PATH=${PREFIX}/lib
make all
mkdir -p $PREFIX/bin
cp seqtk $PREFIX/bin
- source or binary
- recipe and build script
- package
Easy installation and maintenance:
no admin rights needed
mamba env create -f myenv.yaml -n myenv
Isolated environments:
channels:
  - conda-forge
  - nodefaults
dependencies:
  - pandas ==0.20.3
  - statsmodels ==0.8.0
  - r-dplyr ==0.7.0
  - r-base ==3.4.1
  - python ==3.6.0
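Every dependency in the environment file above carries an exact version, which is what allows the same environment to be re-created later. A minimal sketch of a check for entries that lack such a pin follows. The helper name `unpinned` and the rule that only `==` counts as a pin are assumptions made for illustration; they are not part of Conda or Mamba.

```python
def unpinned(dependencies):
    """Return the dependency specs that lack an exact '==' version pin."""
    loose = []
    for spec in dependencies:
        _name, sep, version = spec.partition("==")
        # A spec counts as pinned only if '==' is present and followed by a version.
        if not sep or not version.strip():
            loose.append(spec.strip())
    return loose


deps = [
    "pandas ==0.20.3",
    "statsmodels ==0.8.0",
    "r-dplyr ==0.7.0",
    "r-base ==3.4.1",
    "python ==3.6.0",
]
print(unpinned(deps))              # all entries are pinned
print(unpinned(deps + ["numpy"]))  # 'numpy' has no version pin
```

Such a check could run before an environment file is committed, so that loose specifications are noticed early.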
Package management with Conda/Mamba
- >8000 bioinformatics related packages (C, C++, Python, R, Perl, ...)
- >140 million downloads
- >650 contributors
Alternatives
- Nix
- Spack
- Containers
scientific workflows
workflow management
Snakemake:
>700k downloads since 2015
>2000 citations
>10 citations per week in 2022
Workflow management
Define workflows
in terms of rules
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"

rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"

rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"
Define workflows
in terms of rules
# rule name
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    # declare software environment
    conda:
        "software-envs/some-tool.yaml"
    # how to create output from input
    shell:
        "some-tool {input} > {output}"
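The `{sample}` placeholder lets one rule cover many files: a requested output path is matched against the output pattern, and the captured value is substituted into the input pattern. A minimal Python sketch of that idea follows. The helper names `pattern_to_regex` and `infer_wildcards` are hypothetical, each placeholder is assumed to occur only once per pattern, and this is not Snakemake's actual implementation.

```python
import re


def pattern_to_regex(pattern):
    """Turn a path pattern with {name} placeholders into a compiled regex."""
    # Splitting on a capturing group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)    # literal path text
        else:
            regex += f"(?P<{part}>.+)"  # placeholder becomes a named group
    return re.compile(regex)


def infer_wildcards(pattern, target):
    """Return the placeholder values that make the pattern equal the target, or None."""
    match = pattern_to_regex(pattern).fullmatch(target)
    return match.groupdict() if match else None


wildcards = infer_wildcards("result/{sample}.txt", "result/A.txt")
print(wildcards)                                # {'sample': 'A'}
print("data/{sample}.txt".format(**wildcards))  # data/A.txt
```

Requesting `result/A.txt` is thus enough to work out that `data/A.txt` is the required input.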
Boilerplate-free integration of scripts
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"
reusable scripts:
- Python
- R
- Julia
- Rust
Python:
import pandas as pd
data = pd.read_table(snakemake.input[0])
data = data.sort_values("id")
data.to_csv(snakemake.output[0], sep="\t")
R:
data <- read.table(snakemake@input[[1]], header = TRUE, sep = "\t")
data <- data[order(data$id),]
write.table(data, file = snakemake@output[[1]])
Rust (schematic):
import polars as pl
pl.read_csv(&snakemake.input[0])
    .sort()
    .to_csv(&snakemake.output[0])
Directed acyclic graph (DAG) of jobs
- MILP-based scheduler
- graph partitioning
+
scale to any platform without adapting the workflow
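To make the notion of a DAG of jobs concrete, the sketch below orders a small set of jobs so that each one runs only after the jobs it depends on. It uses plain topological sorting from Python's standard library (`graphlib`, Python 3.9 or later), not the MILP-based scheduler or graph partitioning named above. The job names and the helper `execution_batches` are hypothetical.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs whose output it needs (hypothetical names).
dependencies = {
    "mytask(dataset1)": set(),
    "mytask(dataset2)": set(),
    "myfiltration(dataset1)": {"mytask(dataset1)"},
    "myfiltration(dataset2)": {"mytask(dataset2)"},
    "aggregate": {"myfiltration(dataset1)", "myfiltration(dataset2)"},
}


def execution_batches(deps):
    """Group jobs into batches; jobs within one batch do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
        batches.append(ready)
        sorter.done(*ready)
    return batches


for batch in execution_batches(dependencies):
    print(batch)
```

Jobs in the same batch could run in parallel, which is the property a scheduler exploits when distributing work across cores or cluster nodes.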
Alternatives
- Nextflow
- Galaxy
- Airflow
- KNIME
- Hadoop
- ...
Publication reproducibility
"The Reproducibility Project: Cancer Biology has so far managed to replicate the main findings in only 5 of 17 highly cited articles, and a replication of 21 social-sciences articles in Science and Nature had a success rate of between 57 and 67%."
Amaral & Neves, Nature 2021
Problem:
- rarely a topic in journal guidelines
- how to review?
- delegate to community (review or reproduction itself)?
- what could be incentives for reproducing the work of others?
[Figure: example from Nature]
Publication transparency
Problem:
How to connect results (i.e. figures and tables) with the code, parameters, and software used to produce them?
Snakemake reports
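One way to picture the connection between a result and its origin is a small provenance record stored next to each figure or table. The sketch below is an illustration under stated assumptions only: the function name `provenance_record`, the field names, and the example parameter and tool version are hypothetical, and this is not how Snakemake reports are implemented.

```python
import hashlib
import json
import platform


def provenance_record(code, params, software):
    """Bundle the information needed to trace a result back to its origin."""
    return {
        # A hash identifies the exact code version without storing it twice.
        "code_sha256": hashlib.sha256(code.encode("utf-8")).hexdigest(),
        "params": params,
        "software": software,
        "python": platform.python_version(),
    }


record = provenance_record(
    code="some-tool {input} > {output}",
    params={"threshold": 0.05},     # hypothetical parameter
    software={"some-tool": "1.0"},  # hypothetical tool version
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Any change to the code, a parameter, or a tool version changes the record, so a reader can tell whether two results were produced the same way.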
scientific software
The crisis of scientific software
- incomplete documentation
- inefficient programming
- incomplete testing
- little to no maintenance
Example
ISMB 2016 "Wall of Shame"
Of 47 open-access publications, ...
Reasons
- temporary employment
- time pressure
- code and software quality are not widely recognized as scientific output
- PIs occupied with other duties
But
Many resources are available, we just have to use them!
Style guides
def make_complex(*args):
    x, y = args
    return dict(**locals())
vs
def make_complex(x, y):
    return {'x': x, 'y': y}
Automatic code analysis tools:
- linter: e.g. Ruff (https://github.com/charliermarsh/ruff)
- auto-formatter: e.g. Black (https://github.com/psf/black)
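To give a feel for what a linter does, the sketch below walks the syntax tree of the first `make_complex` variant and reports two patterns that obscure a function's interface. It uses Python's standard `ast` module. The function `find_issues` and its two checks are illustrative assumptions and do not correspond to actual Ruff rules or to how Ruff works internally.

```python
import ast

SOURCE = '''
def make_complex(*args):
    x, y = args
    return dict(**locals())
'''


def find_issues(source):
    """Return (line, message) pairs for *args signatures and locals() calls."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        # Functions that hide their parameters behind *args
        if isinstance(node, ast.FunctionDef) and node.args.vararg is not None:
            issues.append((node.lineno, f"'{node.name}' takes *{node.args.vararg.arg}"))
        # Calls to locals(), which make the returned keys implicit
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "locals"
        ):
            issues.append((node.lineno, "call to locals()"))
    return sorted(issues)


for line, message in find_issues(SOURCE):
    print(f"line {line}: {message}")
```

The explicit variant `def make_complex(x, y)` passes both checks.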
Code duplication and complexity
Instead:
- inheritance
- delegation
- data classes
- singletons
Automatic code analysis tools:
e.g. Lizard (https://github.com/terryyin/lizard)
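Two of the techniques listed above, data classes and delegation, can be sketched in a few lines of Python. The class names `Interval` and `Gene` and the half-open interval convention are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Data class: __init__, __repr__ and __eq__ are generated, not hand-written."""

    start: int
    end: int  # exclusive (half-open interval, an assumption of this sketch)

    def __len__(self):
        return self.end - self.start

    def overlaps(self, other):
        return self.start < other.end and other.start < self.end


@dataclass(frozen=True)
class Gene:
    name: str
    location: Interval

    # Delegation: reuse Interval's logic instead of copying it into Gene.
    def overlaps(self, other):
        return self.location.overlaps(other.location)

    def __len__(self):
        return len(self.location)


a = Gene("a", Interval(0, 100))
b = Gene("b", Interval(50, 150))
c = Gene("c", Interval(100, 200))
print(a.overlaps(b), a.overlaps(c), len(a))
```

The overlap logic lives in exactly one place, so a fix to `Interval.overlaps` reaches every class that delegates to it.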
Reinventing wheels
data = dict()
with open("sometable.tsv") as f:
    for line in f:
        line = line.split("\t")
        key, fields = line[0], line[1:]
        data[key] = [float(field) for field in fields]
vs
import pandas as pd
data = pd.read_table("sometable.tsv", index_col=0)
The choice of the programming language
Goal:
efficient, error-free, readable code
Classical compiled languages (C, C++, ...):
- very efficient
- complex and error prone (memory management, thread safety, ...)
Scripting languages (Python, Ruby, ...):
- slow
- no type safety
- errors surface only at runtime
[Figure: programming language benchmark, https://attractivechaos.github.io/plb/]
Rust
- simplified, C++-like syntax
- automatic type inference
- elements from scripting languages
- very strict compiler with guaranteed memory and thread safety
- voted "most loved programming language" on Stack Overflow for 8 years running
Result:
- Development time shifts from debugging to compiling.
- massively reduced maintenance effort
Continuous integration
GitHub Actions
jobs:
  Formatting:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          # Full git history is needed to get a proper
          # list of changed files within `super-linter`
          fetch-depth: 0
      - name: Formatting
        uses: github/super-linter@v4
        env:
          VALIDATE_ALL_CODEBASE: false
          DEFAULT_BRANCH: master
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          VALIDATE_SNAKEMAKE_SNAKEFMT: true
  Linting:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Lint workflow
        uses: snakemake/snakemake-github-action@v1.24.0
        with:
          directory: .
          snakefile: workflow/Snakefile
          args: "--lint"
  Testing:
    runs-on: ubuntu-latest
    needs:
      - Linting
      - Formatting
    steps:
      - uses: actions/checkout@v2
      - name: Test workflow (local FASTQs)
        uses: snakemake/snakemake-github-action@v1
        with:
          directory: .test
          snakefile: workflow/Snakefile
          args: "--configfile .test/config-simple/config.yaml --use-conda"
Conclusion
reproducibility
+ transparency
+ adaptability
= sustained value for authors and community
a long way to go
but the tools are there
It has never been easier!
Keynote at Dagstuhl 2023