SPHN Workflow Interoperability Workshop
Johannes Köster
2018
https://koesterlab.github.io
[Figure: dataset → results]
"Let me do that by hand..."
[Figure: many datasets → results]
automation
From raw data to final figures:
scalability
Handle parallelization:
portability
Handle deployment:
be able to easily execute analyses on a different system/platform/infrastructure
# {dataset} is a wildcard; its value is inferred from the requested output file
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"

rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"

# inputs must match the output pattern of myfiltration ("result/...")
rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    shell:
        "some-tool {input} > {output}"
rule name
how to create output from input
define
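How a rule's output pattern is matched against a requested file can be sketched in a few lines of Python. This is only an illustration of the idea, not Snakemake's implementation: the helpers `pattern_to_regex` and `infer_wildcards` are hypothetical names, and treating a wildcard as "any non-empty string" is a simplifying assumption.

```python
import re


def pattern_to_regex(pattern):
    """Turn a pattern such as 'result/{sample}.txt' into a compiled regex."""
    # re.split with a capture group alternates literal text and wildcard names
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)          # literal path fragment
        else:
            regex += f"(?P<{part}>.+)"        # assumption: wildcard = any non-empty string
    return re.compile(regex + "$")


def infer_wildcards(output_pattern, target):
    """Return the wildcard values that make output_pattern equal target, or None."""
    match = pattern_to_regex(output_pattern).match(target)
    return match.groupdict() if match else None


# Requesting result/a.txt fixes the wildcard, which in turn fixes the input file.
wildcards = infer_wildcards("result/{sample}.txt", "result/a.txt")
required_input = "data/{sample}.txt".format(**wildcards)
print(wildcards)       # {'sample': 'a'}
print(required_input)  # data/a.txt
```

The same substitution is what lets one rule serve any number of samples: only the requested target changes, never the rule.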
rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    script:
        "scripts/myscript.py"
reusable Python/R scripts
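Inside a Python script run through the script directive, Snakemake provides an object named `snakemake` that carries the rule's input, output and wildcards. The sketch below shows what the body of such a script might look like. The filtering logic (dropping empty lines) and the function name `run` are made up for illustration, and a `SimpleNamespace` stands in for the real object so that the example runs on its own.

```python
import os
import tempfile
from types import SimpleNamespace


def run(snakemake):
    """Hypothetical body of scripts/myscript.py: copy all non-empty lines."""
    with open(snakemake.input[0]) as src, open(snakemake.output[0], "w") as dst:
        for line in src:
            if line.strip():
                dst.write(line)


# Stand-in for the object Snakemake would provide (assumption for this demo).
workdir = tempfile.mkdtemp()
infile = os.path.join(workdir, "sample1.txt")
outfile = os.path.join(workdir, "sample1.result.txt")
with open(infile, "w") as f:
    f.write("a\n\nb\n")

fake = SimpleNamespace(
    input=[infile],
    output=[outfile],
    wildcards=SimpleNamespace(sample="sample1"),
)
run(fake)

with open(outfile) as f:
    result = f.read()
print(repr(result))  # 'a\nb\n'
```

In an actual workflow the script would use the `snakemake` object directly instead of taking it as a function argument.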
rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    wrapper:
        "0.22.0/bio/samtools/sort"
reusable wrappers from central repository
use CWL tool definitions
rule map_reads:
    input:
        "{sample}.bam"
    output:
        "{sample}.sorted.bam"
    cwl:
        "https://github.com/common-workflow-language/"
        "workflows/blob/fb406c95/tools/samtools-sort.cwl"
Paradigm:
Workflow definition shall be independent of computing platform and available resources
Rules:
define resource usage (threads, memory, ...)
Scheduler:
workstation
compute server
cluster
grid computing
cloud computing
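The split between resource declarations in the rules and the platform chosen at execution time can be illustrated with a small Python sketch. It is a deliberately simplified greedy stand-in, not Snakemake's actual scheduler; the job names, thread counts and the function `select_jobs` are hypothetical.

```python
def select_jobs(ready_jobs, cores):
    """Greedily pick jobs whose thread demands fit into the available cores.

    ready_jobs: list of (name, threads) tuples as declared by the rules.
    cores: number of cores the platform offers (e.g. given via --cores).
    """
    selected = []
    free = cores
    # Consider the most demanding jobs first.
    for name, threads in sorted(ready_jobs, key=lambda job: -job[1]):
        # A job never gets more threads than the platform offers in total.
        used = min(threads, cores)
        if used <= free:
            selected.append((name, used))
            free -= used
    return selected


jobs = [("map_a", 8), ("map_b", 8), ("sort_a", 4), ("plot", 1)]
print(select_jobs(jobs, cores=16))  # [('map_a', 8), ('map_b', 8)]
print(select_jobs(jobs, cores=4))   # [('map_a', 4)]
```

The job list stays the same in both calls; only the `cores` value changes, which mirrors the paradigm that the workflow definition does not depend on the platform.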
# execute workflow locally with 16 CPU cores
snakemake --cores 16
# execute on cluster
snakemake --cluster qsub --jobs 100
# execute in the cloud
snakemake --kubernetes --jobs 1000 --default-remote-provider GS --default-remote-prefix mybucket
# execute using a profile (here named slurm)
snakemake --profile slurm --jobs 1000
$HOME/.config/snakemake/slurm
├── config.yaml
├── slurm-jobscript.sh
├── slurm-status.py
└── slurm-submit.py
Full reproducibility:
install required software and all dependencies in exact versions
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
easy_install snakemake
./configure --prefix=/usr/local
make
make install
cp lib/amd64/jli/*.so lib
cp lib/amd64/*.so lib
cp * $PREFIX
cpan -i bioperl
cmake ../../my_project \
-DCMAKE_MODULE_PATH=~/devel/seqan/util/cmake \
-DSEQAN_INCLUDE_PATH=~/devel/seqan/include
make
make install
apt-get install bwa
yum install python-h5py
install.packages("matrixpls")
package:
  name: seqtk
  version: 1.2
source:
  fn: v1.2.tar.gz
  url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz
requirements:
  build:
    - gcc
    - zlib
  run:
    - zlib
about:
  home: https://github.com/lh3/seqtk
  license: MIT License
  summary: Seqtk is a fast and lightweight tool for processing sequences
test:
  commands:
    - seqtk seq
Idea:
Normalize installation via recipes
#!/bin/bash
export C_INCLUDE_PATH=${PREFIX}/include
export LIBRARY_PATH=${PREFIX}/lib
make all
mkdir -p $PREFIX/bin
cp seqtk $PREFIX/bin
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/mycommand.yaml"
    shell:
        "mycommand {input} > {output}"
channels:
  - conda-forge
  - defaults
dependencies:
  - mycommand ==2.3.1
Over 3000 bioinformatics-related packages
Over 200 contributors
rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    singularity:
        "docker://biocontainers/mycommand:2.3.1"
    shell:
        "mycommand {input} > {output}"
# define OS
singularity:
    "docker://continuumio/miniconda3:4.4.1"

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    # define tools/libs
    conda:
        "envs/mycommand.yaml"
    shell:
        "mycommand {input} > {output}"
Author:
# archive workflow (including Conda packages)
snakemake --archive myworkflow.tar.gz
Reader:
# execute workflow (Conda packages are deployed automatically)
snakemake --use-conda --cores 16
With
Snakemake covers all three dimensions of fully reproducible data analysis: automation, scalability, and portability.