Reproducibility with Snakemake and Bioconda

Johannes Köster

https://koesterlab.github.io

Data analysis

[Figure: dataset → results]

"Let me do that by hand..."

Reproducible data analysis

[Figure: datasets → results, framed by automation, scalability, portability]

automation

From raw data to final figures:

document parameters, tools, versions
execute without manual intervention

scalability

Handle parallelization:

execute for tens to thousands of datasets
efficiently use any computing platform

portability

Handle deployment:

be able to easily execute analyses on a different machine

Genome of the Netherlands:

GoNL consortium. Nature Genetics 2014.

Cancer:

Townsend et al. Cancer Cell 2016.

Schramm et al. Nature Genetics 2015.

Martin et al. Nature Genetics 2013.

Ebola:

Park et al. Cell 2015.

iPSC:

Burrows et al. PLOS Genetics 2016.

Computational methods:

Ziller et al. Nature Methods 2015.

Schmied et al. Bioinformatics 2015.

Břinda et al. Bioinformatics 2015.

Chang et al. Molecular Cell 2014.

Marschall et al. Bioinformatics 2012.

Define workflows

in terms of rules

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    script:
        "scripts/myscript.R"


rule myfiltration:
    input:
        "result/{dataset}.txt"
    output:
        "result/{dataset}.filtered.txt"
    shell:
        "mycommand {input} > {output}"


rule aggregate:
    input:
        "result/dataset1.filtered.txt",
        "result/dataset2.filtered.txt"
    output:
        "plots/myplot.pdf"
    script:
        "scripts/myplot.R"

Define workflows

in terms of rules

rule mytask:
    input:
        "data/{sample}.txt"
    output:
        "result/{sample}.txt"
    conda:
        "software-envs/some-tool.yaml"
    shell:
        "some-tool {input} > {output}"

rule mytask: rule name
{input}, {output}: refer to input and output from the shell command
shell: how to create output from input (shell, Python, or R)

Directed acyclic graph (DAG) of jobs
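
How a target file turns into a DAG of jobs can be sketched in a few lines of Python. This is an illustration only, not Snakemake's implementation: the rule table is a hypothetical stand-in for the example rules (with the aggregate inputs written under result/ so that they match the filtration output), and wildcards are assumed to contain neither "/" nor ".", which keeps the match of each file to a rule unambiguous.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rule table: rule name -> (input patterns, output pattern).
RULES = {
    "mytask": (["path/to/{dataset}.txt"], "result/{dataset}.txt"),
    "myfiltration": (["result/{dataset}.txt"], "result/{dataset}.filtered.txt"),
    "aggregate": (
        ["result/dataset1.filtered.txt", "result/dataset2.filtered.txt"],
        "plots/myplot.pdf",
    ),
}


def output_regex(pattern):
    """Turn 'result/{dataset}.txt' into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)
    pieces = [
        # Even positions are literal text, odd positions are wildcard names.
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>[^/.]+)"
        for i, part in enumerate(parts)
    ]
    return re.compile("".join(pieces))


def build_dag(target, dag=None):
    """Map each file to the set of files it depends on, starting from target."""
    dag = {} if dag is None else dag
    if target in dag:
        return dag
    for inputs, output in RULES.values():
        match = output_regex(output).fullmatch(target)
        if match:
            # Fill the wildcard values found in the output into the inputs.
            deps = [inp.format(**match.groupdict()) for inp in inputs]
            dag[target] = set(deps)
            for dep in deps:
                build_dag(dep, dag)
            return dag
    dag[target] = set()  # no rule produces this file: treat it as raw data
    return dag


dag = build_dag("plots/myplot.pdf")
# Dependencies come first, the requested target comes last.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Requesting only the final plot is enough: every intermediate file is inferred by matching it against rule outputs.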

Scheduling

Paradigm:

Workflow definition shall be independent of computing platform and available resources

Rules:

define resource usage (threads, memory, ...)

Scheduler:

solves multidimensional knapsack problem
schedules independent jobs in parallel
passes resource requirements to any backend
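
The knapsack view of scheduling can be illustrated with a toy example. This is a sketch under stated assumptions, not Snakemake's scheduler: the job names, resource figures and objective (run as many jobs as possible, break ties by threads used) are hypothetical, and exhaustive enumeration is only practical for a handful of jobs.

```python
from itertools import combinations

# Hypothetical jobs that are ready to run: name -> demand per resource.
JOBS = {
    "align_a": {"threads": 4, "mem_gb": 8},
    "align_b": {"threads": 4, "mem_gb": 8},
    "sort_a": {"threads": 2, "mem_gb": 12},
    "plot": {"threads": 1, "mem_gb": 1},
}
AVAILABLE = {"threads": 8, "mem_gb": 16}


def fits(subset, jobs, available):
    """True if the summed demand stays within the limit in every dimension."""
    return all(
        sum(jobs[name][res] for name in subset) <= limit
        for res, limit in available.items()
    )


def select_jobs(jobs, available):
    """Try every subset and keep the feasible one with the best score."""
    best, best_score = (), (0, 0)
    for size in range(1, len(jobs) + 1):
        for subset in combinations(sorted(jobs), size):
            if not fits(subset, jobs, available):
                continue
            # Assumed objective: number of jobs first, then threads in use.
            score = (len(subset), sum(jobs[name]["threads"] for name in subset))
            if score > best_score:
                best, best_score = subset, score
    return best


print(select_jobs(JOBS, AVAILABLE))
```

The selected jobs are independent of each other and can start in parallel; the same rule-level resource declarations could be handed to a cluster or cloud backend instead of being checked locally.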

Scalable to any platform

workstation

compute server

cluster

grid computing

cloud computing

portability

Full reproducibility:

install required software and all dependencies in exact versions

Software installation is a pain

source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")

easy_install snakemake

./configure --prefix=/usr/local
make
make install

cp lib/amd64/jli/*.so lib
cp lib/amd64/*.so lib
cp * $PREFIX

cpan -i bioperl

cmake ../../my_project \
    -DCMAKE_MODULE_PATH=~/devel/seqan/util/cmake \
    -DSEQAN_INCLUDE_PATH=~/devel/seqan/include
make
make install

apt-get install bwa

yum install python-h5py

install.packages("matrixpls")

Package management with Conda

package:
  name: seqtk
  version: 1.2

source:
  fn: v1.2.tar.gz
  url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz

requirements:
  build:
    - gcc
    - zlib
  run:
    - zlib

about:
  home: https://github.com/lh3/seqtk
  license: MIT License
  summary: Seqtk is a fast and lightweight tool for processing sequences

test:
  commands:
    - seqtk seq

Idea:

Normalize installation via recipes

#!/bin/bash

export C_INCLUDE_PATH=${PREFIX}/include
export LIBRARY_PATH=${PREFIX}/lib

make all
mkdir -p $PREFIX/bin
cp seqtk $PREFIX/bin

source or binary + recipe and build script → package

Easy installation and management:

no admin rights needed

conda install pandas

conda update pandas

conda remove pandas

Isolated environments:

conda env create -f myenv.yaml -n myenv

channels:
  - conda-forge
  - defaults
dependencies:
  - pandas ==0.20.3
  - statsmodels ==0.8.0
  - r-dplyr ==0.7.0
  - r-base ==3.4.1
  - python ==3.6.0
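
Whether every dependency is pinned to an exact version can be checked mechanically. The following is a minimal sketch: the list repeats the specs of the environment above and adds one deliberately unpinned entry (hypothetical) for contrast, and the regular expression is an assumption that only covers plain `name ==x.y.z` specs.

```python
import re

# Specs from the environment above, plus one hypothetical unpinned entry.
DEPENDENCIES = [
    "pandas ==0.20.3",
    "statsmodels ==0.8.0",
    "r-dplyr ==0.7.0",
    "r-base ==3.4.1",
    "python ==3.6.0",
    "numpy >=1.13",
]

# A spec counts as pinned if it reads "<name> ==<dotted version>".
PINNED = re.compile(r"(?P<name>[\w.-]+)\s*==\s*(?P<version>\d+(\.\d+)*)")


def unpinned(specs):
    """Return the specs that do not request one exact version."""
    return [spec for spec in specs if not PINNED.fullmatch(spec)]


print(unpinned(DEPENDENCIES))
```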

Integration with Snakemake

rule mytask:
    input:
        "path/to/{dataset}.txt"
    output:
        "result/{dataset}.txt"
    conda:
        "envs/mycommand.yaml"
    shell:
        "mycommand {input} > {output}"

envs/mycommand.yaml:

channels:
  - conda-forge
  - defaults
dependencies:
  - mycommand ==2.3.1

Bioconda:

over 3000 bioinformatics-related packages
over 200 contributors

Bioconda workflow

recipe
pull request
automatic linting
building
testing
human review
merge
upload

Builds and tests:

Paradigm:

transparency
open source build framework
public logs

Conclusion

For reproducible data analysis, three dimensions have to be considered: automation, scalability, and portability.
A lightweight yet flexible way to cover all three is to combine Snakemake with Bioconda/Conda.
