Reproducible data analysis with Snakemake

Johannes Köster

University of Duisburg-Essen
https://koesterlab.github.io

The 3 dimensions of reproducibility

[Figure: many datasets are transformed into results; the three dimensions of reproducibility are automation/documentation, scalability, and portability.]

Automation/documentation:

Document and execute all steps from raw data to final tables and figures, without manual intervention.

Scalability:

Execute the analysis for tens to thousands of datasets.

Efficiently use any computing platform.

Portability:

Easily execute the analysis on a different system, platform, or architecture.

Automation/documentation

rule estimate_spike_proportion:
    input:
        "analysis/all.sce.rds"
    output:
        report("plots/spike-proportion.svg", 
               category="Quality control",
               caption="report/spike-proportion.rst")
    script:
        "scripts/plot-spike-proportion.R"

General:

  • Decompose analysis into rules, written in a Python dialect.
  • Rules define how to obtain output files from input files.
  • Snakemake determines dependencies and execution order in the form of a directed acyclic graph (DAG) of jobs.
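
For illustration, a minimal sketch (hypothetical rule and file names, not taken from the poster's workflow): requesting plots/A.svg makes Snakemake run summarize for sample A first, because its output matches the input of plot.

# plotting depends on the summary produced by the rule below
rule plot:
    input:
        "results/{sample}.summary.tsv"
    output:
        "plots/{sample}.svg"
    script:
        "scripts/plot.py"

rule summarize:
    input:
        "data/{sample}.csv"
    output:
        "results/{sample}.summary.tsv"
    script:
        "scripts/summarize.py"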

Automatic reports:

  • Annotate output files for inclusion.
  • Define categories and (jinja-templated) captions.
  • Obtain self-contained HTML5 document including all files, workflow description, runtime statistics, and provenance information.
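
Given such annotations, the report is rendered with a single command after the workflow has run (report.html is an arbitrary output name):

# create the self-contained HTML5 report
snakemake --report report.html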

Scalability

General:

  • Independent parts of the DAG of jobs can be executed in parallel.
  • Snakemake maximizes parallelism while respecting given resources.
  • Without modification of the workflow definition, Snakemake can scale to any number of cores, compute clusters, the grid, and the cloud.
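
For example, the very same workflow definition can be run at different scales purely from the command line (a sketch; the submission command and job limits are placeholders):

# local execution on 24 cores
snakemake --cores 24

# cluster execution, submitting via qsub, at most 100 concurrent jobs
snakemake --cluster qsub --jobs 100

# execution in a Kubernetes cloud cluster
snakemake --kubernetes --jobs 100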

Job groups:

  • The DAG of jobs can be partitioned into groups.
  • Minimizes queueing and network overhead in cloud and cluster.
rule bwa:
    input:
        "genome.fa"
        "reads/{sample}.fastq"
    output:
        "mapped/{sample}.bam"
    group: "mapping"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | "
        "samtools view -Sb - > {output}"

Pipe output:

  • Output files can be marked as pipes.
  • Consuming jobs will be assigned to the same group (see the consumer sketch after the example).
  • Output will not be written to disk but streamed between the jobs.
rule bwa:
    input:
        "genome.fa"
        "reads/{sample}.fastq"
    output:
        pipe("mapped/{sample}.bam")
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | "
        "samtools view -Sb - > {output}"

Portability

Software deployment with Conda:

  • Rules can be annotated with (isolated) Conda environments that define a software stack with particular versions to use.
  • Jobs are executed within these environments.
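
Such an environment is declared in a small YAML file. The envs/r-qc.yaml referenced in the example below could look roughly like this (hypothetical package choices and version pins):

channels:
  - conda-forge
  - bioconda
dependencies:
  - r-base =3.5.1
  - r-ggplot2 =3.0.0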

Software deployment with Singularity:

  • Rules/workflows can be annotated with container images.
  • Jobs are executed within the container.
  • Combination with Conda possible: use container image to define OS, use Conda to define the software stack, let Snakemake perform the composition.
singularity: "docker://continuumio/miniconda3"


rule estimate_spike_proportion:
    input:
        "analysis/all.sce.rds"
    output:
        "plots/spike-proportion.svg"
    conda:
        "envs/r-qc.yaml"
    script:
        "scripts/plot-spike-proportion.R"

https://snakemake.readthedocs.io
