Reproducible data analysis with Nextflow and nf-core

 

 Alexander Peltzer

Quantitative Biology Center (QBiC) Tübingen

 

 

 

Outline

 

  • Challenges in computational biology
  • Basic principles of Nextflow
    • Parallelization
    • HPC/Cloud Computing
  • nf-core
    • Motivation
    • Approaching reproducibility

Challenges: Big Data

 

  • Data in computational biology is
    • big (PB scale)
    • diverse (sequencing, proteomics, metabolomics ...)
    • erroneous (e.g. contains sequencing errors)

We need methods and tools to analyze such data!

The FAIR principles¹

 

Findable

Accessible

Interoperable

Reusable

 

 

DOI

qPortal²

bwHealth Cloud

VariantStores

 

 

 

 

¹ The FAIR Guiding Principles for scientific data management and stewardship, Wilkinson et al., 2016

² qPortal: A platform for data-driven biomedical research, Mohr et al., 2018

Challenges: Software dependencies

 

 

Workflows / Pipelines consist of

  • different tools
  • dozens of individual methods

 

Complex dependency trees and configuration requirements!

 

Steinbiss et al., "Companion: a web server for annotation and analysis of parasite genomes", NAR 2016

Challenges: Software dependencies

 

 


"[...] of the tools selected for our comprehensive and systematic usability test, 49% were deemed "difficult to install," and 28% of the tools failed to be installed [...]."

- Mangul et al., bioRxiv preprint, October 25, 2018

Challenges: Reproducibility

 

  • Large-scale projects are more common today
    • 1,000 Genomes Project
    • 100,000 Genomes Project (UK)
    • (EU 1,000,000 Genomes Project)
  • Reproduce results with older data / integrate with newer data

 

 

Many published results are not reproducible at all, or require a lot of effort to reproduce!

 

Challenges: Environmental stability

 

 

 

  • Code should be portable and stable across operating systems
  • Do results differ between environments? Yes, they do ...

 

Challenges: Software dependencies

 

 

Di Tommaso et al., 2017, Nature Biotechnology

Nextflow

 

  • Custom DSL (domain-specific language) for
    • fast prototyping
    • enabling task composition
    • easy parallelization
  • Self-contained: Containerize tasks (e.g. with Docker)
  • Isolation of dependencies: keep the container and rerun the analysis at any point!

(credit to E Floden, CRG Barcelona)
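
As a rough sketch of the containerization points above, a `nextflow.config` could enable Docker and pin one image for all tasks. The image name below is only a placeholder and does not come from the talk; `docker.enabled` and `process.container` are standard Nextflow configuration settings.

```groovy
// nextflow.config (illustrative sketch, not taken from the talk)

// Run every task inside a Docker container
docker.enabled = true

// Placeholder image (assumption): replace with a pinned image
// that contains the tools your pipeline needs
process.container = 'ubuntu:22.04'
```

Keeping the image tag fixed, and archiving the image itself, is what would allow the same analysis to be rerun later with an unchanged software stack.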

Nextflow: Centralised Orchestration

 

  • Submit jobs to cluster nodes
  • Store data on shared storage

[Diagram: Nextflow, cluster, storage, platform support]

(credit to E Floden, CRG Barcelona)

Nextflow: Executor abstraction

 

=> Improves code portability

// Run script locally
process.executor = 'local'

// Run script on PBS/Torque
process.executor = 'pbs'

// Run script on Kubernetes cluster
process.executor = 'k8s'

// Run script on AWS Batch
process.executor = 'awsbatch'
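
One way to keep such settings side by side is the `profiles` block in `nextflow.config`, sketched below. The profile names, the queue names and the AWS region are assumptions chosen for illustration; only the executor names come from the slide.

```groovy
// nextflow.config (illustrative sketch): one profile per execution environment
profiles {
    // Used when no -profile option is given
    standard {
        process.executor = 'local'
    }

    // PBS/Torque cluster
    cluster {
        process.executor = 'pbs'
        process.queue    = 'short'            // hypothetical queue name
    }

    // AWS Batch
    cloud {
        process.executor = 'awsbatch'
        process.queue    = 'my-batch-queue'   // hypothetical AWS Batch job queue
        aws.region       = 'eu-central-1'     // assumed region
    }
}
```

A profile would then be selected at launch time, for example with `nextflow run main.nf -profile cluster` (where `main.nf` stands for the pipeline script), so the pipeline code itself stays unchanged. Running on AWS Batch additionally needs a work directory on S3, which is omitted here.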

nf-core

  • Community effort to collect production-ready analysis pipelines
  • Saves development time; pipelines get more testing and more updates
  • Initially supported by SciLifeLab, QBiC and A*STAR Genome Institute of Singapore
  • https://nf-co.re

Phil Ewels, Alex Peltzer, Sven Fillinger, Andreas Wilm, Maxime Garcia, Tiffany Delhomme + many others!

 

All pipelines adhere to common requirements

  • Nextflow-based
  • MIT license
  • Software bundled in Docker / Singularity
  • Continuous integration testing (e.g. Travis CI)
  • Stable release tags
  • Common pipeline usage and structure
  • Software bundled in Bioconda

Need help?

 

  • nf-core tools: generate a skeleton for new pipelines
    • Synchronizes best practices across pipelines!
    • Linting: checks whether a pipeline conforms to the nf-core guidelines
  • Gitter: chat with the community!

 

  • 8 stable pipelines
  • 4 more expected in < 3 weeks
  • 13 in development

Comes with interactive reports!

Comes with proper documentation!

... and a lot more!

Acknowledgements

Phil Ewels (SciLifeLab, Stockholm)

Maxime Garcia (SciLifeLab, Stockholm)

Sven Fillinger (QBiC/Tü)

Harshil Patel (The Francis Crick Institute, London)

Paolo Di Tommaso (CRG, Barcelona)

Evan Floden (CRG, Barcelona)

 

 

nf-core team

2019-02-25_nfcore-freiburg

By Alexander Peltzer
