Scalable, reproducible bioinformatics workflows using Nextflow & nf-core

Dr. Alexander Peltzer

 

January 14, 2020

Data in computational biology is ...

 

  • big (TB / PB scale)
  • diverse (e.g. proteomics, sequencing, ...)
  • error-prone (e.g. contains measurement errors)

We need methods and tools to analyze such data!

Challenges: Software dependencies

Workflows / Pipelines consist of

  • different tools
  • dozens of individual methods

 

Complex dependency trees and configuration requirements!

Steinbiss et al., "Companion: a web server for annotation and analysis of parasite genomes", NAR 2016

Challenges: Reproducibility

Many published results are hard to reproduce!

 

Credit: https://twitter.com/gigascience/status/1029155083731185664

Dataflow Model

 

  • Custom DSL (domain-specific language) for
    • fast prototyping
    • enabling task composition
    • easy parallelization
  • Self-contained: tasks are containerized (e.g. with Docker)
  • Isolation of dependencies: keep the container and rerun the analysis at any point!
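
To make the bullets above concrete, here is a minimal sketch of a containerized Nextflow task. It is written in DSL2 syntax, which is an assumption (the slides do not show pipeline code). The process name `COUNT_LINES`, the input glob, and the container image are all hypothetical, and the container is only used when Docker is enabled (e.g. via `-with-docker`).

```groovy
// Hypothetical input glob; override with --input on the command line.
params.input = 'data/*.txt'

// One task is spawned per input file, so the files are processed in parallel.
process COUNT_LINES {
    container 'ubuntu:22.04'   // assumption: any image that provides `wc`

    input:
    path sample

    output:
    path "${sample.baseName}.count"

    script:
    """
    wc -l < ${sample} > ${sample.baseName}.count
    """
}

workflow {
    // The channel emits one item per matching file.
    files = Channel.fromPath(params.input)
    COUNT_LINES(files)
    // Print the path of each result file as it is produced.
    COUNT_LINES.out.view()
}
```

Because each task declares its inputs, outputs, and container, the same task can be rerun later in the same software environment.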

Executor abstraction

 

=> Improves code portability

// Run script locally
process.executor = 'local'

// Run script on PBS/Torque
process.executor = 'pbs'

// Run script on a Kubernetes cluster
process.executor = 'k8s'

// Run script on AWS Batch
process.executor = 'awsbatch'

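
In practice only one executor is active per run. One way to keep several settings side by side is to wrap them in configuration profiles, as in the sketch below. The profile names, queue names, and region are hypothetical placeholders. A real AWS Batch setup needs further settings, such as an S3 work directory.

```groovy
// nextflow.config (sketch): profile names and queue values are hypothetical
profiles {
    local_run {
        process.executor = 'local'
    }
    cluster {
        process.executor = 'pbs'
        process.queue    = 'long'            // hypothetical PBS queue name
    }
    cloud {
        process.executor = 'awsbatch'
        process.queue    = 'my-batch-queue'  // hypothetical AWS Batch queue
        aws.region       = 'eu-west-1'       // assumed region
    }
}
```

A profile is then selected at launch time, for example `nextflow run main.nf -profile cluster`, without touching the pipeline script itself.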

nf-core

  • Community effort to collect production-ready analysis pipelines
  • Saves development time and brings more testing and more updates
  • https://nf-co.re

 

Phil Ewels

Alex Peltzer

Sven Fillinger

Maxime Garcia

Harshil Patel

Olga Botvinnik

+ many others!

20+ institutions, others joining!

All pipelines adhere to requirements

  • Nextflow based
  • MIT license (can be used even in commercial settings)
  • Software bundled in Docker / Singularity
  • Continuous integration testing (e.g. GitHub Actions)
  • Stable release tags
  • Common pipeline usage and structure
  • Software packaged with Bioconda

  # Lint the pipeline code
  - nf-core lint ${TRAVIS_BUILD_DIR}
  # Lint the documentation
  - markdownlint ${TRAVIS_BUILD_DIR} -c ${TRAVIS_BUILD_DIR}/.github/markdownlint.yml
  # Run, build reference genome with STAR
  - nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker
  # Run, build reference genome with HISAT2
  - nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker --aligner hisat2

Comes with interactive reports / documentation

  • 21 stable pipelines
  • 15 pipelines in development

So what do I get?

Recap: What (almost) broke my neck?

 

  • Changing use cases (a PhD lasts 3+ years)
  • Maintenance, Maintenance, Maintenance!
  • Conda? Containers? Schedulers? HPC? Cloud(s)?


So what do I get?

True reproducibility!

nextflow run nf-core/eager -r 2.0.7 ...

VS

So what do I get?

Automated Tests!

So what do I get?

An entire ecosystem!

Do's and Don'ts

  • Start small, most important things first
    • Dependencies, Containers
    • Minimum working example
    • Tests, Tests, and even more Tests!
  • Follow community guidelines

It's never too late!

What's next for nf-core?

  • Biocontainers integration
  • Automated Cloud Tests (Price estimates?)
  • Automated full-size testing
  • nf-core/modules (Nextflow DSLv2)

Acknowledgements

Phil Ewels (SciLifeLab, Stockholm)

Maxime Garcia (SciLifeLab, Stockholm)

Harshil Patel (The Francis Crick Institute, London)

Sven Fillinger (QBiC, Tübingen)

Johannes Alneberg (SciLifeLab, Stockholm)

Olga Botvinnik (CZ Biohub, San Francisco)

 

Paolo Di Tommaso (CRG, Barcelona)

Evan Floden (CRG, Barcelona)

and all contributors!

Preprint: https://www.biorxiv.org/content/10.1101/610741v3

Paper accepted January 9 - out soon!

EuBic2020 nf-core
