Community built bioinformatics pipelines

Alexander Peltzer

https://nf-co.re

Challenges: Big Data

  • Data in the computational sciences (biology, physics, chemistry, ...) is
    • big (PB scale)
    • diverse (e.g. sequencing, proteomics, ...)
    • erroneous (e.g. contains sequencing errors)

We need methods and tools to analyze such data!

Challenges: Software dependencies

Workflows / Pipelines consist of

  • different tools
  • dozens of individual methods

Complex dependency trees and configuration requirements!

Steinbiss et al., "Companion: a web server for annotation and analysis of parasite genomes", NAR 2016

Challenges: Software dependencies

"[...] of the tools selected for our comprehensive and systematic usability test, 51% were deemed 'difficult to install,' and 28% of the tools failed to be installed [...]."

Mangul et al., PLOS Biology, June 20, 2019

Challenges: Reproducibility

Many published results are hard to reproduce!

Credit: https://twitter.com/gigascience/status/1029155083731185664

Nextflow

  • Custom DSL (domain-specific language) for
    • fast prototyping
    • enabling task composition
    • easy parallelization
  • Self-contained: tasks are containerized (e.g. with Docker)
  • Isolation of dependencies: keep the container and rerun the analysis at any point!

Dataflow Model

[Figure: dataflow example with BWA and Samtools processes]
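
The dataflow idea can also be illustrated outside Nextflow. The Python sketch below is only an analogy, assuming that the BWA step feeds the Samtools step: `align` and `sort_alignment` are hypothetical stand-ins that pass file names through generators, much as processes pass items through channels. It runs no real tools and is not how Nextflow is implemented.

```python
from typing import Iterable, Iterator, List


def align(reads: Iterable[str]) -> Iterator[str]:
    """Hypothetical stand-in for a BWA alignment process."""
    for fastq in reads:
        # Emit one output item per input item, as soon as it is ready.
        yield fastq.replace(".fastq", ".sam")


def sort_alignment(alignments: Iterable[str]) -> Iterator[str]:
    """Hypothetical stand-in for a Samtools sorting process."""
    for sam in alignments:
        yield sam.replace(".sam", ".sorted.bam")


def run_pipeline(samples: Iterable[str]) -> List[str]:
    # The two steps are connected only through the data they exchange.
    return list(sort_alignment(align(samples)))


if __name__ == "__main__":
    print(run_pipeline(["sampleA.fastq", "sampleB.fastq"]))
```

Neither function knows about the other; each only consumes what arrives on its input, which is what makes steps easy to recombine and parallelize.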

Nextflow: Executor abstraction

=> Improves code portability

// Set one of the following in nextflow.config

// Run script locally
process.executor = 'local'

// Run script on PBS/Torque
process.executor = 'pbs'

// Run script on a Kubernetes cluster
process.executor = 'k8s'

// Run script on AWS Batch
process.executor = 'awsbatch'

// Run script on Google Pipelines
process.executor = 'google-pipelines'

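
Why a single setting is enough to change where tasks run can be sketched in Python. This is an illustration of the abstraction only, not Nextflow's code: the `Task` class, the `EXECUTORS` registry and `build_command` are hypothetical names, and only two backends are shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Task:
    name: str
    script: str  # path to the script file that holds the task's commands


def local_command(task: Task) -> List[str]:
    # Run the script directly on the current machine.
    return ["bash", task.script]


def pbs_command(task: Task) -> List[str]:
    # Hand the same script to the PBS/Torque scheduler with qsub.
    return ["qsub", "-N", task.name, task.script]


# Hypothetical registry keyed by the value of the executor setting.
EXECUTORS: Dict[str, Callable[[Task], List[str]]] = {
    "local": local_command,
    "pbs": pbs_command,
}


def build_command(task: Task, executor: str) -> List[str]:
    """Return the submission command for `task` on the chosen backend."""
    try:
        return EXECUTORS[executor](task)
    except KeyError:
        raise ValueError(f"unknown executor: {executor}") from None
```

The task definition stays untouched when the backend changes, which is the portability gain the slide refers to.
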
  • Community effort to collect production-ready analysis pipelines
  • Saves development time and brings more testing and more updates
  • https://nf-co.re

Phil Ewels

Alex Peltzer

Sven Fillinger

Maxime Garcia

Harshil Patel

Andreas Wilm

+ many others!

20+ institutions, others joining!

All pipelines adhere to requirements

  • Nextflow-based
  • MIT license
  • Software bundled in Docker / Singularity
  • Continuous integration testing (e.g. Travis CI)
  • Stable release tags
  • Common pipeline usage and structure
  • Software bundled in Bioconda

# Excerpt from the Travis CI configuration
script:
  # Lint the pipeline code
  - nf-core lint ${TRAVIS_BUILD_DIR}
  # Lint the documentation
  - markdownlint ${TRAVIS_BUILD_DIR} -c ${TRAVIS_BUILD_DIR}/.github/markdownlint.yml
  # Run, build reference genome with STAR
  - nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker
  # Run, build reference genome with HISAT2
  - nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker --aligner hisat2


Snakemake

rule plot:
    input:
        "raw/{dataset}.csv"
    output:
        "plots/{dataset}.pdf"
    shell:
        "somecommand {input} {output}"

Nextflow

process plot {
    input:
    file dataset from ch_plots

    output:
    file "plot_me.pdf" into ch_downstream

    script:
    """
    somecommand $dataset plot_me.pdf
    """
}

Easy to learn, easy to migrate existing code, and the dataflow model is very flexible
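
The Snakemake rule and the Nextflow process shown here are triggered differently: the rule is resolved backwards from a requested output file name, while the process fires for whatever arrives on its input channel. The Python sketch below illustrates only the Snakemake-style backward resolution for the `plot` rule; `input_for_target` is a hypothetical helper written for this illustration and is not Snakemake's implementation.

```python
import re
from typing import Optional

# Patterns copied from the Snakemake rule "plot".
OUTPUT_PATTERN = "plots/{dataset}.pdf"
INPUT_PATTERN = "raw/{dataset}.csv"


def input_for_target(target: str) -> Optional[str]:
    """Return the input path the rule would need to produce `target`."""
    # Turn "plots/{dataset}.pdf" into a regex with a named group.
    regex = re.escape(OUTPUT_PATTERN).replace(r"\{dataset\}", r"(?P<dataset>.+)")
    match = re.fullmatch(regex, target)
    if match is None:
        return None  # this rule cannot produce the requested file
    return INPUT_PATTERN.format(dataset=match.group("dataset"))


if __name__ == "__main__":
    print(input_for_target("plots/run1.pdf"))
```

In the dataflow style no such lookup is needed, because the file names are carried by the items on the channel rather than encoded in patterns.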

  • 15 stable pipelines
  • 18 pipelines in development

// Profile config names for nf-core/configs
params {
  config_profile_description = 'BINAC cluster profile provided by nf-core/configs.'
  config_profile_contact = 'Alexander Peltzer (@apeltzer)'
  config_profile_url = 'https://www.bwhpc-c5.de/wiki/index.php/Category:BwForCluster_BinAC'
}

singularity {
  enabled = true
}

process {
  beforeScript = 'module load devel/singularity/3.0.3'
  executor = 'pbs'
  queue = 'short'
}

params {
  igenomes_base = '/nfsmounts/igenomes'
  max_memory = 128.GB
  max_cpus = 28
  max_time = 48.h
}
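
The `max_memory`, `max_cpus` and `max_time` parameters act as ceilings on what a single process may request on this cluster. A minimal Python sketch of that capping idea follows; the function name `cap_request`, the dictionary keys and the plain-number units (GB, cores, hours) are assumptions made for illustration and do not reproduce the nf-core implementation.

```python
from typing import Dict

# Limits taken from the BINAC profile: 128 GB, 28 CPUs, 48 hours.
CLUSTER_LIMITS: Dict[str, int] = {"memory_gb": 128, "cpus": 28, "time_h": 48}


def cap_request(request: Dict[str, int],
                limits: Dict[str, int] = CLUSTER_LIMITS) -> Dict[str, int]:
    """Clamp each requested resource to the configured maximum."""
    # Resources without a configured limit are passed through unchanged.
    return {key: min(value, limits.get(key, value))
            for key, value in request.items()}


if __name__ == "__main__":
    print(cap_request({"memory_gb": 256, "cpus": 8, "time_h": 72}))
```

With limits kept in an institutional profile, the pipeline code itself does not need to know how large the cluster nodes are.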

nf-core/configs

Comes with interactive reports!

Comes with proper documentation!

... and a lot more!

What's next for nf-core?

  • Biocontainers integration
  • Automated Cloud Tests (Price estimates?)
  • Automated full-size testing
  • nf-core/modules (Nextflow DSL2)

Acknowledgements

Phil Ewels (SciLifeLab, Stockholm)

Maxime Garcia (SciLifeLab, Stockholm)

Harshil Patel (The Francis Crick Institute, London)

Sven Fillinger (QBiC, Tübingen)

Paolo Di Tommaso (CRG, Barcelona)

Evan Floden (CRG, Barcelona)

and all contributors!

nf-core team

Preprint: https://www.biorxiv.org/content/10.1101/610741v3

(Paper in Revision)

2019-07-26-AEBC2

By Alexander Peltzer

nf-core presentation for AEBC2