FlowCraft: A modular, extensible and flexible tool to build, monitor and report nextflow pipelines

Diogo N Silva

@ODiogoSilva

ODiogoSilva

The motivation

The game-changing combination of nextflow + containers:

  • Portability
  • Reproducibility
  • Scalability
  • Multi-scale containerization
  • Native cloud support

But substantial challenges persist:

  • Fast pace of the bioinformatics software landscape
  • Continuous need for benchmarking and comparative analyses
  • The need for agile and dynamic pipeline building
  • Removing the pain of changing the inner workings of workflows

The FlowCraft project

The premise:

Workflow based development

Component based development

Components are modular pieces of nextflow code with some basic rules:

Component A

- Input/Output

- Parameters

- Resources

Component B

- Input/Output

- Parameters

- Resources

The component

IN_adapters_{{ pid }} = Channel
    .value(params.adapters{{ param_id }})

process fastqc2 {

    tag { fastq_id }

    input:
    set fastq_id, file(fastq_pair) from {{ input_channel }}
    val ad from IN_adapters_{{ pid }}

    output:
    set fastq_id, file(fastq_pair) into {{ output_channel }}

    script:
    template "fastqc.py"
}

{{ forks }}

Nextflow template file

  • Standard nextflow code (1 or more processes, channels, etc)
  • Addition of placeholders that allow the engine to orchestrate components into a workflow
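
To make the placeholder idea concrete, here is a minimal sketch of how `{{ name }}` markers could be filled in when a component is placed in a pipeline. It uses a small regex-based substitution from the Python standard library as a stand-in; it is not FlowCraft's actual rendering engine, and the channel names and the `1_3` identifier in `context` are assumed values chosen for illustration.

```python
import re

# Trimmed-down template in the style of the slide above.
TEMPLATE = """\
IN_adapters_{{ pid }} = Channel
    .value(params.adapters{{ param_id }})

process fastqc2 {
    input:
    set fastq_id, file(fastq_pair) from {{ input_channel }}
    val ad from IN_adapters_{{ pid }}

    output:
    set fastq_id, file(fastq_pair) into {{ output_channel }}
}
"""

# Matches "{{ name }}" with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, context):
    """Replace every {{ name }} with context[name]; fail on unknown names."""
    def substitute(match):
        name = match.group(1)
        if name not in context:
            raise KeyError(f"no value for placeholder '{name}'")
        return str(context[name])
    return PLACEHOLDER.sub(substitute, template)


# Hypothetical values for a component in lane 1, position 3.
context = {
    "pid": "1_3",
    "param_id": "_1_3",
    "input_channel": "trimmomatic_out_1_2",
    "output_channel": "fastqc_out_1_3",
}
rendered = render(TEMPLATE, context)
print(rendered)
```

Because each placeholder is resolved per component instance, the same template can appear several times in one pipeline without channel or parameter names clashing.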

The component

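# Base class for components; import path assumed from the FlowCraft code base
from flowcraft.generator.process import Process


class Fastqc(Process):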
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        self.input_type = "fastq"
        self.output_type = "fastq"

        self.params = {
            "adapters": {
                "default": "'None'",
                "description":
                    "Path to adapters files, if any."
            }
        }

        self.directives = {"fastqc2": {
            "cpus": 2,
            "memory": "'4GB'",
            "container": "flowcraft/fastqc",
            "version": "0.11.7-1"
        }}

Python declarative class

Specify input/output types so components can be connected

Add any number/type of parameters

Add process directives for one or more processes

And many other attributes that let you easily configure your component:

https://flowcraft.readthedocs.io/en/latest/dev/create_process.html#process-attributes
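
As an illustration of why declaring input/output types matters, the sketch below checks whether a chain of components can be connected. The `Component` class and `check_chain` function are hypothetical stand-ins, not FlowCraft's API, and the types given to trimmomatic, spades and pilon are assumptions (only the fastqc types come from the slide).

```python
class Component:
    """Hypothetical stand-in for a declarative component class."""

    def __init__(self, name, input_type, output_type):
        self.name = name
        self.input_type = input_type
        self.output_type = output_type


def check_chain(components):
    """Return (upstream, downstream) name pairs whose types do not match."""
    mismatches = []
    for upstream, downstream in zip(components, components[1:]):
        if upstream.output_type != downstream.input_type:
            mismatches.append((upstream.name, downstream.name))
    return mismatches


# Assumed types for everything except fastqc.
trimmomatic = Component("trimmomatic", "fastq", "fastq")
fastqc = Component("fastqc", "fastq", "fastq")
spades = Component("spades", "fastq", "fasta")
pilon = Component("pilon", "fasta", "fasta")

valid = check_chain([trimmomatic, fastqc, spades, pilon])
invalid = check_chain([trimmomatic, spades, fastqc])
print(valid)    # no mismatches
print(invalid)  # spades emits fasta, fastqc expects fastq
```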

Crafting pipelines

With this framework, building workflows becomes dead simple:

flowcraft build -t 'trimmomatic fastqc spades pilon' -o my_nextflow_pipeline

Results in the following workflow DAG (directed acyclic graph)

$ nextflow run my_nextflow_pipeline.nf --help
N E X T F L O W  ~  version 0.32.0
Launching `my_nextflow_pipeline.nf` [jovial_swirles] - revision: b4473f5a12

============================================================
                F L O W C R A F T
============================================================
Built using flowcraft v1.4.0


Usage: 
    nextflow run my_nextflow_pipeline.nf

       --fastq                     Path expression to paired-end fastq files. (default: fastq/*_{1,2}.*) (default: 'fastq/*_{1,2}.*')
       
       Component 'INTEGRITY_COVERAGE_1_1'
       ----------------------------------
       --genomeSize_1_1            Genome size estimate for the samples in Mb. It is used to estimate the coverage and other assembly parameters andchecks (default: 1)
       --minCoverage_1_1           Minimum coverage for a sample to proceed. By default it's setto 0 to allow any coverage (default: 0)
       
       Component 'TRIMMOMATIC_1_2'
       ---------------------------
       --adapters_1_2              Path to adapters files, if any. (default: 'None')
       --trimSlidingWindow_1_2     Perform sliding window trimming, cutting once the average quality within the window falls below a threshold (default: '5:20')
       --trimLeading_1_2           Cut bases off the start of a read, if below a threshold quality (default: 3)
       --trimTrailing_1_2          Cut bases of the end of a read, if below a threshold quality (default: 3)
       --trimMinLength_1_2         Drop the read if it is below a specified length  (default: 55)
       --clearInput_1_2            Permanently removes temporary input files. This option is only useful to remove temporary files in large workflows and prevents nextflow's resume functionality. Use with caution. (default: false)
       
       Component 'FASTQC_1_3'
       ----------------------
       --adapters_1_3              Path to adapters files, if any. (default: 'None')
       
       Component 'SPADES_1_4'
       ----------------------
       --spadesMinCoverage_1_4     The minimum number of reads to consider an edge in the de Bruijn graph during the assembly (default: 2)
       --spadesMinKmerCoverage_1_4 Minimum contigs K-mer coverage. After assembly only keep contigs with reported k-mer coverage equal or above this value (default: 2)
       --spadesKmers_1_4           If 'auto' the SPAdes k-mer lengths will be determined from the maximum read length of each assembly. If 'default', SPAdes will use the default k-mer lengths.  (default: 'auto')
       --clearInput_1_4            Permanently removes temporary input files. This option is only useful to remove temporary files in large workflows and prevents nextflow's resume functionality. Use with caution. (default: false)
       --disableRR_1_4             disables repeat resolution stage of assembling. (default: false)
       
       Component 'ASSEMBLY_MAPPING_1_5'
       --------------------------------
       --minAssemblyCoverage_1_5   In auto, the default minimum coverage for each assembled contig is 1/3 of the assembly mean coverage or 10x, if the mean coverage is below 10x (default: 'auto')
       --AMaxContigs_1_5           A warning is issued if the number of contigs is overthis threshold. (default: 100)
       --genomeSize_1_5            Genome size estimate for the samples. It is used to check the ratio of contig number per genome MB (default: 2.1)
       
       Component 'PILON_1_6'
       ---------------------
       --clearInput_1_6            Permanently removes temporary input files. This option is only useful to remove temporary files in large workflows and prevents nextflow's resume functionality. Use with caution. (default: false)

Help and parameters tailor-made to the pipeline
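
The suffixes in the help output above (for example `--adapters_1_2` versus `--adapters_1_3`) keep parameters of different components apart. A minimal sketch of that naming scheme follows, assuming the suffix is `_<lane>_<position>`; `tailor_params` is a hypothetical helper, not part of FlowCraft.

```python
def tailor_params(components, lane=1):
    """Give every parameter a '_<lane>_<position>' suffix, grouped per component."""
    tailored = {}
    for position, (name, params) in enumerate(components, start=1):
        suffix = f"_{lane}_{position}"
        header = f"{name.upper()}{suffix}"
        tailored[header] = {
            f"--{param}{suffix}": default for param, default in params.items()
        }
    return tailored


# First three components of the pipeline, with defaults taken from the help text.
components = [
    ("integrity_coverage", {"genomeSize": 1, "minCoverage": 0}),
    ("trimmomatic", {"adapters": "'None'"}),
    ("fastqc", {"adapters": "'None'"}),
]

tailored = tailor_params(components)
for header, params in tailored.items():
    print(f"Component '{header}'")
    for flag, default in params.items():
        print(f"    {flag} (default: {default})")
```

With this scheme, trimmomatic and fastqc can both expose an `adapters` parameter and still be set independently on the command line.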

Crafting pipelines

It's easy to get experimental:

flowcraft build -t 'trimmomatic fastqc skesa pilon' -o my_nextflow_pipeline

Switch spades for skesa

flowcraft build -t 'trimmomatic fastqc skesa pilon (abricate | prokka)' -o my_nextflow_pipeline

Add genome annotation components at the end

Crafting pipelines

It's easy to get wild:

flowcraft build -t 'reads_download (
    spades | skesa pilon (abricate | chewbbaca) | megahit | 
    fastqc_trimmomatic fastqc (spades pilon (
        mlst | prokka | chewbbaca) | skesa pilon abricate))' -o my_nextflow_pipeline

wait, what?
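
To unpack the pipeline string above: components separated by spaces run in sequence, and `( a | b )` forks the preceding component into parallel lanes. The sketch below is a small recursive parser for that notation that returns the resulting tree of connections. It is an illustration written for this deck, not FlowCraft's own parser, and it assumes that a lane cannot continue after a fork closes.

```python
import re


def tokenize(pipeline):
    """Split a pipeline string into component names and the symbols ( | )."""
    return re.findall(r"[()|]|[^\s()|]+", pipeline)


def parse(pipeline):
    """Return (nodes, edges): nodes[i] is a name, edges are (parent, child) indices."""
    tokens = tokenize(pipeline)
    nodes, edges = [], []
    pos = 0

    def parse_lane(parent):
        # A lane is a run of components, optionally ending in a fork.
        nonlocal pos
        forked = False
        while pos < len(tokens) and tokens[pos] not in ("|", ")"):
            token = tokens[pos]
            pos += 1
            if forked:
                raise ValueError("a lane cannot continue after a fork")
            if token == "(":
                if parent is None:
                    raise ValueError("a fork needs an upstream component")
                parse_fork(parent)
                forked = True
            else:
                nodes.append(token)
                child = len(nodes) - 1
                if parent is not None:
                    edges.append((parent, child))
                parent = child

    def parse_fork(parent):
        # Every lane inside the parentheses starts from the same parent.
        nonlocal pos
        while True:
            parse_lane(parent)
            if pos >= len(tokens):
                raise ValueError("missing ')'")
            separator = tokens[pos]
            pos += 1
            if separator == ")":
                return

    parse_lane(None)
    if pos != len(tokens):
        raise ValueError(f"unexpected '{tokens[pos]}'")
    return nodes, edges


WILD = """reads_download (
    spades | skesa pilon (abricate | chewbbaca) | megahit |
    fastqc_trimmomatic fastqc (spades pilon (
        mlst | prokka | chewbbaca) | skesa pilon abricate))"""

nodes, edges = parse(WILD)
for parent, child in edges:
    print(f"{nodes[parent]} -> {nodes[child]}")
```

Nodes are tracked by index rather than by name because the same component (for example `pilon`) can appear in several lanes.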

More building features

Forks

Connect one component to multiple

Secondary channels

Connect non-adjacent components

Extra inputs

Inject user input data anywhere

Recipes

Curated and pre-assembled pipelines for specific needs

Workflow live monitoring

Tracks execution in real time

Minimal requirements for general nextflow pipelines

Workflow live reports

Dynamic generation of an interactive report page

Reports can be updated live or viewed at the end of the run

Demo time!

The future

+

Flexible and modular pipeline builder

Deploit platform for deploying nextflow pipelines on the cloud (AWS, Azure and soon Google Cloud)

Easily craft your own nextflow pipeline and deploy it on the cloud on any scale

But you still need to provide computational resources

The future - is arriving

The team

Core developers

Diogo N Silva

Tiago F Jesus

Catarina I Mendes

Bruno Ribeiro-Gonçalves

Advisors

Prof. Mário Ramirez

Prof. João André Carriço

Lifebit Bioinformatician

Lifebit Bioinformatician

Lifebit Bioinformatician

PhD at IMM

PI at IMM

Researcher at IMM

Thank you for your attention

and happy pipeline building

Join the fun!

conda install flowcraft
pip install flowcraft

Funding and acknowledgements

BacGenTrack project [FCT / Scientific and Technological Research Council of Turkey, TUBITAK/0004/2014]

ONEIDA project (LISBOA-01-0145-FEDER-016417) co-funded by 'Fundos Internacionais Europeus Estruturais e de Investimento' and national funds from FCT - Fundação para a Ciência e Tecnologia

Lifebit Biotech Ltd.

Flowcraft - Crick 2019

By Diogo Silva
