A modular, extensible and flexible tool to build, monitor and report Nextflow pipelines

C I Mendes

@ines_cim

cimendes

Nextflow | What is it?

Enables scalable and reproducible scientific workflows using software containers. It simplifies the deployment of complex parallel and reactive workflows.

Reactive workflow framework: Create pipelines with asynchronous (and implicitly parallelized) data streams

Programming DSL: Has its own language for building a pipeline

Containerized: Integration with container engines (Docker, Singularity, Shifter) out of the box

Nextflow | Requirements

Requirements

Bash: Ubiquitous on UNIX. Windows: Cygwin or Linux subsystem... maybe...

Java 1.8: sudo apt-get install openjdk-8-jdk

Nextflow: curl -s https://get.nextflow.io | bash

Optional (but recommended)

Container engine (Docker, Singularity...)

Creation of Nextflow pipelines was designed for bioinformaticians familiar with programming.

Execution of Nextflow pipelines is for everyone

The motivation

The game-changing combination of Nextflow + containers:

Portability
Reproducibility
Scalability
Multi-scale containerization
Native cloud support

But substantial challenges persist:

Fast pace of the bioinformatics software landscape
Continuous need for benchmarking and comparative analyses
The need for agile and dynamic pipeline building
Removing the pain of changing the inner workings of workflows

The FlowCraft project

The premise:

Workflow based development

Component based development

Components are modular pieces of Nextflow code with some basic rules:

Component A

- Input/Output

- Parameters

- Resources

Component B

- Input/Output

- Parameters

- Resources

The component

IN_adapters_{{ pid }} = Channel
    .value(params.adapters{{ param_id }})

process fastqc2 {

    tag { fastq_id }

    input:
    set fastq_id, file(fastq_pair) from {{ input_channel }}
    val ad from IN_adapters_{{ pid }}

    output:
    set fastq_id, file(fastq_pair) into {{ output_channel }}

    script:
    template "fastqc.py"
}

{{ forks }}

Nextflow template file

Standard Nextflow code (1 or more processes, channels, etc)

Addition of placeholders that allow the engine to orchestrate components into a workflow
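To make the placeholder idea concrete, here is a minimal sketch of how an engine could fill in a template like the one on the previous slide. It uses a plain regular expression as a stand-in for a real template engine, so it is not FlowCraft's actual rendering code. The `render` function and the channel names in `context` are hypothetical, chosen only for illustration.

```python
import re

# Matches placeholders such as "{{ pid }}" or "{{input_channel}}".
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, context: dict) -> str:
    """Replace every {{ name }} placeholder with context[name].

    Raises KeyError if the template uses a name missing from the context.
    """
    return PLACEHOLDER.sub(lambda m: str(context[m.group(1)]), template)


# Shortened copy of the component template shown on the slide.
template = """\
IN_adapters_{{ pid }} = Channel
    .value(params.adapters{{ param_id }})

process fastqc2 {
    input:
    set fastq_id, file(fastq_pair) from {{ input_channel }}

    output:
    set fastq_id, file(fastq_pair) into {{ output_channel }}
}
"""

# Hypothetical values an engine might assign to one component instance.
context = {
    "pid": "1_3",
    "param_id": "_1_3",
    "input_channel": "trimmomatic_out_1_2",
    "output_channel": "fastqc_out_1_3",
}

rendered = render(template, context)
print(rendered)
```

Because each instance gets its own identifiers, the same component can appear several times in one pipeline without channel or parameter names colliding.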

The component

class Fastqc(Process):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        self.input_type = "fastq"
        self.output_type = "fastq"

        self.params = {
            "adapters": {
                "default": "'None'",
                "description":
                    "Path to adapters files, if any."
            }
        }

        self.directives = {"fastqc2": {
            "cpus": 2,
            "memory": "'4GB'",
            "container": "flowcraft/fastqc",
            "version": "0.11.7-1"
        }}

Python declarative class

Specify input/output types so components can be connected

Add any number/type of parameters

Add process directives for one or more processes

And many other attributes that let you easily configure your component:

https://flowcraft.readthedocs.io/en/latest/dev/create_process.html#process-attributes
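The input/output types are what let components be chained safely. The sketch below shows one way such a check could work. The `Process` class here is a simplified, hypothetical stand-in (not FlowCraft's own base class), `check_chain` is an invented helper, and the types given to trimmomatic, spades and pilon are assumptions; only the fastqc types come from the slide.

```python
class Process:
    """Hypothetical, stripped-down stand-in for a component base class."""

    def __init__(self, name, input_type, output_type):
        self.name = name
        self.input_type = input_type
        self.output_type = output_type


def check_chain(components):
    """Return (upstream, downstream) name pairs, or raise if types clash."""
    links = []
    # Walk over consecutive pairs: (c0, c1), (c1, c2), ...
    for up, down in zip(components, components[1:]):
        if up.output_type != down.input_type:
            raise ValueError(
                f"{up.name} emits '{up.output_type}' but "
                f"{down.name} expects '{down.input_type}'"
            )
        links.append((up.name, down.name))
    return links


# Assumed types, for illustration only (fastqc matches the slide).
trimmomatic = Process("trimmomatic", "fastq", "fastq")
fastqc = Process("fastqc", "fastq", "fastq")
spades = Process("spades", "fastq", "fasta")
pilon = Process("pilon", "fasta", "fasta")

print(check_chain([trimmomatic, fastqc, spades, pilon]))
```

With a check like this, putting an assembler before a read-level QC step fails at build time instead of at run time.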

Crafting pipelines

With this framework, building workflows becomes simple:

flowcraft build -t 'trimmomatic fastqc spades pilon' -o my_nextflow_pipeline

Results in the following workflow DAG

$ nextflow run my_nextflow_pipeline.nf --help
N E X T F L O W  ~  version 0.32.0
Launching `my_nextflow_pipeline.nf` [jovial_swirles] - revision: b4473f5a12

============================================================
                F L O W C R A F T
============================================================
Built using flowcraft v1.4.0


Usage: 
    nextflow run my_nextflow_pipeline.nf

       --fastq                     Path expression to paired-end fastq files. (default: fastq/*_{1,2}.*) (default: 'fastq/*_{1,2}.*')
       
       Component 'INTEGRITY_COVERAGE_1_1'
       ----------------------------------
       --genomeSize_1_1            Genome size estimate for the samples in Mb. It is used to estimate the coverage and other assembly parameters andchecks (default: 1)
       --minCoverage_1_1           Minimum coverage for a sample to proceed. By default it's setto 0 to allow any coverage (default: 0)
       
       Component 'TRIMMOMATIC_1_2'
       ---------------------------
       --adapters_1_2              Path to adapters files, if any. (default: 'None')
       --trimSlidingWindow_1_2     Perform sliding window trimming, cutting once the average quality within the window falls below a threshold (default: '5:20')
       --trimLeading_1_2           Cut bases off the start of a read, if below a threshold quality (default: 3)
       --trimTrailing_1_2          Cut bases of the end of a read, if below a threshold quality (default: 3)
       --trimMinLength_1_2         Drop the read if it is below a specified length  (default: 55)
       --clearInput_1_2            Permanently removes temporary input files. This option is only useful to remove temporary files in large workflows and prevents nextflow's resume functionality. Use with caution. (default: false)
       
       Component 'FASTQC_1_3'
       ----------------------
       --adapters_1_3              Path to adapters files, if any. (default: 'None')
       
       Component 'SPADES_1_4'
       ----------------------
       --spadesMinCoverage_1_4     The minimum number of reads to consider an edge in the de Bruijn graph during the assembly (default: 2)
       --spadesMinKmerCoverage_1_4 Minimum contigs K-mer coverage. After assembly only keep contigs with reported k-mer coverage equal or above this value (default: 2)
       --spadesKmers_1_4           If 'auto' the SPAdes k-mer lengths will be determined from the maximum read length of each assembly. If 'default', SPAdes will use the default k-mer lengths.  (default: 'auto')
       --clearInput_1_4            Permanently removes temporary input files. This option is only useful to remove temporary files in large workflows and prevents nextflow's resume functionality. Use with caution. (default: false)
       --disableRR_1_4             disables repeat resolution stage of assembling. (default: false)
       
       Component 'ASSEMBLY_MAPPING_1_5'
       --------------------------------
       --minAssemblyCoverage_1_5   In auto, the default minimum coverage for each assembled contig is 1/3 of the assembly mean coverage or 10x, if the mean coverage is below 10x (default: 'auto')
       --AMaxContigs_1_5           A warning is issued if the number of contigs is overthis threshold. (default: 100)
       --genomeSize_1_5            Genome size estimate for the samples. It is used to check the ratio of contig number per genome MB (default: 2.1)
       
       Component 'PILON_1_6'
       ---------------------
       --clearInput_1_6            Permanently removes temporary input files. This option is only useful to remove temporary files in large workflows and prevents nextflow's resume functionality. Use with caution. (default: false)

Help and parameters tailor-made to the pipeline
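A rough sketch of how a per-component help section like the ones above might be assembled from the `params` dictionary of a component class. The function name `component_help` is hypothetical, and the reading of the `_1_3` suffix as "lane 1, component 3" is an assumption based on the output shown, not something confirmed by the slides.

```python
def component_help(name, lane, pid, params):
    """Build a help section for one component instance.

    `params` follows the shape used in the component class:
    {param_name: {"default": ..., "description": ...}}.
    """
    # Assumed meaning of the suffix: lane number, then position in the lane.
    suffix = f"_{lane}_{pid}"
    header = f"Component '{name.upper()}{suffix}'"
    lines = [header, "-" * len(header)]

    # Pad flags so descriptions line up; +2 accounts for the leading "--".
    width = max(len(p) for p in params) + len(suffix) + 2
    for key, spec in params.items():
        flag = f"--{key}{suffix}"
        lines.append(
            f"{flag:<{width}} {spec['description']} "
            f"(default: {spec['default']})"
        )
    return "\n".join(lines)


# Parameter definition copied from the Fastqc class on the earlier slide.
fastqc_params = {
    "adapters": {
        "default": "'None'",
        "description": "Path to adapters files, if any.",
    }
}

print(component_help("fastqc", 1, 3, fastqc_params))
```

Suffixing every parameter with the component's identifier is what allows `--adapters_1_2` and `--adapters_1_3` to coexist in the same pipeline.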

Crafting pipelines

It's easy to get experimental:

flowcraft build -t 'trimmomatic fastqc skesa pilon' -o my_nextflow_pipeline

Switch spades for skesa

flowcraft build -t 'trimmomatic fastqc skesa pilon (abricate | prokka)' -o my_nextflow_pipeline

Add genome annotation components at the end
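The pipeline string is a tiny language of its own: space-separated components run in sequence, and `( a | b )` forks the flow into parallel branches. The sketch below is a hypothetical recursive-descent parser for that notation, written only to illustrate the idea; it is not FlowCraft's parser, and the output structure (lists for sequences, tuples for forks) is an arbitrary choice.

```python
import re

# A token is a parenthesis, a pipe, or a component name.
# Any other character (e.g. whitespace) is skipped.
TOKEN = re.compile(r"[()|]|[A-Za-z_]\w*")


def parse(pipeline):
    """Parse a pipeline string into nested Python structures.

    A sequence is a list; a component is a str; a fork is a tuple
    holding one sequence (list) per branch.
    """
    tokens = TOKEN.findall(pipeline)
    pos = 0

    def sequence():
        nonlocal pos
        items = []
        while pos < len(tokens) and tokens[pos] not in ("|", ")"):
            if tokens[pos] == "(":
                pos += 1  # consume "("
                branches = [sequence()]
                while pos < len(tokens) and tokens[pos] == "|":
                    pos += 1  # consume "|"
                    branches.append(sequence())
                if pos >= len(tokens) or tokens[pos] != ")":
                    raise ValueError("unbalanced parentheses")
                pos += 1  # consume ")"
                items.append(tuple(branches))
            else:
                items.append(tokens[pos])
                pos += 1
        return items

    result = sequence()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result


tree = parse("trimmomatic fastqc skesa pilon (abricate | prokka)")
print(tree)
```

Because forks can contain further sequences, the same function handles arbitrarily nested pipelines without extra code.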

Crafting pipelines

It's easy to get wild:

flowcraft build -t 'reads_download (
    spades | skesa pilon (abricate | chewbbaca) | megahit | 
    fastqc_trimmomatic fastqc (spades pilon (
        mlst | prokka | chewbbaca) | skesa pilon abricate))' -o my_nextflow_pipeline

wait, what?