Diogo N Silva
@ODiogoSilva
ODiogoSilva
The game changing combination of nextflow + containers:
But substantial challenges persist:
The premise:
Workflow based development
Component based development
Components are modular pieces of nextflow code with some basic rules:
Component A
- Input/Output
- Parameters
- Resources
Component B
- Input/Output
- Parameters
- Resources
IN_adapters_{{ pid }} = Channel
.value(params.adapters{{ param_id }})
process fastqc2 {
tag { fastq_id }
input:
set fastq_id, file(fastq_pair) from {{ input_channel }}
val ad from IN_adapters
output:
set fastq_id, file(fastq_pair) into {{ output_channel }}
script:
template "fastqc.py"
}
{{ forks }}
Nextflow template file
class Fastqc(Process):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.input_type = "fastq"
self.output_type = "fastq"
self.params = {
"adapters": {
"default": "'None'",
"description":
"Path to adapters files, if any."
}
}
self.directives = {"fastqc2": {
"cpus": 2,
"memory": "'4GB'",
"container": "flowcraft/fastqc",
"version": "0.11.7-1"
}}
Python declarative class
Specify input/output types so components can be connected
Add any number/type of parameters
Add process directives for one or more processes
And many other attributes that let you easily configure your component:
https://flowcraft.readthedocs.io/en/latest/dev/create_process.html#process-attributes
With this framework, building workflows becomes dead simple:
flowcraft build -t 'trimmomatic fastqc spades pilon' -o my_nextflow_pipeline
Results in the following workflow DAG (direct acyclic graph)
$ nextflow run my_nextflow_pipeline.nf --help
N E X T F L O W ~ version 0.32.0
Launching `my_nextflow_pipeline.nf` [jovial_swirles] - revision: b4473f5a12
============================================================
F L O W C R A F T
============================================================
Built using flowcraft v1.4.0
Usage:
nextflow run my_nextflow_pipeline.nf
--fastq Path expression to paired-end fastq files. (default: fastq/*_{1,2}.*) (default: 'fastq/*_{1,2}.*')
Component 'INTEGRITY_COVERAGE_1_1'
----------------------------------
--genomeSize_1_1 Genome size estimate for the samples in Mb. It is used to estimate the coverage and other assembly parameters andchecks (default: 1)
--minCoverage_1_1 Minimum coverage for a sample to proceed. By default it's setto 0 to allow any coverage (default: 0)
Component 'TRIMMOMATIC_1_2'
---------------------------
--adapters_1_2 Path to adapters files, if any. (default: 'None')
--trimSlidingWindow_1_2 Perform sliding window trimming, cutting once the average quality within the window falls below a threshold (default: '5:20')
--trimLeading_1_2 Cut bases off the start of a read, if below a threshold quality (default: 3)
--trimTrailing_1_2 Cut bases of the end of a read, if below a threshold quality (default: 3)
--trimMinLength_1_2 Drop the read if it is below a specified length (default: 55)
--clearInput_1_2 Permanently removes temporary input files. This option is only useful to remove temporary files in large workflows and prevents nextflow's resume functionality. Use with caution. (default: false)
Component 'FASTQC_1_3'
----------------------
--adapters_1_3 Path to adapters files, if any. (default: 'None')
Component 'SPADES_1_4'
----------------------
--spadesMinCoverage_1_4 The minimum number of reads to consider an edge in the de Bruijn graph during the assembly (default: 2)
--spadesMinKmerCoverage_1_4 Minimum contigs K-mer coverage. After assembly only keep contigs with reported k-mer coverage equal or above this value (default: 2)
--spadesKmers_1_4 If 'auto' the SPAdes k-mer lengths will be determined from the maximum read length of each assembly. If 'default', SPAdes will use the default k-mer lengths. (default: 'auto')
--clearInput_1_4 Permanently removes temporary input files. This option is only useful to remove temporary files in large workflows and prevents nextflow's resume functionality. Use with caution. (default: false)
--disableRR_1_4 disables repeat resolution stage of assembling. (default: false)
Component 'ASSEMBLY_MAPPING_1_5'
--------------------------------
--minAssemblyCoverage_1_5 In auto, the default minimum coverage for each assembled contig is 1/3 of the assembly mean coverage or 10x, if the mean coverage is below 10x (default: 'auto')
--AMaxContigs_1_5 A warning is issued if the number of contigs is overthis threshold. (default: 100)
--genomeSize_1_5 Genome size estimate for the samples. It is used to check the ratio of contig number per genome MB (default: 2.1)
Component 'PILON_1_6'
---------------------
--clearInput_1_6 Permanently removes temporary input files. This option is only useful to remove temporary files in large workflows and prevents nextflow's resume functionality. Use with caution. (default: false)
Help and parameters tailor-made to the pipeline
It's easy to get experimental:
flowcraft build -t 'trimmomatic fastqc skesa pilon' -o my_nextflow_pipeline
Switch spades for skesa
flowcraft build -t 'trimmomatic fastqc skesa pilon (abricate | prokka)' -o my_nextflow_pipeline
Add genome annotation components in the end
It's easy to get wild:
flowcraft build -t 'reads_download (
spades | skesa pilon (abricate | chewbbaca) | megahit |
fastqc_trimmomatic fastqc (spades pilon (
mlst | prokka | chewbbaca) | skesa pilon abricate))' -o my_nextflow_pipeline
wait, what?
Forks
Connect one component to multiple
Secondary channels
Connect non-adjacent components
Extra inputs
Inject user input data anywhere
Recipes
Curated and pre-assembled pipelines for specific needs
Tracks execution in real time
Minimal requirements for general nextflow pipelines
Dynamic generation of interactive report page
Reports can be updated live or viewed at the end of the run
Flexible and modular pipeline builder
Deploit platform for deploying nextflow pipelines on the cloud
(AWS, Azure and
soon Google cloud)
Easily craft your own nextflow pipeline and deploy it on the cloud on any scale
But you still need to provide computational resources
Diogo N Silva
Tiago F Jesus
Catarina I Mendes
Bruno
Ribeiro-Gonçalves
Prof. Mário Ramirez
Prof. João André Carriço
Lifebit Bioinformatician
Lifebit Bioinformatician
Lifebit Bioinformatician
PhD at IMM
PI at IMM
Researcher at IMM
and happy pipeline building
conda install flowcraft
pip install flowcraft
BacGenTrack project [FCT / Scientific and Technological Research Council of Turkey, TUBITAK/0004/2014]
Funding and ackowledgements
ONEIDA project (LISBOA-01-0145-FEDER-016417) co-founded by 'Fundos Internacionais Europeus Estruturais e de Investimento' and the national funds from FCT - Fundação para a Ciência e Tecnologia
Lifebit Biotech Ltd.