Diogo N. Silva and Bruno Ribeiro-Gonçalves
M. Ramirez Lab
Nextflow is a reactive workflow framework and a programming DSL that eases writing [containerized] computational pipelines.
Reactive workflow framework
Create pipelines with asynchronous (and implicitly parallelized) data streams
Programming DSL
Has its own (simple) language for building pipelines
Containerized
Integration with container engines (Docker, Singularity, Shifter) out of the box
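For a first taste, a minimal but complete pipeline (a sketch of our own; the values and process name are illustrative). Each value flowing through the channel spawns an independent task, with no extra parallelization code:

greetings = Channel.from('Hello', 'Bonjour', 'Olá')

process sayHello {
    input:
    val greeting from greetings

    // One task per channel value, run in parallel automatically
    """
    echo '${greeting} world!'
    """
}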
Mandatory:
- Java 8 (e.g. sudo apt-get install openjdk-8-jdk)
- A POSIX environment: ubiquitous on UNIX. Windows: Cygwin or the Linux subsystem... maybe...

Install Nextflow:
curl -s https://get.nextflow.io | bash

Optional (but highly recommended): a container engine, e.g. Docker
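To check the installation, print the version and run the public nextflow-io/hello pipeline, which Nextflow fetches from GitHub:

./nextflow -version
./nextflow run hello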
Disclaimer:
Creating Nextflow pipelines is designed for bioinformaticians familiar with programming.
Executing Nextflow pipelines is for everyone.
- No need to manage temporary input/output directories/files
- No need for custom handling of concurrency (parallelization)
- A single pipeline with support for any scripting language (Bash, Python, Perl, R, Ruby, all of them!)
- Every process (task) can be run in a container
- Portable -> abstracts pipeline building from execution (the same pipeline runs on my laptop, servers, clusters, etc.)
- Checkpoints and resume functionality (see the commands below)
- Streaming capability
- Host a pipeline on GitHub and run it remotely! (see the commands below)
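For example (main.nf stands for any pipeline script):

# Re-run a pipeline; cached results of unchanged tasks are reused
nextflow run main.nf -resume

# Run a pipeline hosted on GitHub directly
nextflow run nextflow-io/hello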
[Diagram: RAW data flows through Channel 1 into Process A, then through Channel 2 into Process B]
Processes:
- The building blocks of the pipeline. Can contain Unix shell commands or code written in any scripting language (Python, Perl, R, Bash, etc.).
- Executed independently and in parallel.
- Contain directives for the container to use, CPU/RAM/disk limits, etc.

Channels:
- Unidirectional FIFO communication channels between processes.
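A minimal sketch of the diagram above (the file glob, file names and commands are our own): processA consumes raw data from one channel and feeds processB through a second one.

// Channel 1: raw data
rawData = Channel.fromPath('data/*.txt')

process processA {
    input:
    file f from rawData

    // Channel 2: connects processA to processB
    output:
    file 'upper.txt' into channel2

    """
    tr a-z A-Z < ${f} > upper.txt
    """
}

process processB {
    input:
    file upper from channel2

    """
    wc -l ${upper}
    """
}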
process <name> {
    input:
    <input channels>

    // Optional
    output:
    <output channels>

    """
    <command/code block>
    """
}
process someName {
    input:
    // Three alternative input declaration styles (illustrative)
    val x from Channel.value(1)
    file y from Channel.fromPath("/path/to/file")
    set x, file(fl) from Channel.fromFilePairs("/path")

    // Optional
    output:
    val "some_str" into outputChannel1
    set x, file("*_regex.fas") into outputChannel2

    """
    echo "Code goes here (bash by default)"
    echo "Access input variables like $x or ${y}"
    bash_var="2"
    echo "And access bash variables like \$bash_var"
    """
}
1. Set process name
2. Set one or more input channels
3. Output channels are optional
4. The code block is always last
process someName {
    input:
    val x from Channel.value(1)
    file y from Channel.fromPath("/path/to/file")
    set x, file(y) from Channel.fromFilePairs("/path")

    // Optional
    output:
    val "some_str" into outputChannel1
    set x, file("*_regex.fas") into outputChannel2

    """
    #!/usr/bin/python3
    print("Some python code")
    print("Access variables like ${x} and ${y}")
    """
}
1. Set process name
2. Set one or more input channels
3. Output channels are optional
4. The code block is always last
5. You can change the interpreter!
process someName {
    input:
    val x from Channel.value(1)
    file y from Channel.fromPath("/path/to/file")
    set x, file(y) from Channel.fromFilePairs("/path")

    // Optional
    output:
    val "some_str" into outputChannel1
    set x, file("*_regex.fas") into outputChannel2

    script:
    template "my_script.py"
}
1. Set process name
2. Set one or more input channels
3. Output channels are optional
4. The code block is always last
5. Or use templates!
#!/usr/bin/python
# Store this in templates/my_script.py
def main():
    print("Variables are here ${x} ${y}")

main()
startChannel = Channel.fromFilePairs(params.fastq)

process fastQC {
    input:
    set sampleId, file(fastq) from startChannel

    output:
    set sampleId, file(fastq) into fastqcOut

    """
    fastqc --extract --nogroup --format fastq \
        --threads ${task.cpus} ${fastq}
    """
}
process mapping {
    input:
    set sampleId, file(fastq) from fastqcOut
    each file(ref) from Channel.fromPath(params.reference)

    """
    bowtie2-build --threads ${task.cpus} \
        ${ref} genome_index
    bowtie2 --threads ${task.cpus} -x genome_index \
        -1 ${fastq[0]} -2 ${fastq[1]} -S mapping.sam
    """
}
[Diagram: startChannel → fastQC → fastqcOut → mapping, with a reference channel also feeding mapping]
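Assuming the two processes above live in a file called pipeline.nf (a name of our choosing), command-line options fill in params.fastq and params.reference:

nextflow run pipeline.nf --fastq 'data/*_{1,2}.fastq.gz' --reference 'ref/genome.fasta'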
process mapping {
    container "ummidock/bowtie2_samtools:1.0.0-1"
    cpus 1
    memory "1GB"
    publishDir "results/bowtie/"

    input:
    set sampleId, file(fastq) from fastqcOut
    each file(ref) from Channel.fromPath(params.reference)

    output:
    file '*_mapping.sam'

    """
    <command block>
    """
}
[Diagram: startChannel → fastQC → mapping → AssemblerOut, each process in its own container with its own resources (CPU: 2 / Mem: 3Gb; CPU: 1 / Mem: 1Gb)]
- Set a different container for each process
- Set custom CPU/RAM profiles that maximize pipeline performance
- Dozens more options in the directives documentation
// nextflow.config
params {
    fastq = "data/*_{1,2}.*"
    reference = "ref/*.fasta"
}

process {
    $fastQC.container = "ummidock/fastqc:0.11.5-1"
    $mapping.container = "ummidock/bowtie2_samtools:1.0.0-1"
    $mapping.publishDir = "results/bowtie/"
    $fastQC.cpus = 2
    $fastQC.memory = "2GB"
    $mapping.cpus = 4
    $mapping.memory = "2GB"
}

profiles {
    standard {
        docker.enabled = true
    }
}
nextflow.config
- Parameters
- Process directives
- Docker options
- Profiles
// nextflow.config
profiles {
    standard {
        docker.enabled = true
    }
    lobo {
        process.executor = "slurm"
        shifter.enabled = true
        process.$mapping.cpus = 8
    }
}
The standard profile is the default.
Profiles can override more general settings.
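Profiles are selected at run time with the -profile option (pipeline.nf as above):

# Uses the default 'standard' profile: Docker on the local machine
nextflow run pipeline.nf

# Submits tasks to SLURM and runs them with Shifter, using 8 CPUs for mapping
nextflow run pipeline.nf -profile lobo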
[Diagram: operators acting on the channels between processes (startChannel → fastQC → fastqcOut → mapping)]
Operators can transform and connect channels without interfering with processes.
There are dozens of available operators.
Filter channels:
Channel
    .from( 'a', 'b', 'aa', 'bc', 3, 4.5 )
    .filter( ~/^a.*/ )
    .into { newChannel }
// newChannel emits: 'a', 'aa'
Map channels:
Channel
    .from( 1, 2, 3 )
    .map { it * it }
    .into { newChannel }
// newChannel emits: 1, 4, 9
Fork channels:
Channel
    .from( 'a', 'b', 'aa', 'bc', 3, 4.5 )
    .into { newChannel1; newChannel2 }
// Both channels receive every item
Collect channels:
Channel
    .from( 1, 2, 3, 4 )
    .collect()
    .into { oneChannel }
// oneChannel emits a single item: [1, 2, 3, 4]
Example:
process integrity_coverage {
    input:
    <inputs>

    output:
    set fastq_id,
        file(fastq_pair),
        file('*_encoding'),
        file('*_phred'),
        file('*_coverage'),
        file('*_max_len') into MAIN_integrity

    script:
    template "integrity_coverage.py"
}
LOG_corrupted = Channel.create()
MAIN_PreCoverageCheck = Channel.create()

// Route corrupted samples (choice 0) to LOG_corrupted,
// everything else (choice 1) to MAIN_PreCoverageCheck
MAIN_integrity.choice(LOG_corrupted, MAIN_PreCoverageCheck) {
    a -> a[2].text == "corrupt" ? 0 : 1
}
MAIN_outputChannel = Channel.create()
SIDE_phred = Channel.create()
SIDE_max_len = Channel.create()

// Drop samples that failed the coverage check, then split each item into
// three channels: (id, fastq pair), (id, phred score), (id, max read length)
MAIN_PreCoverageCheck
    .filter{ it[4].text != "fail" }
    .separate(MAIN_outputChannel, SIDE_phred, SIDE_max_len){
        a -> [ [a[0], a[1]], [a[0], a[3].text], [a[0], a[5].text] ]
    }
process report_corrupt {
    input:
    val fastq_id from LOG_corrupted.collect{ it[0] }

    output:
    file 'corrupted_samples.txt'

    """
    echo ${fastq_id.join(",")} | tr "," "\\n" >> corrupted_samples.txt
    """
}
Nextflow website: https://www.nextflow.io/
Nextflow docs: https://www.nextflow.io/docs/latest/index.html
Awesome nextflow: https://github.com/nextflow-io/awesome-nextflow
Acknowledgements
M. Ramirez Lab