THOR 17
3/07/2018
DevOps and workflow automation in bioinformatics and computational biology
Diogo N. Silva and Bruno Ribeiro-Gonçalves
M. Ramirez Lab
Why?
- Analyze large amounts of sequence data routinely
- Some computationally intensive steps
- Constantly updating/adding software
How?
- Create a pipeline once, run it everywhere!
- Containerized software and version control
- Pipeline reproducibility
- Parallelization
- Modularity
What is Nextflow?
Nextflow is a reactive workflow framework and a programming DSL that eases writing [containerized] computational pipelines.
Reactive workflow framework
Create pipelines with asynchronous (and implicitly parallelized) data streams
Programming DSL
Has its own (simple) language for building a pipeline
Containerized
Integration with container engines (Docker, Singularity, Shifter) out of the box
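A minimal sketch of what this looks like in practice (the file name hello.nf and the greetings are illustrative; the syntax matches the DSL1-style examples used throughout this talk):

// hello.nf
greetings = Channel.from("Hello", "Bonjour", "Olá")

process sayHello {

    echo true   // print each task's stdout to the console

    input:
    val greeting from greetings

    """
    echo ${greeting} world!
    """
}

Running it is a one-liner (nextflow run hello.nf), and each greeting is processed as an independent, implicitly parallel task.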
Nextflow | Requirements
Mandatory:
- Bash: ubiquitous on UNIX. Windows: Cygwin or the Linux subsystem... maybe...
- Java 1.8: sudo apt-get install openjdk-8-jdk
- Nextflow: curl -s https://get.nextflow.io | bash
Optional (but highly recommended):
- Container engine (Docker, Singularity)
Disclaimer:
Creating Nextflow pipelines is aimed at bioinformaticians familiar with programming.
Executing Nextflow pipelines is for everyone.
Nextflow | why bother?
- No need to manage temporary input/output directories/files
- No need for custom handling of concurrency (parallelization)
- A single pipeline with support for any scripting language (Bash, Python, Perl, R, Ruby, all of them!)
- Every process (task) can be run in a container
- Portable -> abstracts pipeline building from execution (the same pipeline runs on a laptop, a server, a cluster, etc.)
- Checkpoints and resume functionality
- Streaming capability
- Host the pipeline on GitHub and run it remotely!
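Two of these features in command form (main.nf is a placeholder file name; -resume is a standard Nextflow option):

# First run (suppose a step fails, or a new process is added afterwards)
nextflow run main.nf

# Re-run: cached results of finished tasks are reused, only missing work is executed
nextflow run main.nf -resume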
Creating a pipeline | key concepts
- Processes
- Building blocks of the pipeline. Can contain Unix shell commands or code written in any scripting language (Python, Perl, R, Bash, etc.).
- Executed independently and in parallel.
- Contain directives for the container to use, CPU/RAM/disk limits, etc.
- Channels
- Unidirectional FIFO communication channels between processes.
[Diagram: RAW data → Channel 1 → Process A → Channel 2 → Process B]
Creating a pipeline | Anatomy of a process
process <name> {

    input:
    <input channels>

    // Optional
    output:
    <output channels>

    """
    <command/code block>
    """
}
Creating a pipeline | Anatomy of a process
process someName {

    input:
    val x from Channel.value(1)
    file y from Channel.fromPath("/path/to/file")
    set x, file(fl) from Channel.fromFilePairs("/path")

    // Optional
    output:
    val "some_str" into outputChannel1
    set x, file("*_regex.fas") into outputChannel2

    """
    echo Code goes here (bash by default)
    echo Access input variables like $x or ${y}
    bash_var="2"
    echo And access bash variables like \$bash_var
    """
}
1. Set process name
2. Set one or more input channels
3. Output channels are optional
4. The code block is always last
Creating a pipeline | Anatomy of a process
process someName {

    input:
    val x from Channel.value(1)
    file y from Channel.fromPath("/path/to/file")
    set x, file(y) from Channel.fromFilePairs("/path")

    // Optional
    output:
    val "some_str" into outputChannel1
    set x, file("*_regex.fas") into outputChannel2

    """
    #!/usr/bin/python3
    print("Some python code")
    print("Access variables like ${x} and ${y}")
    """
}
1. Set process name
2. Set one or more input channels
3. Output channels are optional
4. The code block is always last
5. You can change the interpreter!
Creating a pipeline | Anatomy of a process
process someName {

    input:
    val x from Channel.value(1)
    file y from Channel.fromPath("/path/to/file")
    set x, file(y) from Channel.fromFilePairs("/path")

    // Optional
    output:
    val "some_str" into outputChannel1
    set x, file("*_regex.fas") into outputChannel2

    script:
    template "my_script.py"
}
1. Set process name
2. Set one or more input channels
3. Output channels are optional
4. The code block is always last
5. Or use templates!
#!/usr/bin/python
# Store this in templates/my_script.py
def main():
    print("Variables are here ${x} ${y}")

main()
Creating a pipeline | simple example
startChannel = Channel.fromFilePairs(params.fastq)

process fastQC {

    input:
    set sampleId, file(fastq) from startChannel

    output:
    set sampleId, file(fastq) into fastqcOut

    """
    fastqc --extract --nogroup --format fastq \
        --threads ${task.cpus} ${fastq}
    """
}

process mapping {

    input:
    set sampleId, file(fastq) from fastqcOut
    each file(ref) from Channel.fromPath(params.reference)

    """
    bowtie2-build --threads ${task.cpus} \
        ${ref} genome_index
    bowtie2 --threads ${task.cpus} -x genome_index \
        -1 ${fastq[0]} -2 ${fastq[1]} -S mapping.sam
    """
}
[Diagram: startChannel → fastQC → fastqcOut → mapping, with the reference channel also feeding mapping]
Creating a pipeline | Customize processes
process mapping {

    container "ummidock/bowtie2_samtools:1.0.0-1"
    cpus 1
    memory "1GB"
    publishDir "results/bowtie/"

    input:
    set sampleId, file(fastq) from fastqcOut
    each file(ref) from Channel.fromPath(params.reference)

    output:
    file '*_mapping.sam'

    """
    <command block>
    """
}
[Diagram: fastQC (CPU: 2, Mem: 3 Gb) and mapping (CPU: 1, Mem: 1 Gb) connected through startChannel and AssemblerOut]
- Set different containers for each process
- Set custom CPU/RAM profiles that maximize pipeline performance
- Dozens more options in the directives documentation
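A hedged sketch of a few more directives of this kind (the trimming process and its values are made up; container, cpus, memory, time, errorStrategy, maxRetries and publishDir are all documented Nextflow directives):

process trimming {

    container "some/image:1.0.0"     // hypothetical container image
    cpus 2
    memory "2 GB"
    time "1h"                        // wall-time limit for each task
    errorStrategy "retry"            // re-submit the task if it fails
    maxRetries 2
    publishDir "results/trimming/"

    input:
    set sampleId, file(fastq) from someChannel

    output:
    file "*_trimmed.fastq.gz"

    """
    <command block>
    """
}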
Creating a pipeline | Customize everything
// nextflow.config
params {
    fastq = "data/*_{1,2}.*"
    reference = "ref/*.fasta"
}

process {
    $fastQC.container = "ummidock/fastqc:0.11.5-1"
    $mapping.container = "ummidock/bowtie2_samtools:1.0.0-1"
    $mapping.publishDir = "results/bowtie/"
    $fastQC.cpus = 2
    $fastQC.memory = "2GB"
    $mapping.cpus = 4
    $mapping.memory = "2GB"
}

profiles {
    standard {
        docker.enabled = true
    }
}
nextflow.config can define:
- Parameters
- Process directives
- Docker options
- Profiles
Creating a pipeline | Deploy on GitHub
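(The slide shows the pipeline repository on GitHub.) Once main.nf and nextflow.config are pushed to a repository, anyone can fetch and run the pipeline by name; for example, with Nextflow's own demo repository:

# Download (or update) a pipeline hosted on GitHub
nextflow pull nextflow-io/hello

# Run it directly, no manual cloning needed
nextflow run nextflow-io/hello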
Creating a pipeline | Profiles
// nextflow.config
profiles {

    standard {
        docker.enabled = true
    }

    lobo {
        process.executor = "slurm"
        shifter.enabled = true
        process.$mapping.cpus = 8
    }
}
The standard profile is the default.
Profiles can override more general settings.
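Profiles are selected at run time with the -profile option (main.nf is a placeholder name; lobo is the profile defined above):

# Default: the "standard" profile -> run locally with Docker
nextflow run main.nf

# Cluster run: SLURM executor, Shifter containers, 8 CPUs for the mapping process
nextflow run main.nf -profile lobo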
Creating a pipeline | Channel operators
[Diagram: startChannel → fastQC → fastqcOut → mapping, with an operator transforming the channel between the two processes]
Operators can transform and connect channels without interfering with processes.
There are dozens of available operators.
Creating a pipeline | Channel operators
Filter channels:
Channel
    .from( 'a', 'b', 'aa', 'bc', 3, 4.5 )
    .filter( ~/^a.*/ )
    .into { newChannel }
// emits: a, aa

Map channels:
Channel
    .from( 1, 2, 3 )
    .map { it * it }
    .into { newChannel }
// emits: 1, 4, 9

Fork channels:
Channel
    .from( 'a', 'b', 'aa', 'bc', 3, 4.5 )
    .into { newChannel1; newChannel2 }
// the same items are emitted into both newChannel1 and newChannel2

Collect channels:
Channel
    .from( 1, 2, 3, 4 )
    .collect()
    .into { oneChannel }
// emits: [1,2,3,4]
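A handy trick while composing operator chains is to print what a channel actually emits; a small sketch (the channel contents are illustrative; println is a documented channel operator):

Channel
    .from( 'a', 'b', 'aa', 'bc' )
    .filter( ~/^a.*/ )
    .println()     // prints each emitted item: a, aa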
Creating a pipeline | Channel operators
Example:

process integrity_coverage {

    input:
    <inputs>

    output:
    set fastq_id,
        file(fastq_pair),
        file('*_encoding'),
        file('*_phred'),
        file('*_coverage'),
        file('*_max_len') into MAIN_integrity

    script:
    template "integrity_coverage.py"
}

LOG_corrupted = Channel.create()
MAIN_PreCoverageCheck = Channel.create()

MAIN_integrity.choice(LOG_corrupted, MAIN_PreCoverageCheck) {
    a -> a[2].text == "corrupt" ? 0 : 1
}

MAIN_outputChannel = Channel.create()
SIDE_phred = Channel.create()
SIDE_max_len = Channel.create()

MAIN_PreCoverageCheck
    .filter{ it[4].text != "fail" }
    .separate(MAIN_outputChannel, SIDE_phred, SIDE_max_len){
        a -> [ [a[0], a[1]], [a[0], a[3].text], [a[0], a[5].text] ]
    }

process report_corrupt {

    input:
    val fastq_id from LOG_corrupted.collect{it[0]}

    output:
    file 'corrupted_samples.txt'

    """
    echo ${fastq_id.join(",")} | tr "," "\n" >> corrupted_samples.txt
    """
}
More information
Nextflow website: https://www.nextflow.io/
Nextflow docs: https://www.nextflow.io/docs/latest/index.html
Awesome nextflow: https://github.com/nextflow-io/awesome-nextflow
Thank you | for your attention
Acknowledgements
M. Ramirez Lab
INESC THOR