Diogo N. Silva and Bruno Ribeiro-Gonçalves
M. Ramirez Lab
Nextflow is a reactive workflow framework and a programming DSL that eases writing [containerized] computational pipelines.
Reactive workflow framework
Create pipelines with asynchronous (and implicitly parallelized) data streams
Programming DSL
Has its own (simple) language for building pipelines
Containerized
Integration with container engines (Docker, Singularity, Shifter) out of the box
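For a first taste, a minimal but complete pipeline (a sketch of our own; the values and process name are illustrative). Each value flowing through the channel spawns an independent task, with no extra parallelization code:

greetings = Channel.from('Hello', 'Bonjour', 'Olá')

process sayHello {
    input:
    val greeting from greetings

    // One task per channel value, run in parallel automatically
    """
    echo '${greeting} world!'
    """
}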
Mandatory:
- Java 8 (e.g. sudo apt-get install openjdk-8-jdk)
- A POSIX environment: ubiquitous on UNIX. Windows: Cygwin or the Linux subsystem... maybe...

Install Nextflow:
curl -s https://get.nextflow.io | bash

Optional (but highly recommended): a container engine, e.g. Docker
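To check the installation, print the version and run the public nextflow-io/hello pipeline, which Nextflow fetches from GitHub:

./nextflow -version
./nextflow run hello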
Disclaimer:
Creating Nextflow pipelines is designed for bioinformaticians familiar with programming.
Executing Nextflow pipelines is for everyone.
- No need to manage temporary input/output directories/files
- No need for custom handling of concurrency (parallelization)
- A single pipeline with support for any scripting language (Bash, Python, Perl, R, Ruby, all of them!)
- Every process (task) can be run in a container
- Portable -> abstracts pipeline building from execution (the same pipeline runs on my laptop, servers, clusters, etc.)
- Checkpoints and resume functionality (see the commands below)
- Streaming capability
- Host a pipeline on GitHub and run it remotely! (see the commands below)
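For example (main.nf stands for any pipeline script):

# Re-run a pipeline; cached results of unchanged tasks are reused
nextflow run main.nf -resume

# Run a pipeline hosted on GitHub directly
nextflow run nextflow-io/hello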
[Diagram: RAW data flows through Channel 1 into Process A, then through Channel 2 into Process B]
Processes:
- The building blocks of the pipeline. Can contain Unix shell commands or code written in any scripting language (Python, Perl, R, Bash, etc.).
- Executed independently and in parallel.
- Contain directives for the container to use, CPU/RAM/disk limits, etc.

Channels:
- Unidirectional FIFO communication channels between processes.
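A minimal sketch of the diagram above (the file glob, file names and commands are our own): processA consumes raw data from one channel and feeds processB through a second one.

// Channel 1: raw data
rawData = Channel.fromPath('data/*.txt')

process processA {
    input:
    file f from rawData

    // Channel 2: connects processA to processB
    output:
    file 'upper.txt' into channel2

    """
    tr a-z A-Z < ${f} > upper.txt
    """
}

process processB {
    input:
    file upper from channel2

    """
    wc -l ${upper}
    """
}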
process <name> {
    input:
    <input channels>

    // Optional
    output:
    <output channels>

    """
    <command/code block>
    """
}
process someName {
    input:
    // Three alternative input declaration styles (illustrative)
    val x from Channel.value(1)
    file y from Channel.fromPath("/path/to/file")
    set x, file(fl) from Channel.fromFilePairs("/path")

    // Optional
    output:
    val "some_str" into outputChannel1
    set x, file("*_regex.fas") into outputChannel2

    """
    echo "Code goes here (bash by default)"
    echo "Access input variables like $x or ${y}"
    bash_var="2"
    echo "And access bash variables like \$bash_var"
    """
}
1. Set process name
2. Set one or more input channels
3. Output channels are optional
4. The code block is always last
process someName {
    input:
    val x from Channel.value(1)
    file y from Channel.fromPath("/path/to/file")
    set x, file(y) from Channel.fromFilePairs("/path")

    // Optional
    output:
    val "some_str" into outputChannel1
    set x, file("*_regex.fas") into outputChannel2

    """
    #!/usr/bin/python3
    print("Some python code")
    print("Access variables like ${x} and ${y}")
    """
}
1. Set process name
2. Set one or more input channels
3. Output channels are optional
4. The code block is always last
5. You can change the interpreter!
process someName {
    input:
    val x from Channel.value(1)
    file y from Channel.fromPath("/path/to/file")
    set x, file(y) from Channel.fromFilePairs("/path")

    // Optional
    output:
    val "some_str" into outputChannel1
    set x, file("*_regex.fas") into outputChannel2

    script:
    template "my_script.py"
}
1. Set process name
2. Set one or more input channels
3. Output channels are optional
4. The code block is always last
5. Or use templates!
#!/usr/bin/python
# Store this in templates/my_script.py
def main():
    print("Variables are here ${x} ${y}")

main()
startChannel = Channel.fromFilePairs(params.fastq)

process fastQC {
    input:
    set sampleId, file(fastq) from startChannel

    output:
    set sampleId, file(fastq) into fastqcOut

    """
    fastqc --extract --nogroup --format fastq \
        --threads ${task.cpus} ${fastq}
    """
}
process mapping {
    input:
    set sampleId, file(fastq) from fastqcOut
    each file(ref) from Channel.fromPath(params.reference)

    """
    bowtie2-build --threads ${task.cpus} \
        ${ref} genome_index
    bowtie2 --threads ${task.cpus} -x genome_index \
        -1 ${fastq[0]} -2 ${fastq[1]} -S mapping.sam
    """
}
[Diagram: startChannel → fastQC → fastqcOut → mapping, with a reference channel also feeding mapping]
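Assuming the two processes above live in a file called pipeline.nf (a name of our choosing), command-line options fill in params.fastq and params.reference:

nextflow run pipeline.nf --fastq 'data/*_{1,2}.fastq.gz' --reference 'ref/genome.fasta'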
process mapping {
    container "ummidock/bowtie2_samtools:1.0.0-1"
    cpus 1
    memory "1GB"
    publishDir "results/bowtie/"

    input:
    set sampleId, file(fastq) from fastqcOut
    each file(ref) from Channel.fromPath(params.reference)

    output:
    file '*_mapping.sam'

    """
    <command block>
    """
}
[Diagram: startChannel → fastQC → mapping → AssemblerOut, each process in its own container with its own resources (CPU: 2 / Mem: 3Gb; CPU: 1 / Mem: 1Gb)]
- Set a different container for each process
- Set custom CPU/RAM profiles that maximize pipeline performance
- Dozens more options in the directives documentation
// nextflow.config
params {
    fastq = "data/*_{1,2}.*"
    reference = "ref/*.fasta"
}

process {
    $fastQC.container = "ummidock/fastqc:0.11.5-1"
    $mapping.container = "ummidock/bowtie2_samtools:1.0.0-1"
    $mapping.publishDir = "results/bowtie/"
    $fastQC.cpus = 2
    $fastQC.memory = "2GB"
    $mapping.cpus = 4
    $mapping.memory = "2GB"
}

profiles {
    standard {
        docker.enabled = true
    }
}
nextflow.config
- Parameters
- Process directives
- Docker options
- Profiles
// nextflow.config
profiles {
    standard {
        docker.enabled = true
    }
    lobo {
        process.executor = "slurm"
        shifter.enabled = true
        process.$mapping.cpus = 8
    }
}
The standard profile is the default.
Profiles can override more general settings.
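Profiles are selected at run time with the -profile option (pipeline.nf as above):

# Uses the default 'standard' profile: Docker on the local machine
nextflow run pipeline.nf

# Submits tasks to SLURM and runs them with Shifter, using 8 CPUs for mapping
nextflow run pipeline.nf -profile lobo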
[Diagram: operators acting on the channels between processes (startChannel → fastQC → fastqcOut → mapping)]
Operators can transform and connect channels without interfering with processes.
There are dozens of available operators.
Filter channels:
Channel
    .from( 'a', 'b', 'aa', 'bc', 3, 4.5 )
    .filter( ~/^a.*/ )
    .into { newChannel }
// newChannel emits: 'a', 'aa'
Map channels:
Channel
    .from( 1, 2, 3 )
    .map { it * it }
    .into { newChannel }
// newChannel emits: 1, 4, 9
Fork channels:
Channel
    .from( 'a', 'b', 'aa', 'bc', 3, 4.5 )
    .into { newChannel1; newChannel2 }
// Both channels receive every item
Collect channels:
Channel
    .from( 1, 2, 3, 4 )
    .collect()
    .into { oneChannel }
// oneChannel emits a single item: [1, 2, 3, 4]
Example:
process integrity_coverage {
    input:
    <inputs>

    output:
    set fastq_id,
        file(fastq_pair),
        file('*_encoding'),
        file('*_phred'),
        file('*_coverage'),
        file('*_max_len') into MAIN_integrity

    script:
    template "integrity_coverage.py"
}
LOG_corrupted = Channel.create()
MAIN_PreCoverageCheck = Channel.create()

// Route corrupted samples (choice 0) to LOG_corrupted,
// everything else (choice 1) to MAIN_PreCoverageCheck
MAIN_integrity.choice(LOG_corrupted, MAIN_PreCoverageCheck) {
    a -> a[2].text == "corrupt" ? 0 : 1
}
MAIN_outputChannel = Channel.create()
SIDE_phred = Channel.create()
SIDE_max_len = Channel.create()

// Drop samples that failed the coverage check, then split each item into
// three channels: (id, fastq pair), (id, phred score), (id, max read length)
MAIN_PreCoverageCheck
    .filter{ it[4].text != "fail" }
    .separate(MAIN_outputChannel, SIDE_phred, SIDE_max_len){
        a -> [ [a[0], a[1]], [a[0], a[3].text], [a[0], a[5].text] ]
    }
process report_corrupt {
    input:
    val fastq_id from LOG_corrupted.collect{ it[0] }

    output:
    file 'corrupted_samples.txt'

    """
    echo ${fastq_id.join(",")} | tr "," "\\n" >> corrupted_samples.txt
    """
}
Nextflow website: https://www.nextflow.io/
Nextflow docs: https://www.nextflow.io/docs/latest/index.html
Awesome nextflow: https://github.com/nextflow-io/awesome-nextflow
Acknowledgements
M. Ramirez Lab