Nextflow for beginners

Agenda:

  • What is the challenge
  • What is Nextflow (NF)
  • Nextflow basics
  • Superpowers
  • Existing pipelines and good boilerplates
  • Q&A

What is the challenge?

Data → Function → Output data

The function is a piece of software with a given:

- Version

- Options

Any change in:

- Data

- Software version

- Command options

can result in different output data.

Results should be reproducible!

Parallelisation of tasks

is efficient

There are different HPC scheduling systems/executors

We should isolate software dependencies so that every user gets exactly the same software version

It is hard and error-prone to install software manually

Goal

Data

Packaged software

Guidelines on how to process data with the given software

Output

DSL - Domain-Specific Language

What is Nextflow?

Data

Software package

Guidelines on how to process data with the given software using a specific executor

Submit tasks in parallel to the specified executor

Output
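
The software package and the executor are usually declared in a configuration file rather than in the pipeline code. A minimal sketch of a nextflow.config, assuming a SLURM cluster with Singularity available; the queue name and container image are hypothetical placeholders, not values from this talk:

```nextflow
// nextflow.config (sketch, values below are placeholders)
process {
    executor  = 'slurm'               // submit every task to the SLURM scheduler
    queue     = 'main'                // hypothetical queue name
    container = 'example/star:2.7.0'  // hypothetical image pinning the software version
}

singularity {
    enabled = true                    // run each task inside the container
}
```

Switching to another scheduler should then only require changing the executor value, while the pipeline script stays the same.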

Nextflow basics

- Processes

- Channels

Process

process makeSTARindex {
    input:
    file fasta from ch_fasta_for_star_index

    output:
    file "star" into star_index

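    // Note: gtf and avail_mem are not declared on this slide; they are
    // assumed to be defined elsewhere in the full pipeline.
    script: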
    """
    mkdir star
    STAR \\
        --runMode genomeGenerate \\
        --runThreadN ${task.cpus} \\
        --sjdbGTFfile $gtf \\
        --genomeDir star/ \\
        --genomeFastaFiles $fasta \\
        $avail_mem
    """
}
  • Takes input
  • Executes a script
  • Passes on the output of the script
  • Has its own requirements (e.g. CPUs, memory, software)
  • Has one specific job
  • Fires automatically when its input is available
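
The points above can be sketched with a much smaller process. This is an illustration only, written in the same DSL1 syntax as the slides (newer Nextflow releases use DSL2, where processes are connected in a workflow block instead of with from/into). The process name, channel names and the data/ directory are hypothetical:

```nextflow
// Hypothetical input: every .txt file in a local data/ directory
ch_text_files = Channel.fromPath('data/*.txt')

process countLines {
    cpus 1            // resource requirements are declared per process
    memory '1 GB'

    input:
    file txt from ch_text_files      // one task is started per file

    output:
    file "${txt.baseName}.count" into ch_counts

    script:
    """
    wc -l < ${txt} > ${txt.baseName}.count
    """
}

// Print the path of each result file as soon as it is produced
ch_counts.view()
```

No loop is written anywhere: the process runs once for each item that arrives on its input channel.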

Channel

process makeSTARindex {
    input:
    file fasta from ch_fasta_for_star_index // I am a channel

    output:
    file "star" into star_index // I am also a channel

    script:
    """
    mkdir star
    STAR \\
        --runMode genomeGenerate \\
        --runThreadN ${task.cpus} \\
        --sjdbGTFfile $gtf \\
        --genomeDir star/ \\
        --genomeFastaFiles $fasta \\
        $avail_mem
    """
}
  • Passes data between processes asynchronously
  • There are multiple operations you can perform on channels
    • merge, mix, fork, combine, divide, etc.
  • There are different types of channels
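
Two of the operators mentioned above, as a small sketch with made-up values. Channel.of, mix, combine and view are standard Nextflow channel factories and operators; in recent releases these statements may need to be placed inside a workflow { } block:

```nextflow
// mix: merge the items of two channels into one
// (the order in which items arrive is not guaranteed)
Channel.of(1, 2, 3)
    .mix(Channel.of('a', 'b'))
    .view { "mix: $it" }

// combine: emit every pairing of items from the two channels
Channel.of(1, 2)
    .combine(Channel.of('a', 'b'))
    .view { "combine: $it" }
```

Each chain builds its own channels because, in the DSL1 syntax used on these slides, a channel can be consumed only once.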

Channel

process plink_to_vcf{
    input:
    set file(bed), file(bim), file(fam) from harmonised_genotypes

    output:
    file "harmonised.vcf.gz" into harmonised_vcf_ch

    script:
    """
    plink2 --bfile ${bed.simpleName} --recode vcf-iid --out harmonised
    bgzip harmonised.vcf
    """
}

process vcf_fixref{
    input:
    file input_vcf from harmonised_vcf_ch
    
    output:
    file "fixref.vcf.gz" into filter_vcf_input

    script:
    """
    bcftools index ${input_vcf}
    bcftools +fixref ${input_vcf} -- -f ${fasta} -i ${vcf_file} | \
     bcftools norm --check-ref x -f ${fasta} -Oz -o fixref.vcf.gz
    """
}

Superpowers?
