[Diagram: Data -> Function (Software: Version, Options) -> Output Data]
Any change in:
- Data
- Software version
- Command options
can result in different output data.
Results should be reproducible!
Parallelisation of tasks is efficient
There are different HPC scheduling systems/executors
We should isolate software dependencies so that the user gets the exact version of each tool
It is hard and error-prone to install software manually
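As an illustration of the two points above, here is a minimal sketch of a `nextflow.config` that selects an executor and pins a container image. The image name `example/star:2.7.3a` is a hypothetical placeholder, not taken from these slides, and `slurm` is only one of several executors that could be chosen.

```groovy
// nextflow.config (sketch; all values are placeholders)
process {
    // Which scheduler receives the tasks, e.g. 'local', 'slurm', 'sge', 'lsf', 'pbs'
    executor  = 'slurm'

    // Hypothetical image name: pins the exact software version every task runs with
    container = 'example/star:2.7.3a'
}

// Run each task inside the container using Singularity
singularity {
    enabled = true
}
```

With a setup like this, moving to another cluster should mostly come down to changing the `executor` value, while the container keeps the software version fixed.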
Data
Packaged software
Guidelines on how to process the data with the given software
Output
DSL - Domain-Specific Language
Data
Software package
Guidelines on how to process the data with the given software
Guidelines on how to process the data with the given software using a specific executor
Submit tasks in parallel to the specified executor
Output
- Processes
- Channels
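To show how the two building blocks fit together, here is a minimal sketch in the same DSL1 syntax used by the processes in these slides. The parameter `params.fasta`, the process name `countSequences` and the channel names are hypothetical. Note that DSL1 (`from` / `into`) is only accepted by older Nextflow releases; newer releases use DSL2, which wires processes together differently.

```groovy
// Hypothetical input file, overridable with --fasta on the command line
params.fasta = 'genome.fa'

// A channel: a queue that carries data (here, one file) to a process
ch_fasta = Channel.fromPath(params.fasta)

// A process: a task that reads from a channel and writes to another
process countSequences {
    input:
    file fasta from ch_fasta

    output:
    file 'count.txt' into ch_counts

    script:
    """
    grep -c '>' $fasta > count.txt
    """
}

// Print the path of each file emitted by the process
ch_counts.view()
```

Each item arriving on `ch_fasta` triggers one independent task, which is what allows the tasks to be submitted in parallel.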
process makeSTARindex {
    input:
    file fasta from ch_fasta_for_star_index  // I am a channel

    output:
    file "star" into star_index  // I am also a channel

    script:
    // gtf and avail_mem are not declared in this excerpt;
    // they are assumed to be defined elsewhere in the pipeline
    """
    mkdir star
    STAR \\
        --runMode genomeGenerate \\
        --runThreadN ${task.cpus} \\
        --sjdbGTFfile $gtf \\
        --genomeDir star/ \\
        --genomeFastaFiles $fasta \\
        $avail_mem
    """
}
process plink_to_vcf {
    input:
    set file(bed), file(bim), file(fam) from harmonised_genotypes

    output:
    file "harmonised.vcf.gz" into harmonised_vcf_ch

    script:
    """
    plink2 --bfile ${bed.simpleName} --recode vcf-iid --out ${bed.simpleName}
    bgzip harmonised.vcf
    """
}
process vcf_fixref {
    input:
    file input_vcf from harmonised_vcf_ch

    output:
    file "fixref.vcf.gz" into filter_vcf_input

    script:
    // fasta and vcf_file are not declared in this excerpt;
    // they are assumed to be defined elsewhere in the pipeline
    """
    bcftools index ${input_vcf}
    bcftools +fixref ${input_vcf} -- -f ${fasta} -i ${vcf_file} | \
        bcftools norm --check-ref x -f ${fasta} -Oz -o fixref.vcf.gz
    """
}
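The two processes above are linked by the `harmonised_vcf_ch` channel: it is the output of `plink_to_vcf` and the input of `vcf_fixref`. The `harmonised_genotypes` channel that starts the chain is not shown in these slides. Below is a minimal DSL1 sketch of how such a channel could be built, assuming a hypothetical `params.bfile` parameter that holds the PLINK file prefix.

```groovy
// Hypothetical parameter: prefix shared by the .bed/.bim/.fam files
params.bfile = 'harmonised'

// Emit one item holding the three PLINK files, matching
// `set file(bed), file(bim), file(fam)` in plink_to_vcf
Channel
    .from(params.bfile)
    .map { prefix -> [file("${prefix}.bed"), file("${prefix}.bim"), file("${prefix}.fam")] }
    .set { harmonised_genotypes }
```

With the prefix `harmonised`, `plink2 --out ${bed.simpleName}` would write `harmonised.vcf`, which is the file name the `bgzip` step expects. A different prefix would require adjusting that step.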