a microbial bioinformatics case study
Computational Biology and Bioinformatics Seminars
30 November 2022
@ines_cim
cimendes
Inês Mendes
Instituto de Medicina Molecular João Lobo Antunes
Bacterial Population Genetics
Pathogenesis and Natural History of Infection
Outbreak Investigation and Control
Surveillance of Infectious Diseases
Magic box of NGS Wonders for Microbiology
Completely characterized strain:
Can person X, with the same data and the same methodology, obtain the same conclusions as person Y?
The needs:
Findable, Accessible, Interoperable, Reusable
https://doi.org/10.5281/zenodo.3332807
The quality of the form of the software can be covered by FAIR data principles
The quality of the functionality of the software goes beyond the FAIR principles:
Form versus function of software
Versioning, Collaboration and Accountability
Collaboration: a version control system (VCS), such as Git, is designed to solve the problem of multiple people working on the same code.
Storing versions properly: a stage of the project can be saved, through a commit, alongside a message. In the end, you only have one version of the project on your disk. Everything else is neatly packed up inside the VCS.
Auditability: a VCS offers an easy way to track "who made what change and when", allowing you to (1) debug more effectively, (2) track the reason certain changes were made and (3) find the person who made a change.
Reproducibility, Deployability
Runs the same regardless of the environment.
A container image is a lightweight, stand-alone, executable package of software that includes everything needed to run it.
Figure: Virtual Machines versus Containers. Virtual machine stack: Host Hardware, Host OS, Hypervisor, one Guest OS per VM (VM1, VM2), Apps. Container stack: Host Hardware, Host OS, Container Engine, Apps.
Figure: container image lifecycle on a host: build, push, pull & run.
Reproducibility, Scalability, Sharability
The game-changing combination of workflow managers and containers:
Workflows in the Modern era:
Enables scalable and reproducible scientific workflows. It simplifies the deployment of complex parallel and reactive workflows.
Reactive workflow framework
Create pipelines with asynchronous (and implicitly parallelized) data streams
Programming DSL
Has its own language for building a pipeline
Containerized
Out-of-the-box integration with container engines (Docker, Singularity, Shifter)
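The idea of processes connected by asynchronous data streams can be sketched in plain Python. This is not the workflow manager's own language: it is a minimal illustration using asyncio queues as channels, and the names emit, stage, trim, assemble and run_pipeline are hypothetical. asyncio gives concurrency on a single thread rather than true parallel execution, so treat it as a model of the dataflow pattern only.

```python
import asyncio

DONE = object()  # sentinel marking the end of a stream


async def emit(items, channel):
    """Push every item into the channel, then signal the end of the stream."""
    for item in items:
        await channel.put(item)
    await channel.put(DONE)


async def stage(func, inbox, outbox):
    """Apply func to each item as soon as it arrives, forwarding the result."""
    while True:
        item = await inbox.get()
        if item is DONE:
            await outbox.put(DONE)
            return
        await outbox.put(await func(item))


async def trim(sample):
    await asyncio.sleep(0)  # stand-in for real work (hypothetical step)
    return f"{sample}.trimmed"


async def assemble(sample):
    await asyncio.sleep(0)  # stand-in for real work (hypothetical step)
    return f"{sample}.contigs"


async def collect(channel):
    """Gather everything that reaches the end of the pipeline."""
    results = []
    while True:
        item = await channel.get()
        if item is DONE:
            return results
        results.append(item)


async def run_pipeline(samples):
    reads, trimmed, assembled = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    # All stages run concurrently; each reacts to data as it becomes available.
    *_, results = await asyncio.gather(
        emit(samples, reads),
        stage(trim, reads, trimmed),
        stage(assemble, trimmed, assembled),
        collect(assembled),
    )
    return results
```

Calling asyncio.run(run_pipeline(["sampleA", "sampleB"])) drives both samples through both stages. No stage is told when to start: each one simply waits on its input channel, which is the "reactive" part of the model.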
Accountability in Development
Figure: Traditional Technique Approach (Design, Code, Test) versus Test Driven Development Approach (Test, Code, Refactor).
In the field of microbial bioinformatics, good software engineering practices are not yet widely adopted. (...) This paper serves as a resource that could help microbial bioinformaticians get started with software testing if they have not had formal training.
Include test files with known expected outcomes for a successful run.
Include files or other inputs on which the tool is expected to fail.
unittest (https://docs.python.org/3/library/unittest.html)
pytest (https://docs.pytest.org/en/stable/)
testthat (https://testthat.r-lib.org/)
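The two recommendations above (a test with a known expected outcome, and an input on which the tool is expected to fail) might look like this with unittest. The function gc_content is a hypothetical helper invented for the illustration and does not come from any tool discussed here.

```python
import unittest


def gc_content(sequence):
    """Return the GC fraction of a DNA sequence (hypothetical example function)."""
    if not sequence:
        raise ValueError("empty sequence")
    sequence = sequence.upper()
    invalid = set(sequence) - set("ACGTN")
    if invalid:
        raise ValueError(f"invalid characters: {sorted(invalid)}")
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


class TestGCContent(unittest.TestCase):
    def test_known_expected_outcome(self):
        # Input with a known answer: 2 of 4 bases are G or C.
        self.assertAlmostEqual(gc_content("ATGC"), 0.5)

    def test_input_expected_to_fail(self):
        # Inputs on which the function is expected to fail loudly.
        with self.assertRaises(ValueError):
            gc_content("ATGX")
        with self.assertRaises(ValueError):
            gc_content("")
```

Saved as, say, test_gc.py (an assumed file name), the tests run with python -m unittest test_gc.py. In a test-driven workflow the two test methods would be written first and the function body filled in until they pass.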
The final frontier
de novo assembly benchmark
Assembly
Assembly methods produce longer sequences that are more informative than the short sequencing reads they are built from, and can give a more complete picture of the microbial community in a given sample.
Reads
Contigs
Genomes
de novo Assembly
Martin Ayling, Matthew D Clark, Richard M Leggett, New approaches for metagenome assembly with short reads, Briefings in Bioinformatics, Volume 21, Issue 2, March 2020, Pages 584–594, https://doi.org/10.1093/bib/bbz020
Major issues
https://github.com/B-UMMI/LMAS
https://lmas.readthedocs.io/
Last Metagenomic Assembler Standing
Automated workflow enabling the benchmarking of genomic and metagenomic prokaryotic de novo assembly software using defined mock communities.
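To give a flavour of what comparing assemblers involves, the sketch below computes a few basic contiguity statistics from contig lengths, including the widely used N50. It is an illustration only: the function names and the input layout are hypothetical, and the metrics LMAS actually reports should be checked in its documentation.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    if not lengths:
        return 0
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def summarize(assemblies):
    """Summarize each assembler's output.

    assemblies maps an assembler name to a list of contig lengths
    (an assumed input layout chosen for this example).
    """
    return {
        name: {
            "contigs": len(lengths),
            "total_bp": sum(lengths),
            "n50": n50(lengths),
        }
        for name, lengths in assemblies.items()
    }
```

For example, summarize({"assemblerA": [100, 80, 50, 30, 20]}) reports 5 contigs, 280 bp in total and an N50 of 80. Contiguity alone says nothing about correctness, which is why a benchmark against defined mock communities with known references is needed.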
Install a container engine (Docker, Singularity, Shifter...)
apt install docker-ce
Install LMAS
conda install -c bioconda LMAS
Run LMAS
LMAS --fastq <reads_{1,2}.fq.gz> --reference <reference.fasta>
Thank you for your attention
SFRH/BD/129483/2017