From reproducibility to actionability

a microbial bioinformatics case study

Computational Biology and Bioinformatics Seminars

30 November 2022

@ines_cim

cimendes

Inês Mendes

Instituto de Medicina Molecular João Lobo Antunes

Clinical Microbiology

Bacterial Population Genetics

Pathogenesis and Natural History of Infection

Outbreak Investigation and Control

Surveillance of Infectious Diseases

 | Components

Bioinformatics

 | Wishful thinking

Magic box of NGS Wonders for Microbiology

Completely characterized strain:

  • Identification & Typing
  • Antibiotic resistance profile
  • Virulence factors present
  • Other information
    • spa (S. aureus)
    • emm (GAS)

Can person X, with the same data and the same methodology, obtain the same conclusions as person Y? 

Black Box

  • Commercial/Freeware
  • You get what it gives you
  • Ready to use
  • Stealth change
  • Standalone

Glass Box

  • Freeware
  • You can "tailor"
  • "Major" headache
  • Visible change
  • Dependencies

Reproducibility

 | The needs

  • Do I know what is happening?
  • Is it reproducible?
  • Is it shareable?

Reproducibility

 | The needs

The needs:

  • Analyze a large amount of sequence data routinely
  • Some computationally intensive steps
  • Constantly updating/adding software

Reproducibility

 | The challenges

  • No standard way of describing experiments, environments, (derived) data, and workflows.
  • No transparency in creating environments and steps/methods to recreate analysis.
  • The experimental nature of the research code and ecosystem makes it often hard to build.
  • Unresolved or undocumented dependencies.
  • Infrastructure for storage and distribution.
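One small step towards documented dependencies is to record the software environment next to every set of results. The sketch below is only an illustration using the Python standard library; the function name `snapshot_environment` is an assumption made for this example, and it covers Python packages only, not system tools or libraries.

```python
import json
import platform
from importlib import metadata


def snapshot_environment() -> dict:
    """Collect interpreter, OS, and installed Python package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs that do not declare a name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


# Print the snapshot as JSON so it can be stored alongside the analysis output.
print(json.dumps(snapshot_environment(), indent=2))
```

Saving this output with each run makes it possible to see, after the fact, which versions produced a given result.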

 

What is our role, as computational biologists, in addressing these challenges?

 

FAIR Data Principles

Findable, Accessible, Interoperable, Reusable

FAIR Principles

 | Not just for data

https://doi.org/10.5281/zenodo.3332807

FAIR Principles

 | Not just for data

The quality of the form of the software can be covered by FAIR data principles

  • Code quality 
  • Maintainability

The quality of the functionality of the software goes beyond the FAIR principles:

  • Correctness
  • Security
  • Efficiency

Form versus function of software

Version Control

Versioning, Collaboration and Accountability

Version Control

 | What is it?

Version Control

 | What does it allow for?

Collaboration: Version control systems (VCS), such as Git, were designed to solve the problem of multiple people working on the same code.

Storing versions properly: The state of a project can be saved at any point through a commit, alongside a message. You keep only one working copy of the project on disk; every earlier version is neatly packed away inside the VCS.

Auditability: A VCS gives you an easy way to track "who made what change and when", allowing you to (1) debug more effectively, (2) trace the reason a change was made and (3) find the person who made it.
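The "who made what change and when" question can be answered from the commit history. As a minimal sketch, the code below parses text in the shape produced by `git log --pretty=format:"%h|%an|%ad|%s" --date=short`. The log text, author names, and commit messages are invented for illustration, and the parser assumes author names do not contain "|".

```python
from dataclasses import dataclass

# Matches: git log --pretty=format:"%h|%an|%ad|%s" --date=short
# %h = abbreviated hash, %an = author name, %ad = author date, %s = subject
LOG_FORMAT = "%h|%an|%ad|%s"


@dataclass
class Commit:
    short_hash: str
    author: str
    date: str
    message: str


def parse_git_log(text: str) -> list:
    """Turn '|'-separated git log lines into Commit records."""
    commits = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Split at most 3 times so a '|' inside the message is preserved.
        short_hash, author, date, message = line.split("|", 3)
        commits.append(Commit(short_hash, author, date, message))
    return commits


# Hypothetical log output, made up for this example.
EXAMPLE_LOG = """\
a1b2c3d|Person X|2022-11-28|Pin assembler version in the container
e4f5a6b|Person Y|2022-11-29|Fix off-by-one in contig filter
"""

for commit in parse_git_log(EXAMPLE_LOG):
    print(f"{commit.date}  {commit.author}: {commit.message} ({commit.short_hash})")
```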

Container Software

Reproducibility, Deployability

Software Containers

 | What is it?

Runs the same regardless of the environment.

A container image is a lightweight, stand-alone, executable package of software that includes everything needed to run it:

  • code
  • runtime
  • system tools
  • system libraries
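To make the idea concrete, here is a hedged sketch of how a script might assemble a `docker run` command so that a tool always runs inside the same image. The image name `example/assembler:1.0.0`, the tool name, and the paths are hypothetical; only the `docker run` options `--rm`, `-v` and `-w` are real.

```python
import shlex


def docker_run_command(image, tool_args, host_dir, container_dir="/data"):
    """Build the argument list for running a tool inside a container."""
    return [
        "docker", "run",
        "--rm",                               # remove the container when it exits
        "-v", f"{host_dir}:{container_dir}",  # bind-mount the data directory
        "-w", container_dir,                  # use the mounted directory as working dir
        image,                                # pinned image tag = pinned software stack
        *tool_args,
    ]


cmd = docker_run_command(
    image="example/assembler:1.0.0",    # hypothetical image name and tag
    tool_args=["assembler", "--help"],  # hypothetical tool invocation
    host_dir="/home/user/project",      # hypothetical host path
)

# Print the command; it could be handed to subprocess.run(cmd) on a host with Docker.
print(shlex.join(cmd))
```

Pinning an explicit tag rather than relying on `latest` is what keeps the code, runtime, system tools and libraries identical between runs.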

Software Containers

 | What is it?

Diagram: Virtual Machines versus Containers. Virtual Machines stack: Host Hardware, Host OS, Hypervisor, then a Guest OS with its Apps in each VM (VM1, VM2). Containers stack: Host Hardware, Host OS, Container Engine, then the Apps.

Docker

 | The ecosystem

Diagram: an image is built on the host, pushed to Docker Hub, and later pulled and run on a host.

Workflow Managers

Reproducibility, Scalability, Shareability

The game-changing combination of workflow managers and containers:

  • Portability
  • Reproducibility
  • Scalability
  • Multi-scale containerization
  • Native cloud support

Reproducibility

 | The motivation

Workflows in the Modern era:

Workflow Managers

 | What is it?

Enables scalable and reproducible scientific workflows. It simplifies the deployment of complex parallel and reactive workflows.

Reactive workflow framework

Create pipelines with asynchronous (and implicitly parallelized) data streams

Programming DSL

Has its own language for building a pipeline

Containerized

Out-of-the-box integration with container engines (Docker, Singularity, Shifter)
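The reactive, stream-based idea can be illustrated outside any workflow manager. The sketch below uses Python's `asyncio` queues as stand-ins for data streams: each stage starts working as soon as an item arrives, without waiting for the previous stage to finish all samples. It is not the DSL or API of any workflow manager, the stage functions are placeholders, and `asyncio` gives concurrency in one process rather than true parallel execution.

```python
import asyncio


async def producer(samples, out_queue):
    """Emit each sample into the first stream, then signal the end."""
    for sample in samples:
        await out_queue.put(sample)
    await out_queue.put(None)  # sentinel: no more items


async def stage(in_queue, out_queue, work):
    """Consume items as they arrive, process them, and pass them on."""
    while True:
        item = await in_queue.get()
        if item is None:
            await out_queue.put(None)  # forward the sentinel downstream
            break
        await asyncio.sleep(0)  # yield control, standing in for real work
        await out_queue.put(work(item))


async def run_pipeline(samples):
    reads, trimmed, done = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    tasks = [
        asyncio.create_task(producer(samples, reads)),
        # Placeholder steps: they only tag the sample name.
        asyncio.create_task(stage(reads, trimmed, lambda s: f"{s}.trimmed")),
        asyncio.create_task(stage(trimmed, done, lambda s: f"{s}.assembled")),
    ]
    results = []
    while True:
        item = await done.get()
        if item is None:
            break
        results.append(item)
    await asyncio.gather(*tasks)
    return results


results = asyncio.run(run_pipeline(["sampleA", "sampleB", "sampleC"]))
print(results)
```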

Monolithic pipelines

Need to change often

Siloed tool containers

Don't do much by themselves

Software Testing

Accountability in Development

Code Development

 | Approaches

Traditional Technique Approach: Design, Code, Test

Test-Driven Development Approach: Test, Code, Refactor
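A minimal sketch of the test-driven order, using a hypothetical `gc_content` function chosen only for illustration: the test is written first and states the expected behaviour, then the simplest code that satisfies it is added.

```python
# Step 1 (Test): describe the expected behaviour before any implementation exists.
def test_gc_content():
    assert gc_content("GGCC") == 1.0
    assert gc_content("ATAT") == 0.0
    assert gc_content("ATGC") == 0.5
    assert gc_content("atgc") == 0.5  # lower-case input is accepted
    assert gc_content("") == 0.0      # empty sequence is defined as 0.0


# Step 2 (Code): write the simplest implementation that makes the test pass.
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a nucleotide sequence."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Step 3 (Refactor): with the test in place, the code can be restructured
# safely, re-running the test after each change.
test_gc_content()
print("all tests passed")
```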

In the field of microbial bioinformatics, good software engineering practices are not yet widely adopted. (...) This paper serves as a resource that could help microbial bioinformaticians get started with software testing if they have not had formal training.

Recommendation #1

Establish software needs and testing goals

Recommendation #2

Input test files: the good, the bad, and the ugly

Include test files with known expected outcomes for a successful run.

 

Include files or other inputs on which the tool is expected to fail.

Recommendation #3

Use an established framework to implement testing

unittest (https://docs.python.org/3/library/unittest.html)

pytest (https://docs.pytest.org/en/stable/)

testthat (https://testthat.r-lib.org/)
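As a hedged illustration of Recommendations #2 and #3 together, the sketch below uses `unittest` to test a small, hypothetical FASTA parser written for this example against a good input, a bad input that must fail, and an "ugly" input with stray blank lines.

```python
import unittest


def parse_fasta(text):
    """Return {header: sequence}; raise ValueError on malformed input."""
    records = {}
    header = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("empty FASTA header")
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data before first header")
        else:
            records[header] += line
    return records


class ParseFastaTests(unittest.TestCase):
    def test_good_input(self):
        # Known input with a known expected outcome.
        text = ">seq1\nACGT\nACGT\n>seq2\nGGCC\n"
        self.assertEqual(parse_fasta(text), {"seq1": "ACGTACGT", "seq2": "GGCC"})

    def test_bad_input_without_header(self):
        # Input on which the tool is expected to fail.
        with self.assertRaises(ValueError):
            parse_fasta("ACGT\n")

    def test_ugly_input_with_blank_lines(self):
        # Messy but valid input that should still parse.
        self.assertEqual(parse_fasta("\n>seq1\n\nACGT\n\n"), {"seq1": "ACGT"})


# Run the tests explicitly so the script keeps going after the run.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseFastaTests)
result = unittest.TextTestRunner(verbosity=1).run(suite)
```

In a real project the test class would usually live in its own file and be discovered with `python -m unittest`.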

Recommendation #4

Testing is good, automated testing is better

Recommendation #5

 

Ensure portability by testing on several platforms

Recommendation #6

Showcase the tests

Actionable Reports

The final frontier

Actionability

 | The interactive reports

  • Intuitive and responsive reports enable collaborative research and empower users across domains.
  • There's a tradeoff between ease of deployment and ease of use

Actionability

 | Rmarkdown

Actionability

 | Streamlit

Actionability

 | Datapane

Actionability

 | Javascript

LMAS

de novo assembly benchmark

Microbial (meta)genomics

Microbial (meta)genomics

| Assembly

Assembly methods produce longer sequences that are more informative than the short sequencing reads they are built from, and can give a more complete picture of the microbial community in a given sample.

Reads

Contigs

Genomes

Microbial (meta)genomics

| de novo Assembly

Martin Ayling, Matthew D Clark, Richard M Leggett, New approaches for metagenome assembly with short reads, Briefings in Bioinformatics, Volume 21, Issue 2, March 2020, Pages 584–594, https://doi.org/10.1093/bib/bbz020

Microbial (meta)genomics

| de novo Assembly

Major issues

  • Results are highly dependent on the tools chosen for the analysis: lack of standardization and proper benchmarking.
  • This highlights both the potential and the limitations of shotgun metagenomics as a diagnostic tool: lack of reproducibility.

https://github.com/B-UMMI/LMAS

https://lmas.readthedocs.io/

LMAS

| Last Metagenomic Assembler Standing

Automated workflow enabling the benchmarking of genomic and metagenomic prokaryotic de novo assembly software using defined mock communities.
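To give a feel for what benchmarking assemblers involves, the sketch below computes N50, a widely used contiguity metric, for two sets of contig lengths. It is an illustration only and is not taken from the LMAS code base; the assembler names and contig lengths are invented.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, reaches at least half of the assembly size."""
    if not contig_lengths:
        return 0
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical contig lengths (in bp) from two hypothetical assemblers.
assemblies = {
    "assembler_a": [100, 80, 60, 40, 20],
    "assembler_b": [250, 30, 20],
}

for name, lengths in assemblies.items():
    print(f"{name}: {len(lengths)} contigs, N50 = {n50(lengths)}")
```

Both example assemblies total 300 bp, yet their N50 values differ, which shows why the same reads can yield different conclusions depending on the assembler used.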

LMAS

| Last Metagenomic Assembler Standing

A container engine (Docker, Singularity, Shifter...).

apt-get install docker-ce

Install LMAS

conda install -c bioconda LMAS

Run LMAS

LMAS --fastq <reads_{1,2}.fq.gz> --reference <reference.fasta>

Towards Accreditation in Metagenomics for clinical Microbiology

Thank you for

your attention

SFRH/BD/129483/2017
