Inês Mendes
Bioinformatics PhD student.
Towards Accreditation in Metagenomics for Clinical Microbiology
Catarina Inês Marques de Sousa Mendes
Programa de Doutoramento do Centro Académico de Medicina de Lisboa
Advisors:
Professor Doutor João André Nogueira Custódio Carriço
Professor Doutor Mário Nuno Ramos de Almeida Ramirez
Global number of deaths (A) and Years of Life Lost (B), by pathogen and infectious syndrome, in 2019. Adapted from Ikuta et al., 2022, on behalf of GBD 2019 Antimicrobial Resistance Collaborators.
Microbial pathogens are responsible for more than 400 million Years of Life Lost (YLL), a higher burden than either cancer or cardiovascular disease (Global Burden of Disease 2019).
In addition to the emergence of virulent pathogens, the rise of resistance poses a major threat to human health worldwide.
The still ongoing COVID-19 pandemic has, to date, claimed over 6 million lives (WHO).
World Health Organisation Global Priority Pathogens list. This catalogue includes, besides Mycobacterium tuberculosis considered the number one global priority, a list of twelve microorganisms grouped under three priority tiers according to their antimicrobial resistance: critical (Acinetobacter baumannii, Pseudomonas aeruginosa and Enterobacteriaceae), high (Enterococcus faecium, Helicobacter pylori, Salmonella species, Staphylococcus aureus, Campylobacter species and Neisseria gonorrhoeae), and medium (Streptococcus pneumoniae, Haemophilus influenzae and Shigella species). The major objective was to encourage the prioritisation of funding and incentives, align research and development priorities of public health relevance, and garner global coordination in the fight against antimicrobial-resistant bacteria. Adapted from World Health Organization, 2017.
Clinical microbiology can be defined as the characterisation of pathogen samples to direct the management of individual infected patients (diagnostics) and monitor the epidemiology of infectious diseases (public health).
Bacterial Population Genetics
Pathogenesis and Natural History of Infection
Outbreak Investigation and Control
Surveillance of Infectious Diseases
Principles of current processing of bacterial pathogens. Schematic representation of the current workflow for processing samples for bacterial pathogens is presented, with high complexity and a typical timescale of a few weeks to a few months. Samples that are likely to be normally sterile are often cultured on rich medium that will support the growth of any culturable organism. Samples contaminated with colonising flora present a challenge for growing the infecting pathogen. Many types of culture media (referred to as selective media) are used to favour the growth of the suspected pathogen. Once an organism is growing, the likely pathogens are then processed through a complex pathway that has many contingencies to determine species and antimicrobial susceptibility. Broadly, there are two approaches. One approach uses MALDI-TOF for species identification prior to setting up susceptibility testing. The other uses Gram staining followed by biochemical testing to determine species; susceptibility testing is often set up simultaneously with doing biochemical tests. Lastly, depending on the species and perceived likelihood of an outbreak, a small subset of isolates may be chosen for further investigation using a wide range of typing tests. Adapted from Didelot et al., 2012
The three revolutions in sequencing technology that have transformed the landscape of bacterial genome sequencing. The first generation, also known as Sanger sequencers, is represented by the ABI Capillary Sequencer (Applied Biosystems). The second generation, also known as high-throughput sequencers, is represented by the MiSeq, a 4-channel sequencer, and the NextSeq, a 2-channel sequencer (Illumina), both sequencing-by-synthesis instruments. These instruments allow the sequencing of both ends of the DNA fragment. Lastly, the third generation, also known as long-read sequencers, is represented by the Pacific Biosciences RS sequencer and the Oxford Nanopore MinION sequencer. Adapted from Hagemann, 2015; Loman and Pallen, 2015; Goodwin et al., 2016; Wang et al., 2021; Metzker, 2010; Xu et al., 2020.
Principles of current processing of bacterial pathogens based on whole-genome sequencing. Schematic representation of the workflow for processing samples for bacterial pathogens after the adoption of whole-genome sequencing, with an expected timescale that could fit within a single day. The culture steps would be the same as currently used in a routine microbiology laboratory. Once a likely pathogen is ready for sequencing, DNA will be extracted, taking as little as 2 hours to prepare the DNA for sequencing. After sequencing, the main processes for yielding information will be computational. Automated sequence assembly algorithms are necessary for processing the raw sequence data, from which species, relationship to other isolates of the same species, antimicrobial resistance profile and virulence gene content can be assessed. All the results will also be used for outbreak detection and infectious disease surveillance. Adapted from Didelot et al., 2012.
Hypothetical workflow based on metagenomic sequencing. Schematic representation of the hypothetical workflow for the direct processing of samples from suspected sources of pathogens after adoption of metagenomic sequencing, with an expected timescale that could fit within a single day. Adapted from Didelot et al., 2012
Typical bioinformatic analysis procedure for metagenomic data.
Possibility of identifying and characterising a potential pathogen without the need for a priori knowledge of the causative agent of disease.
One of the biggest challenges when dealing with metagenomic data is the lack of gold standards. This also applies to the bioinformatic analysis required to handle the amount of data produced.
The complexity of the analysis and its major pitfalls, such as the lack of reproducibility and transparency of the analysis methods, are the biggest hindrances.
Evaluate the current impact and applicability of metagenomics in medical microbiology, both in clinical and surveillance and infection prevention settings;
Develop novel methods and metrics to accurately identify and estimate the relative abundance of pathogens of interest through a hybrid approach of read mapping and de novo assembly methods;
Standardise the process of metagenomic analysis, allowing the comparison of results obtained across domains and stakeholders;
Develop computationally efficient and robust frameworks that allow scientists and/or medical experts with limited programming experience to rapidly and easily query the abundance of specific taxa and genes across the samples of interest, obtaining simple and intuitive reports.
Part I
Part II
Part III
Part IV
Detection of a novel mcr-5.4 gene variant in hospital tap water by shotgun metagenomic sequencing*
DEN-IM: Dengue virus genotyping from shotgun and targeted metagenomics*
LMAS: Last Metagenomic Assembler Standing*
Software testing in microbial bioinformatics: a call to action*
Part I
Applying metagenomics in the clinical context
Couto et al., 2018 Sci Rep DOI: https://doi.org/10.1038/s41598-018-31873-w
Fleres et al., 2019 JAC DOI: https://doi.org/10.1093/jac/dkz363
Mendes et al., 2020 Microbial Genomics DOI: https://doi.org/10.1099/mgen.0.000328
Scheme of the bioinformatic analysis of the metagenomics samples. To evaluate and compare the accuracy and reliability of the bioinformatics analyses in providing the closest results to culture and WGS of any cultured isolates, three different pipelines (two commercially available and one freely available) were used (Fig. 1). Different tools were used to perform raw read quality control, filtering and trimming, and reads were mapped against the human genome (hg19) before performing taxonomic classification. Reads mapping to hg19 were removed from the analysis to increase the efficiency of the bioinformatics tools. Typing (MLST), phylogenetic analysis, plasmid analysis, and detection of antimicrobial resistance and virulence genes were performed. To determine the appropriateness of shotgun metagenomics as a predictor of the WGS (chromosome and plasmids), the shotgun metagenomics results were compared with the results of WGS of any bacterial isolates obtained from culturing the sample. Source: Couto et al., 2018.
Lack of reproducibility | Highlights the potential and the limitations of shotgun metagenomics as a diagnostic tool.
Lack of standardisation and proper benchmark | Results are highly dependent on the tools, and especially the database, chosen for the analysis.
Comparative analysis of the genetic environment of mcr-5 between the reference plasmid pSE13-SA01718 (accession no. KY807921.1) and the annotated hybrid metagenome contig (accession no. MK965519). The contig carrying the mcr-5.4 gene consists of the following putative gene products: 7-carboxy-7-deazaguanine synthase (queE), 7-cyano-7-deazaguanine synthase (queC), glycine cleavage system transcriptional antiactivator GcvR (gcvR), thiol peroxidase (tpx), sulphurtransferase TusA family protein (sirA), hypothetical protein (hp), truncated MFS-type transporter (Δmsf), lipid A phosphoethanolamine transferase (mcr-5.4), ChrB domain protein (chrB), transposon resolvase (tnpR) and truncated transposon transposase (ΔtnpA). Areas with 98% identity between sequences are represented in light grey. Arrows indicate the position and direction of the genes. The transposon Tn6452 sequence in the reference plasmid pSE13-SA01718 is bounded by inverted repeats: IRL and IRR. Source: Fleres et al., 2019.
Sequencing pitfalls | Even when hybrid assembly is employed, complete genomic sequences, particularly chimeric ones such as plasmids, are not fully recovered.
Lack of standardisation and proper benchmark | No standard procedure was followed; the bioinformatic analysis was performed ad hoc.
DENV-3 - genotypes I-V
https://doi.org/10.1371/journal.pntd.0001876.g002
https://doi.org/10.1371/journal.pntd.0000757
Sequential infection increases the risk of a severe form of the infection - dengue hemorrhagic fever.
Requirements
A solution
https://github.com/B-UMMI/DEN-IM
DENV Identification
In Silico Typing:
Reproducibility is key | Leveraging the use of container software with workflow managers enables reproducible and collaborative research.
Part II
Impact of de novo assemblers in metagenomics
Mendes et al., 2023 GigaScience DOI: https://doi.org/10.1093/gigascience/giac122
Approaches to de novo genome assembly. In Overlap, Layout, Consensus assembly, (1) overlaps are found between reads and an overlap graph is constructed (edges indicate overlapping reads). (2) Reads are laid out into contigs based on the overlaps (lines indicate overlapping portions). (3) The most likely sequence is chosen to construct the consensus sequence. In De Bruijn graph assembly, (1) reads are decomposed into k-mers of a determined size by sliding a window of size k (here, k=3) across the reads. (2) The k-mers become vertices in the De Bruijn graph, with edges connecting overlapping k-mers. Polymorphisms (red) form branches in the graph. A count is kept of how many times each k-mer is seen, shown here as the numbers above the k-mers. (3) Contigs are built by walking the graph from the edge nodes. A variety of heuristics handle branches in the graph; for example, low-coverage paths, as shown here, may be ignored. Adapted from Ayling et al., 2020.
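The De Bruijn steps in the caption above can be illustrated with a minimal sketch. It only covers k-mer decomposition, counting, and linking of consecutive k-mers; the function names and toy reads are assumptions made for illustration, and no real assembler is this simple.

```python
from collections import Counter, defaultdict


def kmers(read, k):
    """Slide a window of size k across the read and return every k-mer."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Return (counts, edges): k-mer counts and k-mer -> successor k-mers."""
    counts = Counter()
    edges = defaultdict(set)
    for read in reads:
        kms = kmers(read, k)
        counts.update(kms)
        # Consecutive k-mers in a read overlap by k-1 bases, so link them.
        for left, right in zip(kms, kms[1:]):
            edges[left].add(right)
    return counts, dict(edges)


# Toy reads (hypothetical); the third read carries a variant after "ATG".
reads = ["ATGGC", "TGGCA", "ATGCC"]
counts, edges = build_de_bruijn(reads, k=3)

# A vertex with more than one successor marks a branch in the graph.
branches = {kmer for kmer, nxt in edges.items() if len(nxt) > 1}
```

In this toy input the only branch is at "ATG", which is followed by either "TGG" or "TGC". A heuristic of the kind the caption mentions could use `counts` to drop the lower-coverage path.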
https://github.com/B-UMMI/LMAS
https://lmas.readthedocs.io/
The input data is assembled in parallel by the set of genomic and metagenomic de novo assemblers in LMAS.
The resulting assembled sequences are processed and assembly quality metrics are computed.
The global and per-reference metrics are grouped in the interactive LMAS report for exploration.
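As an illustration of what an assembly quality metric looks like, the sketch below computes N50, a widely used contiguity metric, from a list of contig lengths. The slides do not list which metrics LMAS reports, so the choice of N50 and the `summarise` helper are assumptions for illustration only.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def summarise(contig_lengths):
    """Collect a few global metrics for one assembly (hypothetical helper)."""
    return {
        "contigs": len(contig_lengths),
        "total_bp": sum(contig_lengths),
        "largest": max(contig_lengths, default=0),
        "n50": n50(contig_lengths),
    }


# Toy contig lengths in base pairs (hypothetical values).
report = summarise([100, 80, 60, 40, 20])
```

Computing the same summary for each assembler on the same input is what makes a side-by-side comparison possible.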
| Sample | Distribution | Error Model | Read Pairs (M) |
|---|---|---|---|
| ENN | Even | None | 8.6 |
| EHS | Even | Illumina HiSeq | 8.6 |
| ERR2984773 | Even | Real Sample | 8.6 |
| LNN | Log | None | 47.5 |
| LHS | Log | Illumina HiSeq | 47.5 |
| ERR2935805 | Log | Real Sample | 47.5 |
Part III
Griffiths et al., 2022 GigaScience DOI: https://doi.org/10.1093/gigascience/giac003
Challenges of data availability in metagenomics and beyond
Stakeholders need results in a single consistent format | This allows not only the comparison of tools and databases but the validation of results through multiple detection algorithms.
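A minimal sketch of the idea, assuming two hypothetical tools that report the same detection under different column names. The tool names, field names and values below are invented for illustration and do not reproduce any published specification.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Detection:
    """One gene detection in a shared, tool-agnostic shape (hypothetical fields)."""
    sample: str
    gene: str
    identity: float  # percent identity, 0-100
    tool: str


# Hypothetical mapping from each tool's column names to the shared field names.
FIELD_MAPS = {
    "tool_a": {"sample_id": "sample", "gene_symbol": "gene", "pct_identity": "identity"},
    "tool_b": {"input": "sample", "hit": "gene", "id%": "identity"},
}


def harmonise(record, tool):
    """Rename one raw record's fields to the shared names and tag its origin."""
    mapping = FIELD_MAPS[tool]
    fields = {shared: record[raw] for raw, shared in mapping.items()}
    fields["identity"] = float(fields["identity"])
    return Detection(tool=tool, **fields)


# Toy records (hypothetical values) as each tool might emit them.
a = harmonise({"sample_id": "S1", "gene_symbol": "mcr-5.4", "pct_identity": "99.8"}, "tool_a")
b = harmonise({"input": "S1", "hit": "mcr-5.4", "id%": 99.8}, "tool_b")

# Once harmonised, detections from different tools can be compared directly.
agree = (a.sample, a.gene) == (b.sample, b.gene)
```

Keeping the originating tool in each record means that agreement between detection algorithms can be checked without losing provenance.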
(Meta)data must be FAIR to be usable | The FAIR Data Principles: Findable, Accessible, Interoperable, and Reusable.
Part IV
van der Putten et al., 2022 Microbial Genomics DOI: https://doi.org/10.1099/mgen.0.000790
Crowdsourcing for software robustness in metagenomics and beyond
There is a critical lack of reliability and transparency | Software testing not only ensures that a tool works as expected, but can also be leveraged as a proxy for its workability.
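A minimal sketch of what such a test can look like, using a small hypothetical helper rather than any tool discussed in these slides. The function and the test names are assumptions for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def test_known_sequence():
    # A hand-checked case: complement of ATGC is TACG, reversed is GCAT.
    assert reverse_complement("ATGC") == "GCAT"


def test_round_trip():
    # Applying the operation twice must give back the input.
    seq = "ATGGCCTTA"
    assert reverse_complement(reverse_complement(seq)) == seq


def test_empty():
    # Edge case: an empty sequence stays empty.
    assert reverse_complement("") == ""


# Run the checks directly; a test runner such as pytest would discover them too.
for check in (test_known_sequence, test_round_trip, test_empty):
    check()
```

Tests like these, shipped with the tool, let a user confirm that an installation behaves as documented before trusting its results.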
Lack of reproducibility and standardisation is a major hindrance in metagenomics for clinical microbiology.
Even with the use of long reads, complex genomic regions, such as chimeric plasmids, are a challenge to retrieve.
Leveraging the use of container software with workflow managers represents the current best standard for reproducible research.
Intuitive and responsive reports enable collaborative research and empower users across domains.
Benchmarking of basal tools, such as de novo assemblers, highlights the need for proper software assessment.
Standard specifications, such as for AMR or SARS-CoV-2, are required for the comparison of results across stakeholders in different domains.
Crowdsourcing for better standards represents a viable way to adopt better practices for the use of metagenomics in clinical microbiology.
Thank you for your attention
COVID/BD/152583/2022
SFRH/BD/129483/2017
By Inês Mendes