Towards Accreditation in Metagenomics for Clinical Microbiology

Thesis Comittee

Programa de Doutoramento do Centro Académico de Medicina de Lisboa

Inês Mendes

15th of September, 2020

Metagenomics

Random "shotgun" sequencing of microbial DNA, without selecting a particular gene.

Promising methodology for obtaining fast results for the identification of pathogens and their virulence and antimicrobial resistance properties without the need for culture.

| The motivation

Who is there? - Taxonomic identification

What are they doing? - Virulome, Resistome, Functional Annotation

Who is doing what? - Functional Assignment

Metagenomics

| The motivation

Development of novel methods and metrics to accurately identify and estimate relative abundance of viral and baterial pathogens.

Standardizing the process of metagenomic analysis.

Development of a computational efficient and robust framework.

Metagenomics

| The motivation

Main Goals

Fundação para a Ciência e a Tecnologia (SFRH/BD/129483/2017)

Dr. João A. Carriço - Instituto de Medicina Molecular, MRamirez Lab

Prof. Dr. John W. A. Rossen - University Medical Center Groninen, Department of Medical Microbiology

Metagenomics

| The motivation

Funding

Promoters & Host Institutions

| Analysis

Metagenomics

Precision, Sensibility & Performance

(Culture + Maldi-TOF)

11 Metagenomic Samples - Fluid & Tissue

| Analysis

Metagenomics

Highlights the potential and the limitations of shotgun metagenomics as a diagnostic tool - Lack of reproducibility

Results are highly dependent on the tools, and specially database, chosen for the analysis - Lack of standardization and propper benchmark.

| Analysis

Metagenomics

Major issues

Reproducibility in Computational Biology

Clinical Shotgun Metagenomic Analysis

Programa de Doutoramento do Centro Académico de Medicina de Lisboa

Inês Mendes

15th of September, 2020

Reproducibility

| The motivation

Do I know what is happening?
Is it reproducible?
Is it shareable?

Black

Box

Transparent

Box

Commercial/Freeware
You get what it gives you
Ready to use
Stealth change
Standalone

Freeware
You can "tailor"
"Major" headache
Visible change
Dependencies

The needs:

Analyze a large amount of sequence data routinely
Some computationally intensive steps
Constantly updating/adding software

Reproducibility

| The motivation

Writing of pipelines in python/perl/shell scripts circa 2000, colorized.

Custom ad-hoc scripts
Difficult to parallelise
Difficult to install/run
Hard to deploy in multiple environments
What's workflow managers?!
What's docker?!

Workflows in the Paleolithic era:

Reproducibility

| The motivation

The game changing combination of workflow managers and containers:

Portability
Reproducible
Scalability
Multi-scale containerization
Native cloud support

Reproducibility

| The motivation

Workflows in the Modern era:

Monolithic pipelines

Need to change often

Siloed tool containers

Don't do much by themselves

Reproducibility

| The motivation

Reproducibility

|

Workflow based development

Component based development

Components are modular pieces with some basic rules:

Component A

- Input/Output

- Parameters

- Resources

Component B

- Input/Output

- Parameters

- Resources

Reproducibility

|

With this framework, building workflows becomes simple:

flowcraft build -t 'trimmomatic fastqc spades pilon' -o my_nextflow_pipeline

Results in the following workflow DAG (direct acyclic graph)

It's easy to get experimental:

flowcraft build -t 'trimmomatic fastqc skesa pilon' -o my_nextflow_pipeline

Switch spades for skesa

Reproducibility

|

Forks

Connect one component to multiple

Secondary channels

Connect non-adjacent components

Extra inputs

Inject user input data anywhere

Recipes

Curated and pre-assembled pipelines for specific needs

Reproducibility

|

Reproducibility

|

Reproducibility

|

Multiple Raw Input Types

Not limited to paired-end FastQ or Fasta

Dynamic Input in Components

One component, multiple inputs

Expand Building Features

New merge operators

Component that receives multiple input types
Component that merges multiple inputs from different branches

Reproducibility

|

Programa de Doutoramento do Centro Académico de Medicina de Lisboa

Inês Mendes

15th of September, 2020

dengue virus genotyping from amplicon and shotgun metagenomic sequencing

| The motivation

doi:10.1038/nrmicro1690

Sequential infection increases the risk of a severe form of the infection - dengue hemorrhagic fever.

| The motivation

Dengue hemorrhagic fever:

The virus presents 4 different serotypes, sub-classified in many genotypes

After "primary" infection with one serotype, "secondary" infections by one or more of the other serotypes can precipitate ‘antibody dependant enhancement’ (ADE)

Infection with one type gives little immune protection against the other types

| The motivation

DENV-1 - genotypes I-V

DENV-2 - genotypes Asian I, Asian II, Cosmopolitan, American, Asian/American & Sylvatic

DENV-3 - genotypes I-V

DENV-4 - genotypes I - III & Sylvatic

https://doi.org/10.1371/journal.pntd.0001876.g002

https://doi.org/10.1371/journal.pntd.0000757

Thailand

Viet Nam

| The motivation

DENV: (+)ssRNA (~11Kb; 1 ORF)

The single polyprotein encodes:

Structural Proteins:
- C – capsid
- prM – pre-membrane
- M - membrane
- E - envelope
Non-Structural Proteins:
- NS1, NS2A, NS2B, NS3, NS4A, NS4B and NS5

| The motivation

Empower the use of HTS to monitor the dissemination of the disease

RNA Extraction

PCR
Amplification

HTS Sequencing

➔

How do the different genotypes model transmission and infection?

| The motivation

| The workflow

Ready to use but customizable
Scalable
Reproducible
Stand-alone (confidential data)
Easy to explore and share results

Requirements

A solution

| The workflow

DENV Identification

3830 complete DENV genomes from the NIAID Virus Pathogen Database and Analysis Resource (ViPR)
complete genome sequence
human & mosquito host (exception of DENV-1 III, monkey)
collection year (1950-2019)

In Silico Typing:

161 representative sequences of all sero and genotypes.

| The workflow

DENV-1

DENV-2

a) Envelope Region b) whole genome sequence

| DENV Classification

a) Envelope Region b) whole genome sequence

DENV-3

DENV-4

| DENV Classification

Shotgun Metagenomics dataset:

9 plasma samples
13 serum samples
1 spiked sample with the 4 serotypes
Positive and Negative controls

nextflow run DEN-IM.nf -profile slurm_shifter --fastq="fastq/*_{1,2}.*"

HPC with 300 Cores/Processing Power and 3 TB RAM
On average, each sample took 7 minutes to analyse. A total of 75 CPU hours were used to analyse the 25 samples, with a total of 17Gb in size. This analysis resulted in 69Gb of data.

| Use case

Git, Nextflow (java) and a container engine (Docker, singularity, shifter...).

apt-get install git

curl -s https://get.nextflow.io | bash

apt-install docker-ce

Clone (or run remotely)

git clone https://github.com/B-UMMI/DEN-IM.git

https://github.com/B-UMMI/DEN-IM/wiki

| Installation

(Meta)Genomic Assembly Benchmark

de novo Assembly of short-read data

Programa de Doutoramento do Centro Académico de Medicina de Lisboa

Inês Mendes

15th of September, 2020

The assembly methods provide longer sequences that are more informative than shorter sequencing data and can provide a more complete picture of the microbial community in a given sample.

Benchmark

| (Meta)Genomic assembly

Benchmark

| (Meta)Genomic assembly

Benchmark

| (Meta)Genomic assembly

Benchmark

| Assembly workflow

Reference Dataset (Complete Bacterial Genomes)

In silico mock sample (even)

In silico mock sample (log)

Zymos standard (even)

Zymos standard (log)

3.7 M read pairs

8.8 M read pairs

47.8 M read pairs

Assembly Workflow

Assembly Quality Assessment

Benchmark

| Assembly evaluation

Reference Dataset (Triple)

Assembly file (fasta)

Filter min contig size (1000 bp)

Mapping with Minimpa2

Read Data

PAF file (tab)

Benchmark

| Assembly evaluation

General Assembly & Global Mapping Statistics

Mapping Statistics & Metrics per Reference

Number of contigs
Number of basepairs
N50
Max contig size

Number of contigs (%)
Number of basepairs (%)
N50

% Mapped contigs
% Mapped basepairs
% Mapped reads

Original

Filtered (1000bp)

Contiguity
Identity (average)
Identity (min)
Breadth of coverage

C90 & C95
Number of aligned contigs
Phred quality score per contig

Benchmark

| Assembly evaluation

C90 & C95

Number of contigs to cover at least 90% and 95% of the reference genome, respectively.

Contig Phread Quality Score

E = 1 - Identity

Phred(E) = \begin{cases} -log(E) * 10 & \quad \text{if } E \text{< 0}\\ 60 & \quad \text{if } E \text{= 0} \end{cases}

Contiguity

Longest percentage of the reference sequence assembled in a single contig.

Benchmark

| Performace

Benchmark

| Mock sample (even)

Benchmark

| Mock sample (even)

Benchmark

| Mock sample (even)

Benchmark

| Mock sample (even)

Bradth of coverage and number of contigs per Reference for each Assembler

Benchmark

| Mock sample (even)

Benchmark

| Mock sample (even)

Contig Phred Quality Score for GATBMiniaPipeline's Pseudomonas aerugiona assemby

Contig Size

Phred Score

Phred(E) = \begin{cases} -log(E) * 10 & \quad \text{if } E \text{< 0}\\ 60 & \quad \text{if } E \text{= 0} \end{cases}

Benchmark

| Mock sample (even)

Contig Phred Quality Score per Reference for each Assembler

Data Structures & the SARS-CoV-2 Contextual Data Specification

Programa de Doutoramento do Centro Académico de Medicina de Lisboa

Inês Mendes

15th of September, 2020

PHA4GE

| Data structures workgroup

Standardized data structures and interchangable formats - critical to the development of an open software ecosystem.

Focus on the development, adaptation and standardization of data models for microbial sequence data, contextual metadata, results and workflow metrics to improve the transparency, interoperability and reproducibility of public health sequencing workflows.

Main Goal

PHA4GE

| SARS-CoV-2 Data Spec

SARS-CoV-2 contextual data specification that incorporates publicly available community standards, as well as additional fields and guidance appropriate for public health surveillance and analyses.

Collection template and controlled vocabulary pick lists
Reference guide
Collection template SOP
Data submission protocol (GISAID, NCBI, EMBL-EBI)
JSON structure of PHA4GE specification

Resources

PHA4GE

| SARS-CoV-2 Data Spec

Future Work

Continue benchmark analysis for the mock semple (log distributed) and the real samples (Zymos community standards log and evenly distributed).

PHA4GE - Harmonization of tool outputs for the detection of antimicrobial registance genes (https://github.com/pha4ge/hAMRonization).

Web service for the interactive vizualization of Kraken's taxonomic composition reports.

Nothing else Meta - Reference indenpendent filtration of human reads from (meta)genomic datasets.

Special thanks to Diogo Silva, Bruno Gonçalves, Tiago Jesus, Pedro Vila-Cerqueria, Rafael Maria Mamede, João Carriço, John Rossen and Mário Ramirez.

Thank you for your attention

Towards Accreditation in Metagenomics for Clinical Microbiology

By Inês Mendes

Towards Accreditation in Metagenomics for Clinical Microbiology

CAML PhD Program Thesis Comittee - 15 September 2020

Inês Mendes

Bioinformatics PhD student.

ines_cim