Methods and applications of graph genomes to pathogens and public health

Michael B. Hall
Email: michael.hall@ebi.ac.uk


Second Year Report
Thesis Advisory Committee:
Zamin Iqbal - EMBL-EBI
John Marioni (Chair) - EMBL-EBI
Georg Zeller - EMBL Heidelberg
Estée Török - University of Cambridge

Presentation Overview

1. Executive Summary

2. Overview of Previous Activities

3. Motivation and Background

4. Thesis Plan

5. Publication Strategy

1. Executive Summary

Driving factors

Genomics is now ubiquitous in clinical and public health microbiology. However, many significant challenges remain:

• Bacterial genomes harbour vast amounts of diversity, even within a species, and traditional reference-based approaches are problematic.

• Much of the variation in bacteria is fundamentally inaccessible to short reads.

• Long Nanopore reads are noisy, and SNP calling with this data is not adequately benchmarked or standardised.

• Since Mycobacterium tuberculosis infects so many people, there is potential for considerable impact for clinical applications.

• There is also much to be gained from a high-quality pan-genome of M. tuberculosis as well as a detailed map of its enigmatic pe/ppe gene repertoire.

Aims of this PhD

1. Develop algorithms and software for variant discovery using bacterial genome graphs, building on work of a previous student in the lab.

2. Benchmark Nanopore versus Illumina SNP calling, show our algorithms meet the needs of clinical and public health users, validate and publish.

3. Improve upon current whole-genome sequencing-based drug resistance prediction for M. tuberculosis using genome graphs.

4. Curate a high-quality reference pan-genome for M. tuberculosis that includes a detailed map of the pe/ppe genes.

To do: stacked-bar overview of what is in each chapter and what I have done so far.

3. Motivation and Background

  • 10.4 million (estimated) cases in 2016

  • 400,000 of those MDR

  • #1 cause of death by a single pathogen

  • Standard of care requires phenotypic testing (DST) of the infecting organism (M. tuberculosis) against the four first-line drugs

  • M. tuberculosis is slow-growing - gold-standard DST takes ~2 months

Potential for considerable impact in clinical applications

Tuberculosis

Source: WHO
  • WGS-based diagnostics offer faster solution

  • Two-week "liquid culture" gives similar results

  • For the four first-line drugs, a study by the CRyPTIC consortium demonstrated that DST is not required if genotype predicts susceptibility

However, as the genetic basis for drug resistance is not entirely understood, there is still a sensitivity gap that differs drug-by-drug.

Tuberculosis

Source: WHO
  • Public health requirements for TB diagnostics are resistance prediction, species identification, and clustering of genomes.

  • Requirements currently met with Illumina

  • Reasons to consider switching to Nanopore: cost, location of burden, speed

  • Patient to result in 12.5 hours (2017) with Nanopore - and yield has improved since then

TB and Public Health

Nanopore sequencing

  • Read identity 87-94%

  • Consensus identity up to 99.94%

  • Variant calling in its infancy (medaka and nanopolish), but no extensive benchmark has been completed

  • Median read length in the tens of kilobases

Bacterial genomes are incredibly diverse

Pan-genome

In an "open" pan-genome, such as Salmonella enterica, two individuals could share as little as 16% of their genes

The single-reference problem

Genome Graphs

  • Uses a population reference graph (PRG) instead of a single, linear reference

  • PRG represents variation seen within a population

  • Two forms - local and pan-PRG

Genome Graphs + Nanopore

Rachel Colquhoun

Pandora - pan-genome inference and genotyping with long-noisy or short-accurate reads

De novo variant calling not included

4.1 Chapter 1

Variant discovery in genome graphs

Develop algorithms and software for variant discovery using bacterial genome graphs, building on work of a previous student in the lab.

What to use as a reference for a species with a pan-genome?

Reference sequence for a specific strain?

Works provided all of your samples are of the same strain

Use the species' core-genome?

Miss all non-core information, which can be a lot in an open pan-genome

Pandora - map

Infer consensus sequence for a single sample and genotype with respect to this consensus sequence

De novo variant calling not included

Pandora - compare

Rachel Colquhoun

Infer consensus sequence for a collection of samples and genotype with respect to this consensus sequence

Allows genotyping across the entire genome

  • The work in my first chapter outlines a method for removing this limitation within pandora and provides an analysis of the gain in recall and precision by incorporating de novo variant discovery into the pandora workflow.

De novo variant in a genome graph

Finding candidate regions

[Figure: per-base coverage plotted against position in gene (bp)]
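The slide only shows a coverage plot, so the following is a minimal sketch of one way to find candidate regions, assuming they are runs of positions where per-base coverage falls below a threshold. The function name and the min_cov and min_len parameters are hypothetical and are not taken from pandora.

```python
from typing import List, Tuple


def candidate_regions(
    coverage: List[int], min_cov: int, min_len: int = 1
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where coverage < min_cov.

    Assumption: a candidate region is a run of at least min_len
    consecutive low-coverage positions.
    """
    regions = []
    start = None
    for pos, cov in enumerate(coverage):
        if cov < min_cov:
            if start is None:
                start = pos  # open a new low-coverage run
        elif start is not None:
            if pos - start >= min_len:
                regions.append((start, pos))
            start = None
    # Close a run that extends to the end of the gene.
    if start is not None and len(coverage) - start >= min_len:
        regions.append((start, len(coverage)))
    return regions


# Toy per-base coverage for a 10 bp gene (made-up numbers).
coverage = [30, 31, 29, 4, 2, 3, 28, 30, 1, 30]
print(candidate_regions(coverage, min_cov=10, min_len=2))
```

With min_len=2 the single low-coverage base at position 8 is ignored, which is one simple way to avoid chasing isolated dips.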

De novo variant in a genome graph

Enumerating paths through candidate regions
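The slide does not say how paths are enumerated. As an illustration only, the sketch below assumes a de Bruijn graph built from the reads covering a candidate region, with a depth-first search between two anchor k-mers flanking the region. The function names, the anchors, and the max_len bound are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_graph(reads: List[str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the k-mers that follow it in at least one read."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def enumerate_paths(
    graph: Dict[str, Set[str]], start: str, end: str, max_len: int
) -> List[str]:
    """Depth-first search for sequences from start k-mer to end k-mer.

    Paths longer than max_len bases are abandoned, which also stops
    the search from looping forever on cycles.
    """
    paths = []
    stack = [(start, start)]  # (current k-mer, sequence spelled so far)
    while stack:
        kmer, seq = stack.pop()
        if kmer == end:
            paths.append(seq)
            continue
        if len(seq) >= max_len:
            continue
        for nxt in graph.get(kmer, ()):
            stack.append((nxt, seq + nxt[-1]))
    return sorted(paths)


# Two toy reads that differ by one base between the anchors ACG and GCA.
reads = ["ACGTTGCA", "ACGATGCA"]
graph = build_graph(reads, k=3)
print(enumerate_paths(graph, start="ACG", end="GCA", max_len=10))
```

Each returned sequence is a candidate allele for the region. A real implementation would also need to filter paths by read support, which is omitted here.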

Evaluation

Simulated data

Aim to show that the addition of de novo discovery allows pandora to improve its probability of variant detection (recall).

Simulated data

  • 100 local PRGs (genes)

  • Random path from each joined together into a genome

  • Introduce variants at a given rate

  • Simulate Nanopore reads from mutated genome

  • Run pandora with reads from mutated genome

  • Assess how many introduced variants were found
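The mutation and assessment steps above can be sketched as follows. This is a simplified illustration that introduces SNPs only, at a per-base rate, and records them as a truth set. The function names and the seed parameter are hypothetical, and read simulation and the pandora run are not shown.

```python
import random
from typing import Dict, Tuple


def introduce_snps(
    genome: str, rate: float, seed: int = 0
) -> Tuple[str, Dict[int, str]]:
    """Mutate each base to a different base with probability `rate`.

    Returns the mutated genome and a truth map of position -> new base.
    """
    rng = random.Random(seed)
    mutated = list(genome)
    truth: Dict[int, str] = {}
    for pos, base in enumerate(genome):
        if rng.random() < rate:
            alt = rng.choice([b for b in "ACGT" if b != base])
            mutated[pos] = alt
            truth[pos] = alt
    return "".join(mutated), truth


def fraction_recovered(truth: Dict[int, str], called: Dict[int, str]) -> float:
    """Fraction of introduced variants that appear in the calls."""
    if not truth:
        return 0.0
    found = sum(1 for pos, alt in truth.items() if called.get(pos) == alt)
    return found / len(truth)


genome = "ACGT" * 250  # toy 1 kb genome
mutated, truth = introduce_snps(genome, rate=0.01, seed=1)
# Pretend the caller found every introduced variant.
print(fraction_recovered(truth, called=dict(truth)))
```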

Simulated data

Empirical data - four-way comparison

  • The main focus of both pandora and de novo evaluation

  • Use compare routine to show the power of the reference-graph approach

  • 4 E. coli samples from different phylogroups

  • Compare to other variant callers - snippy, medaka, nanopolish - using a variety of references

Empirical data - four-way comparison

  • Align each pair of genomes with nucmer to get differences

  • Construct truth panel from these differences

  • Map truth panel to a panel of probes from pandora VCF

  • Calculate recall and precision for all pairs
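As a rough sketch of the final step, the code below computes precision and recall for every pair of samples. It assumes the truth panel and the calls have already been reduced to comparable sets of variants per pair. The probe-mapping step is not shown, and the sample names and variant tuples are made up.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (contig, position, ref, alt)
Pair = Tuple[str, str]


def precision_recall(truth: Set[Variant], called: Set[Variant]) -> Tuple[float, float]:
    """Precision and recall of `called` against `truth`.

    Convention (an assumption): an empty denominator gives 0.0.
    """
    true_positives = len(truth & called)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def evaluate_pairs(
    samples: Iterable[str],
    truth: Dict[Pair, Set[Variant]],
    called: Dict[Pair, Set[Variant]],
) -> Dict[Pair, Tuple[float, float]]:
    """Evaluate every unordered pair of samples, keyed in sorted order."""
    results = {}
    for pair in combinations(sorted(samples), 2):
        results[pair] = precision_recall(
            truth.get(pair, set()), called.get(pair, set())
        )
    return results


samples = ["s1", "s2", "s3"]  # hypothetical sample names
truth = {
    ("s1", "s2"): {("chr", 10, "A", "T"), ("chr", 20, "G", "C")},
    ("s1", "s3"): {("chr", 10, "A", "T")},
}
called = {
    ("s1", "s2"): {("chr", 10, "A", "T"), ("chr", 30, "C", "G")},
    ("s1", "s3"): {("chr", 10, "A", "T")},
}
for pair, (precision, recall) in evaluate_pairs(samples, truth, called).items():
    print(pair, precision, recall)
```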

Difficulty in evaluating four-way is "truth"

Preliminary Results

Work Completed

  • I wrote the de novo methods in C++ and added them to the pandora code-base: ~850 lines of source code and ~3200 lines of test code.

  • I wrote ~1250 lines of code for the evaluation, along with ~3500 lines of test code.

  • I built a snakemake pipeline of approximately 3500 lines of code to orchestrate the entire evaluation (and simulations).

Outstanding Work

  • Completion of the four-way analysis

  • Direct integration of de novo variants back into PRG

  • 100-way analysis to validate the limit in variant detection using single-reference approaches

4.2 Chapter 2

Genome graph applications for

M. tuberculosis and public health

Benchmark Nanopore versus Illumina SNP calling, show our algorithms meet the needs of clinical and public health users, validate, and publish.

M. tuberculosis public health applications

Requirements from genome sequencing:

  • Species identification

  • Prediction of drug resistance

  • Epidemiological clustering of samples

Focus is on how pandora can be used to improve the last two of these requirements

Want to show that Nanopore sequencing is now capable of performing these tasks

vs.

Data

  • 119 samples from Madagascar (35 PacBio)

  • 83 samples from South Africa (evidence of transmission pairs)

  • 20 samples from Birmingham

Same isolate DNA extraction sequenced on both technologies

Genetic clustering

The first step towards clustering a set of genomes is determining a distance matrix.

  • Feeding aligned genomes into a phylogenetic tree-building tool

  • Counting SNP differences and clustering based on these

Genetic clustering issues

  • Reference-matching site vs. genotyping uncertainty - compare solves this

  • Do we exclude positions where any sample has missing data? The data could be truly missing or just low coverage. Mostly covered in Chapter 1.

  • How to handle heterozygosity (mixed samples)?

Genetic clustering

We define genetic distance to be the sum of genetic discordances, where missing data and heterozygosity do not cause discordance (unless the zygosity does not include the reference allele) and study the clustering this definition generates. We will also look at traditional trees as another method for clustering.
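One possible reading of this definition is sketched below. It assumes each genotype is either missing (None) or a set of allele indices with 0 as the reference allele, and that two calls are discordant only when both are present and share no allele. Under that reading a heterozygous call that includes the reference allele is not discordant with a reference call, while one that excludes it is. The encoding and function names are illustrative assumptions.

```python
from typing import List, Optional, Sequence, Set

Genotype = Optional[Set[int]]  # None = missing, {0} = ref, {1} = alt, {0, 1} = het


def discordant(a: Genotype, b: Genotype) -> bool:
    """Two calls disagree only if both are present and share no allele."""
    if a is None or b is None:
        return False
    return a.isdisjoint(b)


def genetic_distance(s1: Sequence[Genotype], s2: Sequence[Genotype]) -> int:
    """Sum of discordances over sites genotyped in the same order."""
    if len(s1) != len(s2):
        raise ValueError("samples must be genotyped at the same sites")
    return sum(discordant(a, b) for a, b in zip(s1, s2))


def distance_matrix(samples: Sequence[Sequence[Genotype]]) -> List[List[int]]:
    """Symmetric matrix of pairwise genetic distances."""
    n = len(samples)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = genetic_distance(samples[i], samples[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy genotypes at five sites (made-up data).
s1 = [{0}, {1}, None, {0, 1}, {1, 2}]
s2 = [{0}, {0}, {1}, {1}, {0}]
print(genetic_distance(s1, s2))
print(distance_matrix([s1, s2]))
```

Only the second site (alt vs. ref) and the last site (a het call without the reference allele vs. ref) count towards the distance here.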

Issues for multi-sample comparison

How to compare pandora results to other methods which use a single-reference approach?

Baseline variant analysis

Variant truth sets

  • Illumina - clockwork, combining best of samtools and cortex

  • Nanopore - samtools with some filtering and masking

Baseline Illumina/Nanopore concordance, using PacBio as a validation (where we have it)

Analysis

Four PRGs

  • Linear - H37Rv

  • Sparse - H37Rv + 10% frequency variants

  • Dense - Sparse with 1% frequency variants

  • Representative - 2 high-quality genomes from each lineage + 5% frequency variants
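To illustrate how the sparse and dense PRGs relate, here is a small sketch that assumes "X% frequency variants" means variants observed at a population frequency of at least X%. The variant identifiers and frequencies are made up, and PRG construction itself is not shown.

```python
from typing import Dict, List


def select_variants(frequencies: Dict[str, float], min_freq: float) -> List[str]:
    """Return the variants whose population frequency is at least min_freq."""
    return sorted(v for v, f in frequencies.items() if f >= min_freq)


# Hypothetical variant identifiers and population frequencies.
frequencies = {
    "var_a": 0.40,
    "var_b": 0.12,
    "var_c": 0.06,
    "var_d": 0.02,
    "var_e": 0.004,
}

sparse = select_variants(frequencies, min_freq=0.10)  # H37Rv + 10% variants
dense = select_variants(frequencies, min_freq=0.01)   # Sparse + 1% variants
print(sparse)
print(dense)
```

Under this reading every variant in the sparse PRG is also in the dense PRG, so the two differ only in how much rarer variation is included.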

Analysis

For each PRG

  • Call SNPs and indels with pandora

  • Compare to baseline calls

  • Report concordance rate

  • How does the complexity of the PRG affect concordance and computational cost?

Analysis - clustering

  • Calculate pairwise SNP distance from truth-set

  • Calculate pairwise SNP distance from pandora with "best" PRG

  • Main figure will be dot plot of the two distance matrices - hoping for linear relationship
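A simple numerical companion to the dot plot is the Pearson correlation between the two sets of pairwise distances. The sketch below assumes both matrices are symmetric with samples in the same order. The matrices shown are made-up numbers, not results.

```python
import math
from typing import List, Sequence


def upper_triangle(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Flatten the entries above the diagonal, one per sample pair."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Toy pairwise SNP distance matrices for three samples.
truth_dist = [[0, 2, 10], [2, 0, 9], [10, 9, 0]]
pandora_dist = [[0, 4, 20], [4, 0, 18], [20, 18, 0]]

r = pearson(upper_triangle(truth_dist), upper_triangle(pandora_dist))
print(round(r, 6))
```

A correlation near 1 indicates a linear relationship but not necessarily equal distances, so the slope of the dot plot still needs inspecting.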

4.3 Chapter 3

Prediction of drug resistance in

M. tuberculosis using genome graphs and Nanopore sequencing

Improve upon current whole-genome sequencing-based drug resistance prediction for M. tuberculosis using genome graphs.

Mykrobe

Uses a panel of resistance markers to predict drug resistance from WGS data

The predictive power of Mykrobe is likely to expand during this PhD due to the CRyPTIC consortium

CRyPTIC

Comprehensive Resistance Prediction for Tuberculosis: an International Consortium

  • Perform DST (14 drugs) and WGS on 40,000 global M. tuberculosis samples (many MDR)

  • Combine with WGS data from another 60,000 samples

  • The aim is to improve genotypic resistance prediction by expanding our catalogue of resistance mutations.

Chapter aim

Aims based on the assumption that a large part of the work by CRyPTIC will be available.

 

Given the collection of SNPs and indels identified as being necessary for resistance to the 14 major drugs tested, we want to show that we can detect them as well with Nanopore data as we can with Illumina.

Limitations of existing methods

  • Only two support Nanopore - Mykrobe and TBProfiler

  • Methodology - the same panel produces different results between tools

  • Small sample sizes (n<6) used to validate Nanopore

Limitations of existing methods

  • TBProfiler uses pileup approach - poor indel power. Indels are important for resistance in some genes

  • Mykrobe uses k-mer mapping - requires high coverage. K-mers considered in isolation

  • Both only genotype wrt known variants

  • CRyPTIC have shown flagging unknown mutations can lead to specificity and sensitivity acceptable for clinical usage (used by PHE)

Solutions

Pandora can use smaller k-mer size than Mykrobe as it takes context into account. Therefore it theoretically requires less coverage.

Pandora can call novel variants (Chapter 1)

Drug susceptibility prediction for M. tuberculosis using pandora

  • Produce gene-succinct PRG of variants known to cause resistance/susceptibility

  • Write a software program that takes pandora output and either produces resistance predictions or flags the sample for phenotyping

  • Validate on data from Chapter 2 against Mykrobe for Illumina and Nanopore
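The prediction step could look something like the sketch below. It assumes pandora output has already been reduced to (gene, amino-acid change) pairs for non-synonymous variants. The two-entry panel is a tiny illustration, not the Mykrobe or CRyPTIC catalogue, and the R/S/U labels and function name are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Variant = Tuple[str, str]  # (gene, amino-acid change)

# Tiny illustrative panel of well-known resistance mutations.
PANEL: Dict[Variant, str] = {
    ("rpoB", "S450L"): "rifampicin",
    ("katG", "S315T"): "isoniazid",
}

# Genes in which an unrecognised variant triggers a flag (assumption).
GENE_TO_DRUG: Dict[str, str] = {"rpoB": "rifampicin", "katG": "isoniazid"}


def predict(variants: Iterable[Variant]) -> Dict[str, str]:
    """Return a call per drug.

    'R' = known resistance variant found
    'U' = unknown variant in a resistance gene, flag for phenotyping
    'S' = no variant found in the associated gene
    """
    calls = {drug: "S" for drug in GENE_TO_DRUG.values()}
    for variant in variants:
        gene, _ = variant
        if variant in PANEL:
            calls[PANEL[variant]] = "R"
        elif gene in GENE_TO_DRUG and calls[GENE_TO_DRUG[gene]] != "R":
            calls[GENE_TO_DRUG[gene]] = "U"
    return calls


# One known mutation plus a placeholder for a novel katG variant.
sample = [("rpoB", "S450L"), ("katG", "hypothetical_novel")]
print(predict(sample))
```

A known resistance variant always takes priority over an unknown one in the same gene, so a sample is only flagged when nothing in the panel explains it.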

4.4 Chapter 4

Construction of a M. tuberculosis reference pan-genome

Curate a high-quality reference pan-genome for M. tuberculosis that includes a detailed map of the pe/ppe genes.

Reasons for a M. tuberculosis pan-genome

  • Closed pan-genome

  • Some genes not present in H37Rv

  • ~10% of the genome consists of pe/ppe genes

The enigmatic pe/ppe genes

  • High GC-content

  • Implicated in immune evasion and virulence

  • A disproportionately large amount of genetic diversity

  • Nucleotide diversity ~2-fold higher than the rest of the genome

  • Sufficiently similar that short reads fail to map

  • Frequently masked out of analyses

Chapter aims

The ability to accurately map sequencing reads to these genes would likely improve our ability to perform variant calling in M. tuberculosis and therefore better determine how isolates relate to each other.

Build a high-quality pan-genome for M. tuberculosis, to allow variant discovery in all genes - ideally including the pe/ppe genes.

Assembly and multiple sequence alignment of high-quality M. tuberculosis genomes

  • Assemble highest quality genomes from Chapter 2 plus data from outside this thesis

  • Assemblies will act as a "scaffold" for the pan-genome along with CRyPTIC variants

  • Large-scale multiple sequence alignment of assemblies to assess genome stability

  • Divide the genome into discrete pieces

A genome graph map of pe/ppe genes

  • If pe/ppe genes arose via gene conversion, short reads are likely to multi-map

  • With high-quality short+long read assemblies, we hope to improve current resolution and allow more accurate mapping

Produce a collection of high-quality pe/ppe PRGs with information about what read length will provide reliable mapping, and whether Illumina data can be reliably mapped to them.

Analysis

Re-analyse data from Chapter 2 and see if we are better able to cluster samples with this new pan-genome with pe/ppe map

Assess variation in pe/ppe genes across 10,000 samples from CRyPTIC

5. Publication Strategy

  • The paper covering pandora and the work in Chapter 1 is currently in preparation - titled Nucleotide-resolution bacterial pan-genomics with reference graphs.

  • Rachel Colquhoun will be the first author and I will be the second author.

  • My contributions to this paper include the addition of de novo variant discovery and a large part of the evaluation of pandora.

  • We aim to submit the paper by the end of 2019.

Chapters 2 and 3

  • Combined into a single paper

  • I will be the first author

  • Aim to have work completed and manuscript submitted in the second quarter of 2020

Chapter 4

  • I will be the first author

  • Hard to estimate date for publication yet (hopefully before I finish)

2. Overview of Previous Activities

Introduction to Container Computing with Singularity

  • EMBL Heidelberg in February 2019

  • Introduce EMBL scientists to Singularity containers
  • Co-taught with Josep Moscardó and organised by EMBL BioIT initiative

  • I taught and created material for making and running containers, how to share and store them, and also how to integrate into common workflow management systems.

Primers for predocs: Using bash more efficiently

  • EMBL-EBI in January 2019

  • Teach EMBL-EBI predocs tips and tricks for using the bash shell more efficiently
  • Handy aliases I regularly use for working with sequencing data, a brief introduction to containers, searching their command history quickly, and advanced methods for finding files...

Website for the EMBL PhD symposium

  • Most of 2018

  • Set up and maintain the website for the symposium

Conferences

  • London Calling - London, May 2019

  • Applied Bioinformatics and Public Health Microbiology - Hinxton, June 2019
  • EMBL Lab Day - Heidelberg, July 2019

TAC2

By Michael Hall

Slide deck for my second-year presentation to my thesis advisory committee.