Classifying EFO/MONDO diseases as low, medium, or high precision using nxontology-ml

Related Sciences Data Team

Mondo Workshop / Outreach Call

2023-09-22

slides released under CC BY 4.0

slides.com/dhimmel/efo-disease-precision

Prepared at nxontology-ml/issues/13

nxontology software suite

nxontology
NetworkX-based Python library for representing ontologies. Uses fastobo, pronto, pygraphviz.
nxontology-data
Making ontologies accessible as simple JSON files (even MeSH). Uses rdflib, SPARQL, etc.
nxontology-ml
Machine learning to classify ontology nodes.

Motivation

Image posted to the obo-community Slack by Philip Strömert, generated with the Midjourney prompt:

We cannot interpret our research data anymore because we did not annotate it with ontologies

Really easy access to popular biomedical ontologies based on JSON files that can be read into Python's networkx.

disease precision classification

EFO OTAR Slim
- OTAR = Open Targets, see efo/issues/926
- Derivative of EFO focused on clinical interpretation
- Rooted by therapeutic class terms, other terms pruned
Approach can be applied to MONDO
Precision levels
- low
- medium
- high

nxontology-ml/issues/2

precision levels

Low
- group diseases with some, but often not many, shared characteristics
- indications in early stage clinical trials
- heterogeneous patient population
- used by ontology for organization & completeness
Medium
- indications in later stage clinical trials
- group patients with a condition with shared physiological or environmental origin
- diagnosable
High
- small groups of relatively homogeneous patients
- greater diagnostic certainty
- represent the forefront of clinical practice (precision medicine)

nxontology-ml/issues/2

early feedback

Is Bardet-Biedl syndrome a single disease entity?
Is Alzheimer disease a single disease?
But Alzheimer's has a number of different genetic etiologies - are you sure it's a single disease?

nxontology-ml/issues/2

applications

How many diseases are there?
Is there an existing treatment for SLE Type 16? Downward propagation of SLE treatments makes sense, but probably not a treatment for connective tissue disease.
What diseases should we predict treatments for?
Untangling the hairball.

nxontology-ml/issues/2

Training labels

Semi-automated hand labeling of terms in v3.43.0

features

feature groups:

topology
cross-references
prefixes
subsets
descriptions
gpt tags
gwas

topology features

derived solely from the topology of the directed acyclic graph

depth
n_ancestors
n_descendants
n_parents
n_roots
n_children
n_leaves
intrinsic_ic
intrinsic_ic_scaled
intrinsic_ic_sanchez
intrinsic_ic_sanchez_scaled

source code
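
The topology features can be illustrated on a toy graph. The following is a minimal pure-Python sketch, not the nxontology implementation: the five-node ontology is hypothetical, ancestors and descendants are assumed to include the node itself, intrinsic_ic is assumed to be the negative log of the fraction of nodes that are descendants, and intrinsic_ic_sanchez is assumed to follow the leaves-over-ancestors form. The library may differ in details.

```python
import math

# Hypothetical toy ontology: each key is a parent (broader term),
# each value lists its children (narrower terms).
children = {
    "disease": ["autoimmune disease", "neurological disease"],
    "autoimmune disease": ["multiple sclerosis"],
    "neurological disease": ["multiple sclerosis", "Alzheimer disease"],
    "multiple sclerosis": [],
    "Alzheimer disease": [],
}


def descendants(node):
    """Return the node itself plus every node reachable below it."""
    seen = {node}
    stack = [node]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def ancestors(node):
    """Return the node itself plus every node that has it as a descendant."""
    return {candidate for candidate in children if node in descendants(candidate)}


# Leaves are nodes without children.
all_leaves = {node for node, kids in children.items() if not kids}


def topology_features(node):
    """Compute a subset of the topology features for one node."""
    desc = descendants(node)
    anc = ancestors(node)
    leaves = desc & all_leaves
    n_nodes = len(children)
    return {
        "n_ancestors": len(anc),
        "n_descendants": len(desc),
        "n_parents": sum(node in kids for kids in children.values()),
        "n_children": len(children[node]),
        "n_leaves": len(leaves),
        # Assumed definition: rarer (more specific) nodes carry more information.
        "intrinsic_ic": -math.log(len(desc) / n_nodes),
        # Assumed Sanchez-style definition based on leaves and ancestors.
        "intrinsic_ic_sanchez": -math.log(
            (len(leaves) / len(anc) + 1) / (len(all_leaves) + 1)
        ),
    }


print(topology_features("multiple sclerosis"))
```

In this toy graph the root scores zero on both information content measures, while leaf terms score highest, which is the kind of signal that could separate low precision grouping terms from high precision ones.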

cross-reference features

doid
gard
icd10
icd9
meddra
mesh
mondo
ncit
omim
omim.ps
orphanet
snomedct
umls

cross-references (xref) counts by external database.

nxontology-data performs bioregistry normalization of xref prefixes

source code
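
Counting xrefs per database can be sketched in a few lines. This is an illustration under assumptions, not the project's code: the feature naming scheme and the example identifiers are made up, and lowercasing the prefix stands in for the bioregistry normalization that nxontology-data performs.

```python
from collections import Counter

# Database prefixes used as features (from the list above).
XREF_PREFIXES = [
    "doid", "gard", "icd10", "icd9", "meddra", "mesh", "mondo",
    "ncit", "omim", "omim.ps", "orphanet", "snomedct", "umls",
]


def xref_count_features(xrefs):
    """Count a term's xrefs per database prefix.

    xrefs are assumed to be CURIE-like strings of the form 'prefix:identifier'.
    The 'xref__<prefix>__count' naming is hypothetical.
    """
    counts = Counter(xref.split(":", 1)[0].lower() for xref in xrefs)
    return {f"xref__{prefix}__count": counts.get(prefix, 0) for prefix in XREF_PREFIXES}


# Made-up xrefs for illustration only; not copied from EFO.
example_xrefs = ["mesh:EXAMPLE1", "umls:EXAMPLE2", "umls:EXAMPLE3", "icd10:EXAMPLE4"]
features = xref_count_features(example_xrefs)
print(features["xref__umls__count"])  # 2
```

Emitting a zero for every tracked prefix keeps the feature vector the same width for all terms, including those without any xrefs.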

prefix features

EFO: 12,312
MONDO: 9,061
Orphanet: 2,052
HP: 1,384
GO: 335
OBA: 40
OTAR: 9
DOID: 6
OBI: 4
NCIT: 3
MP: 2
OGMS: 1

EFO includes terms imported from other ontologies.
Counts from EFO OTAR Slim v3.57.0

from collections import Counter

from nxontology import NXOntology

# EFO OTAR Slim export from nxontology-data, pinned to a specific commit
url = "https://github.com/related-sciences/nxontology-data/raw/2ce01d8495024d46cbc54fb0c26a92500ad717e0/efo_otar_slim.json"
nxo = NXOntology.read_node_link_json(url)

# Count nodes by identifier prefix (the part before the first colon)
prefix_counts = Counter()
for node in nxo.graph.nodes:
    prefix, _ = node.split(":", 1)
    prefix_counts[prefix] += 1

# Print prefixes from most to least common
for prefix, count in prefix_counts.most_common():
    print(f"{prefix}: {count:,}")

source code, issue

subset features

mondo#disease_grouping
mondo#gard_rare
mondo#ordo_etiological_subtype
mondo#ordo_group_of_disorders

EFO includes subsets, many imported from MONDO. We're not actually sure how to find documentation for these subsets, but they sound relevant.
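
One plausible way to turn subset membership into model inputs is one binary indicator per subset. The sketch below assumes each term carries a list of subset names; the function name and the example memberships are hypothetical.

```python
# Subsets used as features (from the list above).
SUBSETS = [
    "mondo#disease_grouping",
    "mondo#gard_rare",
    "mondo#ordo_etiological_subtype",
    "mondo#ordo_group_of_disorders",
]


def subset_features(node_subsets):
    """Return a 0/1 indicator for each tracked subset."""
    members = set(node_subsets)
    return {subset: int(subset in members) for subset in SUBSETS}


# Hypothetical term that belongs to two of the tracked subsets
# and one subset that is not tracked.
example = subset_features(
    ["mondo#gard_rare", "mondo#ordo_etiological_subtype", "some_other_subset"]
)
print(example)
```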

GWAS feature

EFO notes whether a term is a GWAS trait.

description features

EFO Node description (term name & definition)

Transformer Model ("BioLinkBERT")

Large "Embedding" Vector (768 dimensions)

Dimensionality reduction (e.g. PCA, LDA)

Compressed "Embedding" Vector (64 dimensions)

source code

MONDO:0005301 🡒 multiple sclerosis: A progressive autoimmune disorder affecting the central nervous system resulting in demyelination. Patients develop physical and cognitive impairments that correspond with the affected nerve fibers. [NCIT:P378]

ChatGPT features

source code

A list of records will be provided from an ontology of disease terms. Each record will contain information describing a single term.

Assign a `precision` label to each of these terms that captures the extent to which they correspond to patient populations with distinguishing clinical, demographic, physiological or molecular characteristics. Use exactly one of the following values for this label:

- `high`: High precision terms have the greatest ontological specificity, sometimes (but not necessarily) correspond to small groups of relatively homogeneous patients, often have greater diagnostic certainty and typically represent the forefront of clinical practice.
- `medium`: Medium precision terms are the ontological ancestors of `high` precision terms (if any are known), often include indications in later stage clinical trials and generally reflect groups of patients assumed to be suffering from a condition with a shared, or at least similar, physiological or environmental origin.
- `low`: Low precision terms are the ontological ancestors of both `medium` and `high` precision terms, group collections of diseases with *some* shared characteristics and typically connote a relatively heterogenous patient population. They are often terms used within the ontology for organizational purposes.

The records provided will already have the following fields:

- `id`: A string identifier for the term
- `label`: A descriptive name for the term
- `description`: A longer, possibly truncated description of what the term is; may be NA (i.e. absent)

Here is a list of such records (in YAML format) where the `precision` label is already assigned for 3 examples at each level of precision:

Examples:

Prompt: https://gist.github.com/yonromai/daa0bd3c8e9f81250a68c3f6b614598d
Completion 1: https://gist.github.com/yonromai/68e74c7a8032d9f59e427f6196652d91
Completion 2: https://gist.github.com/yonromai/f2b852e20f42bfea2da13b3e755267fe
Completion 3: https://gist.github.com/yonromai/66ff9de3141cb4dcd35cd11cd06c936e

Each prompt request generates 3 responses, used as votes
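
Treating the three responses as votes can be sketched as a per-term tally. This is an assumption about how the votes might become features, not the project's code: the input shape (one dict of term id to label per completion), the function name, and the term ids are hypothetical.

```python
from collections import Counter

LEVELS = ("low", "medium", "high")


def tally_votes(completions):
    """Count, for each term, how many completions voted for each precision level.

    completions: list of dicts mapping term id -> precision label,
    one dict per model response.
    """
    tallies = {}
    for completion in completions:
        for term_id, label in completion.items():
            if label not in LEVELS:
                continue  # skip labels outside the three allowed values
            tallies.setdefault(term_id, Counter())[label] += 1
    # Expand each Counter so every level is present, with 0 for no votes.
    return {
        term_id: {level: counts[level] for level in LEVELS}
        for term_id, counts in tallies.items()
    }


# Three hypothetical completions for two made-up term ids.
votes = tally_votes([
    {"TERM:1": "high", "TERM:2": "low"},
    {"TERM:1": "high", "TERM:2": "medium"},
    {"TERM:1": "medium", "TERM:2": "low"},
])
print(votes["TERM:1"])
```

Keeping the full vote counts, rather than only the majority label, preserves how much the responses disagreed on a term.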

poll

Which feature group has the greatest influence on the outcome?

Slido: 1342 945

topology
cross-references
prefixes
subsets
descriptions
gpt tags
gwas

ML pipeline overview

catboost slide

todo

Source: https://catboost.ai/news/catboost-enables-fast-gradient-boosting-on-decision-trees-using-gpus

model performance

feature group importance

top individual features

intrinsic_ic_sanchez source

Predictions & Availability

Visualization: precision by border

low = dotted
medium = dashed
high = solid

efo_otar_slim_v3.57.0_precisions.tsv

questions

next steps

data users
collaborators
contributors
predictions internalized to EFO / MONDO?

By Daniel Himmelstein

https://github.com/related-sciences/nxontology-ml/issues/13

Daniel Himmelstein

Head of Data Integration at Related Sciences. Digital craftsman of the biodata revolution.
