Integrate all: hetnets in human disease.

September 24, 2015

Smilow 10-120, UPenn

dhimmel on:

Greene Lab Interview

—Daniel Himmelstein

Moore Lab Lunch

Jesse's Tavern

August, 2007

Sandler Neurosciences Center

Sergio

Hodgkin's lymphoma is genetically closer to autoimmune diseases than solid cancers

Founding Insight

  • context
    bioinformatics data explosion
  • goal
    mine the data to advance human health
  • problem
    high-throughput data tends to predict weakly
  • remedy
    combine diverse datasets into a strong predictor
  • method
    heterogeneous network (hetnet) edge prediction

predicting disease-associated genes

wealth of GWAS associations

integrate diverse data to provide context

network of pathogenesis:

 

  • 18 metanodes
  • 40,343 nodes
     
  • 19 metaedges
  • 1,608,168 edges

type is essential when operating on hetnets

metapath-based approach

feature extraction: the DWPC
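
A minimal sketch of a degree-weighted path count (DWPC), assuming the usual formulation: every path that follows a given metapath contributes the product, over its edges, of (start-node degree × end-node degree) raised to -w, with degrees counted within that edge type only. The node names, edge kinds, and the damping exponent w = 0.5 are illustrative assumptions, not values from the talk.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Index undirected typed edges as adj[kind][node] -> set of neighbors."""
    adj = defaultdict(lambda: defaultdict(set))
    for u, kind, v in edges:
        adj[kind][u].add(v)
        adj[kind][v].add(u)
    return adj

def dwpc(adj, source, target, metapath, w=0.5):
    """Sum degree-damped weights over all paths that follow `metapath`.

    `metapath` is a list of edge kinds. Paths may not revisit a node.
    With w=0 the result is the plain path count.
    """
    total = 0.0
    stack = [(source, 0, 1.0, (source,))]  # node, step, weight so far, visited
    while stack:
        node, step, weight, visited = stack.pop()
        if step == len(metapath):
            if node == target:
                total += weight
            continue
        kind = metapath[step]
        for nxt in adj[kind][node]:
            if nxt in visited:
                continue
            # Damp by the type-specific degrees of both endpoints of this edge.
            damp = (len(adj[kind][node]) * len(adj[kind][nxt])) ** -w
            stack.append((nxt, step + 1, weight * damp, visited + (nxt,)))
    return total

# Hypothetical toy hetnet: genes (g*), tissues (t*), diseases (d*).
toy_edges = [
    ("g1", "expression", "t1"), ("g2", "expression", "t1"),
    ("g1", "expression", "t2"),
    ("t1", "localization", "d1"), ("t1", "localization", "d2"),
    ("t2", "localization", "d1"),
]
adj = build_adjacency(toy_edges)
gene_disease_metapath = ["expression", "localization"]
print(dwpc(adj, "g1", "d1", gene_disease_metapath, w=0.5))
print(dwpc(adj, "g1", "d1", gene_disease_metapath, w=0.0))
```

The path through the well-connected tissue t1 counts for less than the path through t2, which is the point of the degree weighting.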

learning to classify:

regularized logistic regression
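
A minimal sketch of regularized logistic regression, assuming a ridge (L2) penalty fitted by plain gradient descent. The slide does not say which penalty or solver was used, so `lam`, `lr`, `epochs`, and the toy data are all assumptions for illustration. In this setting each row of `X` would hold metapath features for one gene-disease pair.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_ridge_logistic(X, y, lam=0.1, lr=0.5, epochs=2000):
    """Fit an intercept and coefficients; the L2 penalty skips the intercept."""
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(epochs):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            err = sigmoid(beta0 + sum(b * x for b, x in zip(beta, xi))) - yi
            g0 += err / n
            for j in range(p):
                g[j] += err * xi[j] / n
        beta0 -= lr * g0
        # The penalty gradient lam * b shrinks each coefficient toward zero.
        beta = [b - lr * (gj + lam * b) for b, gj in zip(beta, g)]
    return beta0, beta

def predict_proba(beta0, beta, x):
    return sigmoid(beta0 + sum(b * v for b, v in zip(beta, x)))

# Hypothetical one-feature training set.
X_toy = [[0.0], [0.2], [0.8], [1.0]]
y_toy = [0, 0, 1, 1]
b0, b = fit_ridge_logistic(X_toy, y_toy, lam=0.1)
print(predict_proba(b0, b, [1.0]), predict_proba(b0, b, [0.0]))
```

A larger `lam` shrinks the coefficients more strongly, which helps when many features are correlated.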

performance and permutation

mechanisms of pathogenesis:

comparing gene set collections

mechanisms of pathogenesis:

comparing metapaths

Predicting withheld MS associations

Novel MS associations

Extra Slides Below

gene extraction from the GWAS Catalog

Feature redundancy

Metanodes (node types)

Metaedges (edge types)

  • robustness

Network subsampling

drug repurposing

massively collaborative open science

22 reviewers, 47 discussions, 266 comments

Mining knowledge from 69 years of biomedical publication

  • MEDLINE: curators annotate paper topics
     
  • 21 million articles
  • since 1946
  • 5,594 journals
     
  • co-occurrence of two topics indicates a relation
  • disease-symptom co-occurrence in 363,928 articles; disease-anatomy co-occurrence in 696,252
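
A minimal sketch of topic co-occurrence counting, assuming each article is represented as a set of annotated topics. The topic labels are hypothetical, and comparing the observed count with the count expected under independence is an assumption here, not necessarily the test used in the project.

```python
from collections import Counter
from itertools import product

def cooccurrence(articles, diseases, symptoms):
    """Return {(disease, symptom): (observed, expected)} over all articles."""
    n = len(articles)
    topic_counts = Counter()
    pair_counts = Counter()
    for topics in articles:
        topics = set(topics)
        ds = topics & diseases
        ss = topics & symptoms
        topic_counts.update(ds | ss)
        pair_counts.update(product(sorted(ds), sorted(ss)))
    rows = {}
    for (d, s), observed in pair_counts.items():
        # Expected co-occurrences if the two topics were annotated independently.
        expected = topic_counts[d] * topic_counts[s] / n
        rows[(d, s)] = (observed, expected)
    return rows

# Hypothetical annotated articles.
toy_articles = [
    {"disease_A", "symptom_X"},
    {"disease_A", "symptom_X"},
    {"disease_A"},
    {"disease_B", "symptom_Y"},
    {"symptom_X"},
    {"disease_B"},
]
rows = cooccurrence(toy_articles, {"disease_A", "disease_B"}, {"symptom_X", "symptom_Y"})
print(rows)
```

Pairs whose observed count clearly exceeds the expected count are candidates for a disease-symptom edge.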


anatomy

symptom

Mining MEDLINE for disease context

LINCS L1000: ~20,000 small molecules

We combine all signatures for each DrugBank compound to get a consensus signature
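
A minimal sketch of building a consensus signature, assuming each signature is a mapping from gene to z-score and that signatures are combined with an unweighted Stouffer's method. The slide does not state the combination rule, so the method and the gene names are assumptions.

```python
import math

def consensus_signature(signatures):
    """Combine per-gene z-scores across signatures: sum(z) / sqrt(k)."""
    k = len(signatures)
    genes = signatures[0].keys()  # assumes every signature covers the same genes
    return {g: sum(sig[g] for sig in signatures) / math.sqrt(k) for g in genes}

# Two hypothetical signatures for one compound.
toy_signatures = [
    {"gene_1": 2.0, "gene_2": -1.0},
    {"gene_1": 2.0, "gene_2": 1.0},
]
consensus = consensus_signature(toy_signatures)
print(consensus)
```

Genes that move consistently across signatures are reinforced, while inconsistent genes cancel toward zero.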

Brueggeman

transcriptional signatures discriminate

diuretics (DR) & anti-inflammatories (AA)

← genes 

Disease-specific models uncover therapeutic signatures

  • MEDI-HPS:
    • RxNorm
    • MedlinePlus
    • SIDER 2
    • Wikipedia
  • ehrlink: linked data from health records
  • LabeledIn: expert and MTurk curated drug labels
  • PREDICT:
    • UMLS links
    • drugs.com
    • drug labels

catalog of indications

aggregated 4 databases:

yielding 1,388 indications

drug                  disease                       ajg  csh  eq
Acetylsalicylic acid  systemic lupus erythematosus  DM   DM   1
Alprazolam            systemic scleroderma          SYM  SYM  1
Baclofen              multiple sclerosis            SYM  SYM  1
Bupropion             panic disorder                SYM  DM   0
Captopril             rheumatoid arthritis          NOT  NOT  1
Cisplatin             hematologic cancer            DM   DM   1
Cladribine            hematologic cancer            DM   DM   1
Clopidogrel           coronary artery disease       DM   DM   1
Cocaine               dental caries                 NOT  SYM  0

class  ajg  csh
DM     26   32
SYM    20   17
NOT    4    1

curation pilot results

  • ~58% disease modifying
  • ~37% symptomatic
  • ~5% non-indications

66% ✓
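
The pilot percentages follow from the per-curator class counts, and the agreement rate can be checked the same way. The sketch below recomputes both from the numbers on the slides. The nine listed indications are only an excerpt, so their agreement rate need not match the 66% figure, which presumably covers the full pilot.

```python
# Calls by the two curators for the nine indications listed on the slide.
calls = [
    ("DM", "DM"), ("SYM", "SYM"), ("SYM", "SYM"), ("SYM", "DM"), ("NOT", "NOT"),
    ("DM", "DM"), ("DM", "DM"), ("DM", "DM"), ("NOT", "SYM"),
]
excerpt_agreement = sum(a == b for a, b in calls) / len(calls)

# Class counts per curator from the summary table.
counts = {"DM": (26, 32), "SYM": (20, 17), "NOT": (4, 1)}
total = sum(a + b for a, b in counts.values())
shares = {cls: (a + b) / total for cls, (a, b) in counts.items()}

print(round(excerpt_agreement, 3))
print({cls: round(share, 2) for cls, share in shares.items()})
```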

legal forays of a modern biodata scientist

Network contains data from 28 resources:

  • 12 lack any licensing information
  • 10 use standard licenses
  • 6 use custom licenses
  • 3 resources are publication supplements
  • 6 forbid commercial use
  • 2 forbid any redistribution of the data

Project is open notebook and maximally reusable

After a 5000+ word discussion:

  • identified a path to compliance that minimizes sacrifice
  • suggestion: release your data under CC0 

results

metapath   nonzero  auroc
CcSEcCiD   0.590    0.897
CiDiCiD    0.255    0.840
CiDaGaD    0.411    0.820
CtGtCiD    0.227    0.807
CiDpSpD    0.410    0.801
CiDlAlD    0.400    0.797
CtGiGaD    0.335    0.710
CuG<kuGaD  0.425    0.692
CtGvD      0.009    0.514

Results for all metapaths ≤ length 3
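
A minimal sketch of how per-feature numbers like these could be computed, assuming `auroc` is the area under the ROC curve of a single feature and `nonzero` is the fraction of compound-disease pairs with a nonzero feature value. Both readings of the column names, and the toy scores, are assumptions.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def nonzero_fraction(scores):
    """Fraction of pairs for which the feature is nonzero."""
    return sum(s != 0 for s in scores) / len(scores)

# Hypothetical feature values and indication labels for four pairs.
toy_scores = [0.0, 0.0, 0.3, 0.9]
toy_labels = [0, 1, 0, 1]
print(auroc(toy_scores, toy_labels), nonzero_fraction(toy_scores))
```

An AUROC near 0.5 with a tiny nonzero fraction, as for the last row above, indicates a feature that is almost always zero and therefore barely informative.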

feature performance

(including symptomatic indications)

model

collinearity

subset of all 261 features

hetnets

  • data integration
    • scalable
    • intuitive
  • edge prediction
    • disease-associated genes
    • drug indications
  • powerful
    • upfront integration cost
    • subsequent convenience
>>> import phd