Integrating biomedical knowledge using hetnets

Rising Stars Symposium in Data Science

University of Chicago

KCBD 1103 · 4:00 PM

September 13, 2017

Slides at slides.com/dhimmel/chicago

The hetnet awakens: understanding complex diseases through data integration and open science

Greene Lab

I'm a data scientist

http://www.greenelab.com/

Abstract

Approaches in network medicine have traditionally focused on generating insights from graphs with a single type of node and relationship. However, biology's complexity demands a richer network structure capable of integrating diverse, multi-scale information. Towards this end, we develop hetnets — networks with multiple types of nodes and relationships.

Specifically we created Hetionet — a network of biology, disease, and pharmacology. This resource encodes knowledge from millions of biomedical studies over the last half century. Version 1.0 contains 47,031 nodes of 11 types and 2,250,197 relationships of 24 types. We host a public Neo4j database instance at https://neo4j.het.io allowing users to interact with Hetionet.

In Project Rephetio, we applied Hetionet to predict new uses for existing drugs. Our approach learned the network patterns of connectivity that differentiate treatments from non-treatments, enabling us to predict the probability of treatment for 209,168 compound–disease pairs. These predictions prioritize treatments under investigation by clinical trial.

Going forward, we're investigating more efficient algorithms for feature extraction on hetnets. In addition, we're looking to automate hetnet construction by text mining the literature. The success of hetnets will depend on the availability of openly licensed inputs. As such, I'll briefly discuss data analyses I've performed in hopes of making science more open.

How I became intestested in graphs

http://blog.dhimmel.com/friendship-network/

My Facebook friendship network in 2014

Graphs are composed of:

  • Nodes
  • Relationships

Nodes / relationships have type:

  • node types
    (person, course, university)
  • relationship types
    (lecturer, institution)
  • first_name: Daniel
  • last_name: Himmelstein
  • twitter: @dhimmel
  • SSN: 012-34-5678 
  • grade: A-
  • grade: B
  • grade: F

What can we do with this graph?

  • Course statistics:
    How many students are in CIS 550?
     
  • Course recommendations:
    What courses do other students in CIS 550 take?
     
  • Course scheduling:
    What room should a course be in so it's nearby other courses that its students are enrolled in?

networks with multiple node or relationship types

multilayer network, multiplex network, multivariate network, multinetwork, multirelational network, multirelational data, multilayered network, multidimensional network, multislice network, multiplex of interdependent networks, hypernetwork, overlay network, composite network, multilevel network, multiweighted graph, heterogeneous network, multitype network, interconnected networks, interdependent networks, partially interdependent networks, network of networks, coupled networks, interconnecting networks, interacting networks, heterogenous information network

A 2012 Study identified 26 different names for this type of network:

hetnet

How do you teach a computer biology?

Visualizing Hetionet v1.0

  • Hetnet of biology for drug repurposing
     
  • ~50 thousand nodes
    11 types (labels)
     
  • ~2.25 million relationships
    24 types
     
  • integrates 29 public resources
    knowledge from millions of studies

Hetionet v1.0

  • Nodes
    • standardized vocabularies
    • stable, unambiguous identifiers
       
  • Relationships:
    • Omics scale required
    • Literature mining
    • High throughput experimental technologies
    • Avoid manual mapping
       
  • Versioned data dependencies with GitHub commit hash URLs
    pandas-dev/pandas#14576

Constructing Hetionet v1.0

>>> import phd

What's the best software for storing and querying hetnets?

dhimmel/hetio
86
5
2
neo4j/neo4j
42,498
3,071
1,007

GitHub stats from 2016-10-09

  • Customized Docker image
  • Digital Ocean droplet
  • SSL from Let's Encrypt
  • readonly mode with a query execution timeout
  • Custom GRASS style
  • Custom guides

Public Hetionet Neo4j Instance

Details at doi.org/brsc

MATCH path =
  // Specify the type of path to match
  (n0:Disease)-[e1:ASSOCIATES_DaG]-(n1:Gene)-[:INTERACTS_GiG]-
  (n2:Gene)-[:PARTICIPATES_GpBP]-(n3:BiologicalProcess)
WHERE
  // Specify the source and target nodes
  n0.name = 'multiple sclerosis' AND
  n3.name = 'retina layer formation'
  // Require GWAS support for the
  // Disease-associates-Gene relationship
  AND 'GWAS Catalog' in e1.sources
  // Require the interacting gene to be
  // upregulated in a relevant tissue
  AND exists(
    (n0)-[:LOCALIZES_DlA]-(:Anatomy)-[:UPREGULATES_AuG]-(n2))
RETURN path

How could multiple sclerosis could affect retina layer formation?

More queries at thinklab.com/d/220

Project Rephetio contributions on Thinklab

(see thinklab.com/p/rephetio/leaderboard)

Project Rephetio: drug repurposing predictions

  • Hetionet v1.0 contains:
    • 1,538 connected compounds
    • 136 connected diseases
    • 209,168 compound–disease pairs
    • 755 treatments
  • Systematic drug repurposing:
    • Compare the therapeutic utility of data types
    • Identify the mechanisms of drug efficacy
    • Predict the probability of treatment for all 209,168 compound–disease pairs (het.io/repurpose)

Systematic integration of biomedical knowledge prioritizes drugs for repurposing
Daniel S Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, Sergio E Baranzini
In pres at eLife. https://doi.org/10.1101/087619

features = metapaths

observations =

compound–disease pairs

positives = treatments

negatives =

non-treatments

Machine learning methodology

Predictions succeed at prioritizing known treatments

1,206 compound–disease metapaths (length ≤ 4)

  1. Upper tier:
    traditional pharmacology
  2. Upper-middle tier:
    traditionally biomedicine, but newer in drug efficacy
  3. Lower-middle tier:
    genome-wide / high-throughput data sources
  4. Lower tier:
    cellular components

Browse at het.io/repurpose/metapaths.html

Project Rephetio: Does bupropion treat nicotine dependence?

  • Bupropion was first approved for depression in 1985
     
  • In 1997, bupropion was approved for smoking cessation
     
  • Can we predict this repurposing from Hetionet? The prediction was:

Compound–causes–SideEffect–causes–Compound–treats–Disease

Compound–binds–Gene–binds–Compound–treats–Disease

Compound–binds–Gene–associates–Disease

Compound–binds–Gene–participates–Pathway–participates–Disease

MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-
  (n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
USING JOIN ON n2
WHERE n0.name = 'Bupropion'
  AND n4.name = 'nicotine dependence'
  AND n1 <> n3
WITH
[
  size((n0)-[:BINDS_CbG]-()),
  size(()-[:BINDS_CbG]-(n1)),
  size((n1)-[:PARTICIPATES_GpPW]-()),
  size(()-[:PARTICIPATES_GpPW]-(n2)),
  size((n2)-[:PARTICIPATES_GpPW]-()),
  size(()-[:PARTICIPATES_GpPW]-(n3)),
  size((n3)-[:ASSOCIATES_DaG]-()),
  size(()-[:ASSOCIATES_DaG]-(n4))
] AS degrees, path
RETURN
  path,
  reduce(pdp = 1.0, d in degrees| pdp * d ^ -0.4) AS path_weight
ORDER BY path_weight DESC
LIMIT 10

Cypher query to find the top CbGbPWaD paths

Try at https://neo4j.het.io

Browse all predictions at het.io/repurpose. Discuss at thinklab.com/d/224

Top 100 epilepsy predictions & their chemical structure

Top 100 epilepsy predictions & their drug targets

Nice of you to share this big network with everyone; however, I think you need to take care not to get yourself into legal trouble here. … 

I am not trying to cause trouble here — just the contrary. When making a meta-resource, licenses and copyright law are not something you can afford to ignore. I regularly leave out certain data sources from my resources for legal reasons.

One network to rule them all

We have completed an initial version of our network. …

Network existence (SHA256 checksum for graph.json.gz) is proven in Bitcoin block 369,898.

Discussion DOIs: bfmkbfmmbfmnbfmp

  • Hetionet (≤ v1.0) integrated data from 31 resources:
    • 5 United States Government works
    • 12 openly licensed
    • 4 non-commercial use only
    • 9 were all rights reserved
    • 1 explicitly & contractually forbid reuse
  • Requested permission for 11 resources:
    • median time to first response was 16 days
    • 2 affirmative responses
  • Other considerations:
    • who owns data
    • incompatibilities: share alike vs non-commercial
    • copyright status of data & fair use
  • Solution: license attribute per node/relationship

Legal barriers to data reuse

Recommendations:

  1. release data under an open license
  2. University researchers: commit to open in your resource sharing plan

hetmech

Kyle Kloster

@kkloste

Michael Zietz

@zietzm

https://github.com/greenelab/hetmech

the hetnet search engine

Future: all biomedical knowledge in a single network

https://github.com/greenelab/snorkeling

  • Teach computers how to read the literature and extract knowledge.
     
  • Continuously and automatically refine and grow the hetnet.
     
  • Free from any legal restrictions on reuse. 

David Robinson

@danich1

  • Sci-Hub: "the first pirate website in the world to provide mass and public access to tens of millions of research papers"
  • Computed Sci-Hub's coverage based on DOIs
  • Sci-Hub provides access to nearly all scholarly literature
  • 85.2% of the 54 million articles in subscription journals
  • 97.3% of Elsevier

Himmelstein​, Romero, McLaughlin, Greshake Tzovaras, Greene​ (2017) PeerJ Preprints https://doi.org/b9s5

Headlines:

  • Science: Sci-Hub’s cache of pirated papers is so big, subscription journals are doomed, data analyst suggest
  • Inside Higher Ed: Inevitably Open
  • Quartz: A pirating service for academic journal articles could bring down the whole establishment

https://doi.org/b9s5

Sci-Hub  ⇒ open scholarly literature?

Manubot

powering the next generation of scholarly manuscript

The Manubot project began with the [Deep Review](https://github.com/greenelab/deep-review),
where it was used to compose a highly-collaborative review article [@doi:10.1101/142760].
Other manuscripts that were created with Manubot include:

+ The Sci-Hub Coverage Study
  ([GitHub](https://github.com/greenelab/scihub-manuscript), [HTML manuscript](https://greenelab.github.io/scihub-manuscript/)) 
  [@doi:10.7287/peerj.preprints.3100]
+ Michael Zietz's Report for the Vagelos Scholars Program
  ([GitHub](https://github.com/zietzm/Vagelos2017), [HTML manuscript](https://zietzm.github.io/Vagelos2017/)) 
  [@doi:10.6084/m9.figshare.5346577]

The Manubot project began with the Deep Review, where it was used to compose a highly-collaborative review article [1]. Other manuscripts that were created with Manubot include:

1. Opportunities And Obstacles For Deep Learning In Biology And Medicine
Travers Ching, Daniel S. Himmelstein, Brett K. Beaulieu-Jones, Alexandr A. Kalinin, Brian T. Do, Gregory P. Way, Enrico Ferrero, Paul-Michael Agapow, Wei Xie, Gail L. Rosen, … Casey S. Greene
Cold Spring Harbor Laboratory (2017-05-28) https://doi.org/10.1101/142760

2. Sci-Hub provides access to nearly all scholarly literature
Daniel S Himmelstein, Ariel R Romero, Stephen R McLaughlin, Bastian Greshake Tzovaras, Casey S Greene
PeerJ Preprints (2017-07-20) https://doi.org/10.7287/peerj.preprints.3100

3. Vagelos Report Summer 2017
Michael Zietz
Figshare (2017) https://doi.org/10.6084/m9.figshare.5346577

Write markdown

Automatically converted to rich text

Automatic bibliographic metadata

[@doi:10.7287/peerj.preprints.3100]
[@arxiv:1407.3561v1]
[@pmid:24159271]
[@url:http://blog.dhimmel.com/biorxiv-licenses/]

1. Modify the manuscript source

2. Continuous integration rebuilds the manuscript

Timestamped on the Bitcoin blockchain via OpenTimestamps

3. Continuous deployment back to GitHub

https://greenelab.github.io/deep-review/

The Deep Review

Pull requests for manuscript collaboration

Get started at tiny.cc/manubot

 

https://github.com/greenelab/manubot-rootstock

Questions?

@dhimmel

0000-0002-3012-7446

Rising Stars in Data Science: The hetnet awakens in Chicago

By Daniel Himmelstein

Rising Stars in Data Science: The hetnet awakens in Chicago

Presentation on 2017-09-13 at the Rising Stars Symposium in Data Science, University of Chicago. This presentation is released under a CC BY 4.0 License.

  • 3,242