The hetnet awakens in Philadelphia.

biology ⭃ network

DataPhilly Meetup

Papadakis Building, Room 120

June 6, 2017

Slides at

The hetnet awakens: understanding complex diseases through data integration and open science

Greene Lab

I'm a data scientist

DataPhilly Talk Abstract:

Hetnets are networks with multiple node and relationship types. He will discuss when hetnets are the right tool for integrating and analyzing diverse types of data. Specifically, he'll showcase Project Rephetio, which predicts new uses for existing drugs. This project created Hetionet, a network with 2.25 million relationships of 24 types, allowing researchers to ask questions that span the many realms of biomedical knowledge.


About Daniel Himmelstein:

Daniel Himmelstein is a "digital craftsman of the biodata revolution" who works in the Greene Lab at Penn. In 2016, he received his PhD in Biological & Medical Informatics from UCSF. Daniel leads the Cognoma (DataPhilly) Datathon and was a finalist for "Scientist of the Year" in the 2016 Philly Geek Awards. His research focuses on integrating open data to uncover the secrets of human health.

How I became interested in graphs

My Facebook friendship network in 2014

Graphs are composed of:

  • Nodes
  • Relationships

Nodes / relationships have type:

  • node labels
    (person, course, university)
  • relationship types
    (lecturer, institution)
  • first_name: Daniel
  • last_name: Himmelstein
  • twitter: @dhimmel
  • SSN: 012-34-5678 
  • catalog: CIS550
  • title: Database & Information Systems
  • units: 5
  • catalog: EPID600
  • title: Data Science for Biomedical Informatics
  • units: 1
  • url:
  • founded: 1740
  • type: private
  • league: ivy
  • grade: A-
  • grade: B
  • grade: F

What can we do with this graph?

  • Course statistics:
    How many students are in CIS 550?
  • Course recommendations:
    What courses do other students in CIS 550 take?
  • Course scheduling:
    What room should a course be in so it's near the other courses its students are enrolled in?
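
The first two questions above can be sketched in plain Python. This is a minimal illustration, not how the talk's graph was built: the students Alice and Bob, the "student" relationship type, and every row in `relationships` are hypothetical. Only the labels, course catalog codes, titles, and grade values come from the slide.

```python
from collections import Counter

# Nodes: id -> (label, properties).
nodes = {
    "daniel": ("person", {"first_name": "Daniel", "last_name": "Himmelstein"}),
    "alice": ("person", {"first_name": "Alice"}),  # hypothetical student
    "bob": ("person", {"first_name": "Bob"}),      # hypothetical student
    "CIS550": ("course", {"title": "Database & Information Systems"}),
    "EPID600": ("course", {"title": "Data Science for Biomedical Informatics"}),
}

# Relationships: (source, type, target, properties).
# All rows are hypothetical; "student" is an assumed relationship type
# (the slide names only "lecturer" and "institution").
relationships = [
    ("alice", "student", "CIS550", {"grade": "A-"}),
    ("bob", "student", "CIS550", {"grade": "B"}),
    ("alice", "student", "EPID600", {"grade": "F"}),
    ("daniel", "lecturer", "EPID600", {}),
]


def students_in(course):
    """Return the people that have a 'student' relationship to the course."""
    return {s for s, kind, t, _ in relationships if kind == "student" and t == course}


def co_enrolled_courses(course):
    """Count the other courses taken by students of the given course."""
    classmates = students_in(course)
    return Counter(
        t
        for s, kind, t, _ in relationships
        if kind == "student" and s in classmates and t != course
    )


print(len(students_in("CIS550")))     # course statistics
print(co_enrolled_courses("CIS550"))  # course recommendations
```

Both questions are answered by following typed relationships outward from a node, which is the access pattern graph databases are designed around.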

networks with multiple node or relationship types

A 2012 study identified 26 different names for this type of network:

multilayer network, multiplex network, multivariate network, multinetwork, multirelational network, multirelational data, multilayered network, multidimensional network, multislice network, multiplex of interdependent networks, hypernetwork, overlay network, composite network, multilevel network, multiweighted graph, heterogeneous network, multitype network, interconnected networks, interdependent networks, partially interdependent networks, network of networks, coupled networks, interconnecting networks, interacting networks, heterogeneous information network


The Relational Database Model


  1. Relationships require an intermediate table
  2. Schemas are cumbersome to create and maintain
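
To make the first point concrete, here is a small sketch using Python's built-in `sqlite3` module. The table names, column names, and rows are all hypothetical. It shows that a many-to-many relationship between students and courses needs an intermediate `enrollment` table, and that even a one-hop question requires two joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course (id INTEGER PRIMARY KEY, catalog TEXT);
    -- Intermediate table: one row per student-course relationship.
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES student(id),
        course_id INTEGER REFERENCES course(id),
        grade TEXT
    );
    INSERT INTO student VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO course VALUES (1, 'CIS550'), (2, 'EPID600');
    INSERT INTO enrollment VALUES (1, 1, 'A-'), (2, 1, 'B'), (1, 2, 'F');
""")

# "Who is in CIS550?" has to pass through the intermediate table.
rows = conn.execute("""
    SELECT student.name
    FROM student
    JOIN enrollment ON enrollment.student_id = student.id
    JOIN course ON course.id = enrollment.course_id
    WHERE course.catalog = 'CIS550'
    ORDER BY student.name
""").fetchall()
cis550_students = [name for (name,) in rows]
print(cis550_students)
```

Each additional relationship type would need its own intermediate table and its own joins, which is where the schema maintenance cost comes from.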

Relationships inherently form graphs

What's the best software for storing and querying hetnets?


GitHub stats from 2016-10-09

How do you teach a computer biology?

Visualizing Hetionet v1.0

  • Hetnet of biology for drug repurposing
  • ~50 thousand nodes
    11 types (labels)
  • ~2.25 million relationships
    24 types
  • integrates 29 public resources
    knowledge from millions of studies

Hetionet v1.0

  • Nodes
    • standardized vocabularies
    • stable, unambiguous identifiers
  • Relationships:
    • Omics scale required
    • Literature mining
    • High throughput experimental technologies
    • Avoid manual mapping
  • Versioned data dependencies with GitHub commit hash URLs
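
One way to version a data dependency with a commit hash URL is to build the download address from the full commit SHA rather than a branch name, so the file behind the URL cannot change. The sketch below assumes GitHub's `raw.githubusercontent.com` address layout. The owner, repository, commit, and path are placeholders, not actual Hetionet dependencies.

```python
def pinned_raw_url(owner, repo, commit, path):
    """Build a raw-file URL that is pinned to one specific commit."""
    if len(commit) != 40:
        raise ValueError("use the full 40-character commit hash")
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{commit}/{path}"


# Placeholder values for illustration only.
url = pinned_raw_url(
    owner="example-org",
    repo="example-data",
    commit="0123456789abcdef0123456789abcdef01234567",
    path="data/nodes.tsv",
)
print(url)
```

A URL that names a branch such as `master` would silently pick up later changes, while a commit-pinned URL always resolves to the same bytes.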

Constructing Hetionet v1.0

>>> import phd
  • Customized Docker image
  • Digital Ocean droplet
  • SSL from Let's Encrypt
  • readonly mode with a query execution timeout
  • Custom GRASS style
  • Custom guides

Public Hetionet Neo4j Instance

Details at

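MATCH path =
  // Specify the type of path to match
  // (pattern and variable names reconstructed; assumed from Hetionet's relationship naming)
  (n0:Disease)-[e1:ASSOCIATES_DaG]-(n1:Gene)-[:INTERACTS_GiG]-(n2:Gene)
  -[:PARTICIPATES_GpBP]-(n3:BiologicalProcess)
WHERE
  // Specify the source and target nodes
  n0.name = 'multiple sclerosis' AND
  n3.name = 'retina layer formation'
  // Require GWAS support for the
  // Disease-associates-Gene relationship
  AND 'GWAS Catalog' in e1.sources
  // Require the interacting gene to be
  // upregulated in a relevant tissue
  AND exists(
    (n0)-[:LOCALIZES_DlA]-(:Anatomy)-[:UPREGULATES_AuG]-(n2))
RETURN path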

How could multiple sclerosis affect retina layer formation?

More queries at

Project Rephetio contributions on Thinklab


Project Rephetio: drug repurposing predictions

  • Hetionet v1.0 contains:
    • 1,538 connected compounds
    • 136 connected diseases
    • 209,168 compound–disease pairs
    • 755 treatments
  • Systematic drug repurposing:
    • Compare the therapeutic utility of data types
    • Identify the mechanisms of drug efficacy
    • Predict the probability of treatment for all 209,168 compound–disease pairs
  • Project online at
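
The counts above fit together arithmetically: every connected compound is paired with every connected disease, and the known treatments are the positive subset of those pairs. A quick check, using only the numbers on the slide:

```python
compounds = 1538   # connected compounds
diseases = 136     # connected diseases
treatments = 755   # known treatments (positives)

pairs = compounds * diseases     # all compound-disease pairs
prevalence = treatments / pairs  # fraction of pairs that are treatments

print(pairs)
print(f"{prevalence:.2%}")
```

Fewer than one pair in two hundred is a known treatment, so the prediction task is heavily imbalanced.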

Systematic integration of biomedical knowledge prioritizes drugs for repurposing
Daniel S Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, Sergio E Baranzini
bioRxiv. 2016. DOI: 10.1101/087619

features = metapaths

observations = compound–disease pairs

positives = treatments

negatives =


Machine learning methodology

Predictions succeed at prioritizing known treatments

Project Rephetio: Does bupropion treat nicotine dependence?

  • Bupropion was first approved for depression in 1985
  • In 1997, bupropion was approved for smoking cessation
  • Can we predict this repurposing from Hetionet? The prediction was:





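MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-
  (n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
WHERE n0.name = 'Bupropion'
  AND n4.name = 'nicotine dependence'
  AND n1 <> n3
WITH [  // degree terms and path tail reconstructed; assumed, not verbatim
  size((n0)-[:BINDS_CbG]-()), size(()-[:BINDS_CbG]-(n1)),
  size((n1)-[:PARTICIPATES_GpPW]-()), size(()-[:PARTICIPATES_GpPW]-(n2)),
  size((n2)-[:PARTICIPATES_GpPW]-()), size(()-[:PARTICIPATES_GpPW]-(n3)),
  size((n3)-[:ASSOCIATES_DaG]-()), size(()-[:ASSOCIATES_DaG]-(n4))
] AS degrees, path
RETURN path,
  reduce(pdp = 1.0, d in degrees| pdp * d ^ -0.4) AS path_weight
ORDER BY path_weight DESC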

Cypher query to find the top CbGpPWpGaD paths
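
The `reduce(...)` expression in the query multiplies together each degree raised to the power -0.4, so paths that pass through highly connected nodes receive less weight. The same calculation in Python, as a small sketch with hypothetical degree lists:

```python
from functools import reduce

DAMPING = 0.4  # matches the exponent in the query's `d ^ -0.4`


def path_weight(degrees, damping=DAMPING):
    """Product of each degree raised to -damping, as in the Cypher reduce()."""
    return reduce(lambda pdp, d: pdp * d ** -damping, degrees, 1.0)


# Hypothetical degree lists for two paths of the same type.
specific_path = [2, 3, 5, 4, 4, 3, 2, 6]
hub_path = [2, 300, 500, 40, 40, 300, 200, 6]

print(path_weight(specific_path))
print(path_weight(hub_path))
```

Sorting by this weight in descending order puts the paths through more specific, lower-degree nodes first.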

Try at

Browse all predictions at Discuss at

Top 100 epilepsy predictions & their chemical structure

Top 100 epilepsy predictions & their drug targets

Nice of you to share this big network with everyone; however, I think you need to take care not to get yourself into legal trouble here. … 

I am not trying to cause trouble here — just the contrary. When making a meta-resource, licenses and copyright law are not something you can afford to ignore. I regularly leave out certain data sources from my resources for legal reasons.

One network to rule them all

We have completed an initial version of our network. …

Network existence (SHA256 checksum for graph.json.gz) is proven in Bitcoin block 369,898.
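
A SHA256 checksum like the one mentioned above can be computed with Python's standard `hashlib` module. This sketch covers only the hashing step, not the Bitcoin timestamping, and it runs on a throwaway temporary file rather than the real graph.json.gz.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a temporary stand-in file (contents are arbitrary).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hetnet")
    demo_path = tmp.name
checksum = sha256_of_file(demo_path)
os.remove(demo_path)
print(checksum)
```

Anyone holding the same file can recompute the digest and compare it with the published value to confirm the file is unchanged.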

Discussion DOIs: bfmk, bfmm, bfmn, bfmp

  • Hetionet (≤ v1.0) integrated data from 31 resources:
    • 5 United States Government works
    • 12 openly licensed
    • 4 non-commercial use only
    • 9 all rights reserved
    • 1 explicitly & contractually forbade reuse
  • Requested permission for 11 resources:
    • median time to first response was 16 days
    • 2 affirmative responses
  • Other considerations:
    • who owns data
    • incompatibilities: share alike vs non-commercial
    • copyright status of data & fair use
  • Solution: license attribute per node/relationship
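
With a license attribute on every relationship, a downstream user can keep only the parts of the network whose terms they can comply with. A minimal sketch of that idea follows. The relationships, license strings, and allow-list are hypothetical and are not taken from Hetionet.

```python
# Hypothetical relationships, each carrying its own license attribute.
relationships = [
    {"source": "Compound A", "type": "treats", "target": "Disease X", "license": "CC0 1.0"},
    {"source": "Compound A", "type": "binds", "target": "Gene G", "license": "CC BY-NC 3.0"},
    {"source": "Disease X", "type": "associates", "target": "Gene G", "license": "CC BY 4.0"},
]

# Hypothetical allow-list for a commercial user who cannot accept
# non-commercial terms.
ALLOWED = {"CC0 1.0", "CC BY 4.0"}


def filter_by_license(rels, allowed):
    """Keep only relationships whose license is in the allowed set."""
    return [r for r in rels if r.get("license") in allowed]


reusable = filter_by_license(relationships, ALLOWED)
print(len(reusable))
```

Tracking the license per relationship, rather than per network, means one restrictive source does not dictate the terms for everything else.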

Legal barriers to data reuse


  1. Release data under an open license
  2. University researchers: commit to open in your resource sharing plan

Advertisement: Cognoma Meetup with DataPhilly & Code for Philly

Next meetup:

June 27

By Daniel Himmelstein

Presentation at the 2017-06-06 DataPhilly meetup. Details at

This presentation is released under a CC BY 4.0 License.
