Aug 2013 MEETUP

DC/Baltimore Graph Database

6:30 - Pizza/Beer
7:00 - Announcements
7:05 - Freebase in Neo4j

Our idea

  • to look into using conceptual data in Neo4j
  • possible applications:
    • concept linking/NLP
    • analysis of concept data
    • merging/comparing multiple data sources
  • possible data sources: Freebase RDF dumps (more on that below)

RDF2Neo

  • The idea was simple: create a generic way to import RDF to Neo4j
  • Decided to try it with the Freebase dumps (~2B records, ~50M items; a 19GB gzipped file)... ok, maybe we should have picked something smaller!
  • Features we wanted:
    • ability to filter (we didn't really want all of Freebase)
    • fast (less than 24 hours, please?)
    • creating properties as properties on nodes, not just storing RDF triples in Neo

Concepts - RDF

RDF: Resource Description Framework
  • names resources and schema terms with URIs
  • is stored as triples of subjects, predicates, and objects
  • can store pretty much anything
  • can be queried with SPARQL

@prefix : <http://www.example.org/> .
:john    a           :Person .
:john    :hasMother  :susan .
:john    :hasFather  :richard .
:richard :hasBrother :luke .
:richard :height     "123cm" .

Concepts - Property Graph

  • Neo4j uses the property graph model
  • Nodes (vertices) and Relationships (edges) have properties
  • Rather than spreading the data out (fully normalized), each node holds direct pointers to its properties, its relationships, and through them its related nodes
  • Pragmatic, fast
  • Cypher!
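
Not from the talk's code, but as a concrete illustration: the Turtle example above, rebuilt as a property graph with the Neo4j 2.0 BatchInserter API that comes up later. The store path and relationship names are made up.

import org.neo4j.graphdb.{DynamicLabel, DynamicRelationshipType}
import org.neo4j.unsafe.batchinsert.BatchInserters
import scala.collection.JavaConverters._

object PropertyGraphSketch extends App {
  val inserter = BatchInserters.inserter("target/example.db")
  val person   = DynamicLabel.label("Person")
  val noProps  = Map.empty[String, AnyRef].asJava

  // nodes carry their properties directly...
  val john    = inserter.createNode(Map[String, AnyRef]("name" -> "john").asJava, person)
  val susan   = inserter.createNode(Map[String, AnyRef]("name" -> "susan").asJava, person)
  val richard = inserter.createNode(Map[String, AnyRef](
    "name" -> "richard", "height" -> "123cm").asJava, person)

  // ...and relationships connect nodes, rather than everything being a triple
  inserter.createRelationship(john, susan,   DynamicRelationshipType.withName("HAS_MOTHER"), noProps)
  inserter.createRelationship(john, richard, DynamicRelationshipType.withName("HAS_FATHER"), noProps)

  inserter.shutdown()
}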

First pass - How do we do this?

  • as we started looking at the data, it was clear we'd need a way to query what was sitting in this giant text file
  • let's put it in an intermediate database... what's fast at inserting? how about MongoDB! (unsharded; sketch after this list)
    • unoptimized: 15k inserts/second (raw triples)
    • batch insert (10 per): 25k inserts/second (raw triples)
    • disable journaling! (if we kill the server, who cares in this situation): 50k inserts/second (raw triples)
  • at this point, there were no indexes, but we got it to insert in less than 24 hours
  • indexing one field of ~2B records took another 40 hours!
  • there has to be a better way
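
The loader itself isn't in the slides; here's a minimal sketch of the approach, assuming tab-separated subject/predicate/object lines and the 2013-era MongoDB Java driver. The file name, field names, and batch size of 10 are illustrative.

import com.mongodb.{BasicDBObject, DBObject, MongoClient, WriteConcern}

object MongoTripleLoaderSketch extends App {
  val mongo = new MongoClient("localhost")
  val coll  = mongo.getDB("freebase").getCollection("triples")
  // fire-and-forget writes; journaling itself is disabled on the server side
  coll.setWriteConcern(WriteConcern.UNACKNOWLEDGED)

  val batch = new java.util.ArrayList[DBObject](10)
  for (line <- scala.io.Source.fromFile("freebase-dump.tsv").getLines()) {
    val f = line.split('\t')
    if (f.length >= 3) {
      batch.add(new BasicDBObject("s", f(0)).append("p", f(1)).append("o", f(2)))
      if (batch.size == 10) { coll.insert(batch); batch.clear() }
    }
  }
  if (!batch.isEmpty) coll.insert(batch)
  mongo.close()
}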

Second pass

  • let's just store the triples in Neo4j and query them--Neo4j is pretty fast for a single machine, right? (sketch after this list)
  • a transaction per triple: ~5k inserts/sec
  • bigger transactions: ~20k inserts/sec
  • batchinserter API: ~100k inserts/sec without optimization
  • so we beat MongoDB for single-machine triple storage! probably because there's no network interface in the way.
  • but do we actually need to store the data like this? why not massage it into a property graph as we read it, instead of going through this intermediate format?
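
The slides don't show how the raw triples were laid out in Neo4j; a minimal sketch, assuming the most literal shape (one node per triple, with s/p/o as properties) on the BatchInserter API. No transactions are involved, which is much of why it's fast.

import org.neo4j.unsafe.batchinsert.BatchInserters
import scala.collection.JavaConverters._

object RawTriplesSketch extends App {
  val inserter = BatchInserters.inserter("target/triples.db")
  for (line <- scala.io.Source.fromFile("freebase-dump.tsv").getLines()) {
    val f = line.split('\t')
    // one node per raw triple; the three fields live on the node itself
    if (f.length >= 3)
      inserter.createNode(Map[String, AnyRef]("s" -> f(0), "p" -> f(1), "o" -> f(2)).asJava)
  }
  inserter.shutdown() // single flush to disk at the end
}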

Third pass - reasoning about rdf2neo

  • we need a way to determine what a node is (for our export purposes)
  • we need a way to determine types of nodes (labels)
  • we need a way to determine whether a triple with a node subject is a property or a relationship to another node (this is where it became hard to be generic)
  • it would be nice to normalize the relationship types and labels for nodes between datasources (NLP--concept merging?)

Third pass design - two passes

  • first pass (nodes and labels)
    • gather up all the nodes we want to import
    • set labels for the nodes (Neo4j 2.0!)
  • a node is identified by a predicate filter; for Freebase this is "ns:type.type.instance"
    • the subject is the type, and the object is the machine id
    • we can then filter by type (sketch after this list)
      • we used a wildcard: "ns:chemistry" (startsWith)
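
A hedged sketch of that first pass, not the rdf2neo source: collect the machine ids whose type matches the startsWith filter. The tab-separated layout is an assumption.

import scala.collection.mutable
import scala.io.Source

object FirstPassSketch {
  val nodePredicate = "ns:type.type.instance"
  val typeFilter    = "ns:chemistry" // the startsWith wildcard

  /** machine id -> type (future label), for every node we want to import */
  def collectNodes(path: String): mutable.Map[String, String] = {
    val nodes = mutable.Map[String, String]()
    for (line <- Source.fromFile(path).getLines()) {
      val f = line.split('\t')
      // subject is the type, object is the machine id
      if (f.length >= 3 && f(1) == nodePredicate && f(0).startsWith(typeFilter))
        nodes(f(2)) = f(0)
    }
    nodes
  }
}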

Third pass design - two passes (continued)

  • second pass (rels and props)
    • a relationship is found when a subject is a machine id we have in our set, and an object is also a machine id we have in our set
      • if the object is a machine id not in our set, we drop it
    • a property is found when the subject is a machine id and the object is not a machine id (sketch after this list)
  • down to ~90 minutes for the chemistry import
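
A sketch of that decision rule, with hypothetical helpers standing in for the real BatchInserter calls; only the branching logic comes from the slides, and the machine-id test is a guess at Freebase's ns:m. ids.

object SecondPassSketch {
  def isMachineId(s: String): Boolean = s.startsWith("ns:m.") // assumption
  def createRel(from: Long, to: Long, relType: String): Unit =
    println(s"($from)-[:$relType]->($to)") // stand-in for inserter.createRelationship
  def setProperty(node: Long, key: String, value: String): Unit =
    println(s"($node).$key = $value")      // stand-in for inserter.setNodeProperty

  def handleTriple(s: String, p: String, o: String, nodes: Map[String, Long]): Unit =
    nodes.get(s) match {
      case Some(from) if nodes.contains(o) => createRel(from, nodes(o), p) // node -> node
      case Some(from) if !isMachineId(o)   => setProperty(from, p, o)      // literal value
      case Some(_)                         => () // machine id we filtered out: drop it
      case None                            => () // subject isn't one of our nodes
    }
}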

Surprises/Gotchas

  • Huge performance increase came from replacing Java's String.split() with a custom function that doesn't allocate an array (2-3x boost--can probably be even better; sketch after this list)
  • If you're filtering the nodes you want, you need at least two passes over the RDF because of ordering issues (or a smarter way to determine whether something is a node you want)
  • Bidirectional relationships all over the place in Freebase
  • Property keys (predicates) and values (objects) both have localization notations: "@en" and ".en"-style
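
A sketch of the general trick behind that split() rewrite (not the exact rdf2neo function): walk the line with indexOf and pull out only the field you need, so nothing allocates an Array per line.

object FastFieldSketch {
  /** Returns the nth tab-separated field of `line` (0-based), or null if missing. */
  def field(line: String, n: Int): String = {
    var start = 0
    var i = 0
    while (i < n) {           // skip past the first n tabs
      start = line.indexOf('\t', start)
      if (start < 0) return null
      start += 1
      i += 1
    }
    val end = line.indexOf('\t', start)
    if (end < 0) line.substring(start) else line.substring(start, end)
  }
}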

Continued work

  • make the rdf2neo code generic enough that it doesn't depend on certain aspects of Freebase (the machine id)
  • make it multithreaded (the BatchInserter API is not thread safe, but you can do some things in separate threads if you're careful)--at the very least we can read and write in separate threads (sketch after this list)
  • make it sync a database with a new RDF dump, without the need to start over
  • make it manage indexes automatically
  • code the ignore/exclude settings
  • make it more modular (maybe even a Scala BatchInserter wrapper so we don't have to use JavaConverters)
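
A minimal sketch of that read/write split, assuming a plain BlockingQueue and a poison-pill end marker; the BatchInserter stays on the single writer thread. Queue capacity and file name are made up.

import java.util.concurrent.ArrayBlockingQueue

object PipelineSketch extends App {
  val queue  = new ArrayBlockingQueue[String](10000)
  val poison = "\u0000EOF"

  val reader = new Thread(new Runnable {
    def run(): Unit = {
      for (line <- scala.io.Source.fromFile("freebase-dump.tsv").getLines())
        queue.put(line) // blocks when the writer falls behind
      queue.put(poison)
    }
  })
  reader.start()

  // writer loop: the only thread that would touch the BatchInserter
  var line = queue.take()
  while (line != poison) {
    // inserter.createNode(...) / createRelationship(...) would go here
    line = queue.take()
  }
  reader.join()
}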

It's on GitHub

  • https://github.com/wfreeman/rdf2neo
  • example settings file in: src/main/resources/rdf2neo.json.example
  • runnable with 'sbt run' (sbt is the Simple Build Tool for Scala)
  • eventually we'll have pre-built jars/startup scripts, in case non-Scala people are interested in building their own portion of Freebase (and maybe other RDF data sources) on Neo4j

By Wes Freeman