Aug 2013 MEETUP

DC/Baltimore Graph Database

6:30 - Pizza/Beer
7:00 - Announcements
7:05 - Freebase in Neo4j

Our idea

  • to look into using conceptual data in Neo4j
  • possible applications:
    • concept linking/NLP
    • analysis of concept data
    • merging/comparing multiple data sources
  • possible data sources:
    • Freebase (the one we ended up picking)

RDF2Neo

  • The idea was simple: create a generic way to import RDF into Neo4j
  • Decided to try it with Freebase dumps (~2B triples, ~50M topics), a 19GB gzipped file... ok, maybe we should have picked something smaller!
  • Features we wanted:
    • ability to filter (we didn't really want all of Freebase)
    • fast (less than 24 hours, please?)
    • storing literal values as real properties on nodes, not just dumping raw RDF triples into Neo4j

Concepts - RDF

RDF: Resource Description Framework
  • identifies resources and schema terms with URIs
  • is stored as triples of subject, predicate, and object
  • can store pretty much anything
  • queried with SPARQL

@prefix :     <http://www.example.org/> .
@prefix prop: <http://www.example.org/prop/> .
:john    a           :Person .
:john    :hasMother  :susan .
:john    :hasFather  :richard .
:richard :hasBrother :luke .
:richard prop:height "123cm" .

Concepts - Property Graph

  • Neo4j uses the property graph model
  • Nodes (vertices) and Relationships (edges) have properties
  • Rather than spreading data out across fully normalized tables, each node holds direct pointers to its properties, its relationships, and the related nodes (see the sketch after this list)
  • Pragmatic, fast
  • Cypher!
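
A minimal Scala sketch of the same family data from the RDF slide as a property graph, using Neo4j 2.0's BatchInserter API (the store path and relationship type names are made up for illustration):

import org.neo4j.graphdb.{DynamicLabel, DynamicRelationshipType}
import org.neo4j.unsafe.batchinsert.BatchInserters
import scala.collection.JavaConverters._

object PropertyGraphExample extends App {
  val inserter = BatchInserters.inserter("target/example-db") // hypothetical path
  val person   = DynamicLabel.label("Person")

  // each node carries its properties directly (no separate triple rows)
  val john    = inserter.createNode(Map[String, AnyRef]("name" -> "john").asJava, person)
  val susan   = inserter.createNode(Map[String, AnyRef]("name" -> "susan").asJava, person)
  val richard = inserter.createNode(Map[String, AnyRef]("name" -> "richard", "height" -> "123cm").asJava, person)

  // relationships are first-class edges between node ids (and could carry properties too)
  inserter.createRelationship(john, susan, DynamicRelationshipType.withName("HAS_MOTHER"), null)
  inserter.createRelationship(john, richard, DynamicRelationshipType.withName("HAS_FATHER"), null)

  inserter.shutdown()
}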

First pass - How do we do this?

  • as we started looking at the data, it was clear we'd need a way to query this giant text file
  • so, put it in an intermediate database... what's a fast-inserting database? how about MongoDB! (unsharded)
    • unoptimized: 15k inserts/second (raw triples)
    • batch insert (10 per call): 25k inserts/second (raw triples)
    • journaling disabled (if we kill the server, who cares in this situation): 50k inserts/second (raw triples) -- see the sketch after this list
  • at this point, there were no indexes, but we got it to insert in less than 24 hours
  • indexing one field of ~2B records took another 40 hours!
  • there has to be a better way
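
A minimal Scala sketch of this loading setup against the 2013-era MongoDB Java driver (database/collection names and the dummy triple source are placeholders; journaling is disabled server-side by starting mongod with --nojournal):

import com.mongodb.{BasicDBObject, DBObject, MongoClient}
import scala.collection.JavaConverters._

object MongoTripleLoad extends App {
  val mongo = new MongoClient("localhost")
  val coll  = mongo.getDB("freebase").getCollection("triples")

  // stand-in for the real N-Triples parser
  val triples = Iterator.tabulate(1000000)(i => (s"ns:m.$i", "ns:type.object.name", s"name $i"))

  // batched inserts (10 documents per call) are roughly what took us from ~15k to ~25k/sec
  triples.grouped(10).foreach { batch =>
    val docs: Seq[DBObject] = batch.map { case (s, p, o) =>
      new BasicDBObject("s", s).append("p", p).append("o", o)
    }
    coll.insert(docs.asJava)
  }
  mongo.close()
}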

Second pass

  • let's just store the triples in Neo4j and query them--Neo4j is pretty fast for a single machine, right?
  • a transaction per triple: ~5k inserts/sec
  • bigger transactions: ~20k inserts/sec (see the sketch after this list)
  • BatchInserter API: ~100k inserts/sec without optimization
  • so we beat MongoDB for a single machine storing triples! probably because there's no network interface in the way
  • but do we actually need to store the triples like this? why not massage them into a property graph as we read them, instead of using this intermediate format?
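
A sketch of the batched-transaction version (embedded Neo4j, dummy data; storing one node per triple is just one way to lay the triples out, and the batch size is arbitrary):

import org.neo4j.graphdb.factory.GraphDatabaseFactory

object BatchedTxLoad extends App {
  val db      = new GraphDatabaseFactory().newEmbeddedDatabase("target/triples-db")
  val triples = Iterator.tabulate(100000)(i => (s"ns:m.$i", "ns:p", s"o$i")) // dummy data

  // commit every 10k triples instead of every triple (~5k -> ~20k inserts/sec for us)
  triples.grouped(10000).foreach { batch =>
    val tx = db.beginTx()
    try {
      for ((s, p, o) <- batch) {
        val node = db.createNode()
        node.setProperty("subject", s)
        node.setProperty("predicate", p)
        node.setProperty("object", o)
      }
      tx.success()
    } finally {
      tx.close() // tx.finish() on pre-2.0 Neo4j
    }
  }
  db.shutdown()
}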

Third pass - reasoning about rdf2neo

  • we need a way to determine what a node is (for our export purposes)
  • we need a way to determine types of nodes (labels)
  • we need a way to determine whether a triple with a node subject is a property or a relationship to another node (this is where it became hard to be generic)
  • it would be nice to normalize relationship types and node labels across data sources (NLP--concept merging?)

Third pass design - two passes

  • first pass (nodes and labels)
    • gather up all the nodes we want to import
    • set labels for the nodes (Neo4j 2.0!)
  • a node is identified by a predicate filter; for Freebase this is "ns:type.type.instance"
    • the subject is the type, and the object is the machine id
    • we can then filter by type
      • we used a wildcard: "ns:chemistry" (startsWith) -- see the sketch after this list
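
A sketch of the pass-1 filter in plain Scala (the triple iterator stands in for streaming the gzipped dump):

import scala.collection.mutable

// collect machine ids (and their type labels) for types matching the filter
def collectNodes(triples: Iterator[(String, String, String)],
                 typeFilter: String = "ns:chemistry"): mutable.Map[String, mutable.Set[String]] = {
  val nodes = mutable.Map.empty[String, mutable.Set[String]]
  for ((subject, predicate, obj) <- triples)
    // subject is the type, object is the machine id
    if (predicate == "ns:type.type.instance" && subject.startsWith(typeFilter))
      nodes.getOrElseUpdate(obj, mutable.Set.empty[String]) += subject
  nodes
}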

Third pass design - two passes (cont.)

  • second pass (rels and props)
    • a relationship is found when the subject is a machine id in our set and the object is also a machine id in our set
      • if the object is a machine id not in our set, we drop it
    • a property is found when the subject is a machine id and the object is not a machine id (see the sketch after this list)
  • down to ~90 minutes for the chemistry import
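
A sketch of the pass-2 decision; isMachineId is a stand-in for however machine ids are actually recognized in the dump:

sealed trait Classified
case class Rel(from: String, relType: String, to: String) extends Classified
case class Prop(node: String, key: String, value: String) extends Classified

def isMachineId(s: String): Boolean = s.startsWith("ns:m.") // assumption about the id format

def classify(subject: String, predicate: String, obj: String,
             nodeIds: collection.Set[String]): Option[Classified] =
  if (!nodeIds.contains(subject)) None            // subject isn't a node we're importing
  else if (isMachineId(obj)) {
    if (nodeIds.contains(obj)) Some(Rel(subject, predicate, obj))
    else None                                     // machine id outside our set: drop it
  }
  else Some(Prop(subject, predicate, obj))        // non-machine-id object becomes a property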

Surprises/Gotchas

  • Huge performance increase from replacing Java's String.split() with a custom function that doesn't allocate an array (2-3x boost--can probably be even better); see the sketch after this list
  • If you're filtering the nodes you want, you need at least two passes over the RDF because of ordering issues (or a smarter way to determine whether something is a node you want)
  • Bidirectional relationships all over the place in Freebase
  • Property keys (predicates) and values (objects) both carry localization notations: "@en" and ".en"-style
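
The idea behind the custom split, sketched (assumes tab-separated dump lines; indexOf skips String.split's regex machinery and the intermediate Array):

// return the three fields of a line without regex or an intermediate Array
def splitTriple(line: String): (String, String, String) = {
  val t1  = line.indexOf('\t')
  val t2  = line.indexOf('\t', t1 + 1)
  val t3  = line.indexOf('\t', t2 + 1)
  val end = if (t3 >= 0) t3 else line.length // dump lines may end with a trailing "\t."
  (line.substring(0, t1), line.substring(t1 + 1, t2), line.substring(t2 + 1, end))
}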

Continued work

  • make the rdf2neo code generic enough that it doesn't depend on Freebase-specific details (like the machine id)
  • make it multithreaded (the BatchInserter API is not thread safe, but you can do some things in separate threads if you're careful)--at the very least we can read and write in separate threads (see the sketch after this list)
  • make it sync a database with a new RDF dump, without the need to start over
  • make it manage indexes automatically
  • implement the ignore/exclude settings
  • make it more modular (maybe even a Scala BatchInserter wrapper so we don't have to use JavaConverters)
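
A sketch of the read/write split with a blocking queue (the file name, queue size, and Poison marker are all made up; only the writer thread ever touches the BatchInserter):

import java.util.concurrent.ArrayBlockingQueue

object ThreadedLoad extends App {
  val queue  = new ArrayBlockingQueue[(String, String, String)](100000)
  val Poison = ("", "", "") // end-of-input marker

  // stand-in parser; see the no-array version in the gotchas sketch
  def splitTriple(line: String): (String, String, String) = {
    val a = line.split('\t'); (a(0), a(1), a(2))
  }

  val reader = new Thread(new Runnable {
    def run(): Unit = {
      for (line <- scala.io.Source.fromFile("freebase-dump.nt").getLines())
        queue.put(splitTriple(line))
      queue.put(Poison)
    }
  })

  val writer = new Thread(new Runnable {
    def run(): Unit = {
      var triple = queue.take()
      while (triple ne Poison) {
        // hand the triple to the single-threaded BatchInserter here
        triple = queue.take()
      }
    }
  })

  reader.start(); writer.start()
  reader.join(); writer.join()
}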

It's on GitHub

  • https://github.com/wfreeman/rdf2neo
  • example settings file in: src/main/resources/rdf2neo.json.example
  • runnable with 'sbt run' (sbt is the simple build tool for Scala)
  • eventually we'll have pre-built jars/startup scripts in case non-Scala people are interested in building their own portion of Freebase (and maybe other RDF data sources) on Neo4j

Aug 2013 MEETUP

By Wes Freeman