From documents to graphs

#BuildStuffLT @hannelita

Slides at http://bit.ly/2fYqONz

This talk is about MongoDB and Neo4j :)

Code

http://bit.ly/2g15MPW

Hi!

  • Computer Engineer
  • Programming
  • Electronics
  • Math <3 <3
  • Physics
  • Lego
  • Meetups
  • Animals
  • Coffee
  • GIFs
  • Pokémon

Disclaimer

Views are my own

Project from late 2015

Mostly for Neo4j 2.x

Project

https://github.com/neo4j-contrib/neo4j_doc_manager

Agenda

  • Quick note about document oriented databases
  • Graph databases can help your data model
  • Creating connectors for MongoDB
  • neo4j_doc_manager general architecture
  • Data mapping
  • Challenges

"We need to restructure our data"

"Relational databases are not enough"

"Polyglot Databases"

Document Oriented DB

  • Flexible data model
  • Easy to get started
  • Easy to represent the data

Store data as Documents!

Imagine that we have talks of a conference

Our Documents

Sometimes we need to get some extra information

Possible questions

  • Which talks cover a specific topic (e.g. 'Databases')?
  • Which speakers will also talk about this topic?
  • Which sessions will be held in the Auditorium and are about this topic?

These are common questions

More questions

  • Assuming that I do not want to change rooms, which room should I stay in to see the highest number of sessions on a specific topic?
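
The room question can be answered over plain documents, but it takes a full scan plus manual aggregation in application code. A minimal sketch, assuming talk documents with `topics` and `room` fields like the ones used in this talk; the sample data and the function name `best_room_for_topic` are hypothetical.

```python
from collections import Counter

# Hypothetical sample data; only the field names follow the talk documents.
talks = [
    {"session": {"title": "A"}, "topics": ["databases", "graphs"], "room": "Auditorium"},
    {"session": {"title": "B"}, "topics": ["databases"], "room": "Room 1"},
    {"session": {"title": "C"}, "topics": ["databases"], "room": "Auditorium"},
    {"session": {"title": "D"}, "topics": ["spring"], "room": "Room 1"},
]


def best_room_for_topic(docs, topic):
    """Return (room, session_count) for the busiest room on a topic, or None."""
    counts = Counter(d["room"] for d in docs if topic in d.get("topics", []))
    return counts.most_common(1)[0] if counts else None


result = best_room_for_topic(talks, "databases")
```

Every such question needs its own scan like this one, which is part of the motivation for moving the relationships into a graph.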

Further work

  • Recommendation system for the talks
  • Recommendation system for speakers
  • Build a tool to automatically build the sessions timetable based on topic distribution

Looks like we need some graphs!

Graphs are everywhere

TEAM, Neo4j

We can build graphs with information from Mongo

From Documents to Graphs

Neo4j super quick reference

  • Graph oriented database
  • Pure graph structure that you can persist
  • Benefits of graph theory
  • Large and active community
  • Neo Technology

Mongo Connector

https://github.com/10gen-labs/mongo-connector

Mongo Connector (MC)

  • You call Mongo Connector
  • You point it to where your Mongo is
  • You point it to where the other database is
  • MC reaches the other database (Elasticsearch, Solr) through a Doc Manager (DM)
  • MC creates a thread to watch Mongo actions (replica set)
  • MC calls actions on a Doc Manager

We can translate these actions into a Graph Structure

Neo4j Doc Manager

mongo-connector (pip)

py2neo (neo4j)

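from mongo_connector.constants import DEFAULT_COMMIT_INTERVAL, DEFAULT_MAX_BULK
from mongo_connector.doc_managers.doc_manager_base import DocManagerBase


class DocManager(DocManagerBase):
    # Interface only: method bodies are omitted here.

    def __init__(self, url, auto_commit_interval=DEFAULT_COMMIT_INTERVAL,
                 unique_key='_id', chunk_size=DEFAULT_MAX_BULK, **kwargs):
        ...  # connect to the target database at `url`

    def upsert(self, doc, namespace, timestamp):
        ...  # insert or replace one document

    def bulk_upsert(self, docs, namespace, timestamp):
        ...  # insert or replace many documents

    def update(self, document_id, update_spec, namespace, timestamp):
        ...  # apply a MongoDB update spec to an existing document

    def remove(self, document_id, namespace, timestamp):
        ...  # delete a document

    def search(self, start_ts, end_ts):
        ...  # return documents modified between two timestamps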

We can retrieve Mongo commands with this interface class
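
To make the interface concrete, here is a minimal sketch of a doc manager that keeps documents in a Python dict instead of writing to Neo4j. It is not the neo4j_doc_manager implementation: the class name `InMemoryDocManager` is hypothetical, it deliberately does not inherit from `DocManagerBase` so that it runs without mongo-connector installed, and `update` handles only top-level `$set` and `$unset` keys.

```python
class InMemoryDocManager:
    """Hypothetical stand-in for a doc manager; stores documents in a dict."""

    def __init__(self, unique_key="_id"):
        self.unique_key = unique_key
        self.docs = {}

    def upsert(self, doc, namespace, timestamp):
        # Insert or overwrite the document, remembering namespace and timestamp.
        key = doc[self.unique_key]
        self.docs[key] = {"doc": dict(doc), "ns": namespace, "ts": timestamp}

    def bulk_upsert(self, docs, namespace, timestamp):
        for doc in docs:
            self.upsert(doc, namespace, timestamp)

    def update(self, document_id, update_spec, namespace, timestamp):
        # Apply only the top-level $set / $unset parts of an update spec.
        doc = self.docs[document_id]["doc"]
        for field, value in update_spec.get("$set", {}).items():
            doc[field] = value
        for field in update_spec.get("$unset", {}):
            doc.pop(field, None)
        self.docs[document_id].update(ns=namespace, ts=timestamp)

    def remove(self, document_id, namespace, timestamp):
        self.docs.pop(document_id, None)

    def search(self, start_ts, end_ts):
        # Yield documents last modified within [start_ts, end_ts].
        for entry in self.docs.values():
            if start_ts <= entry["ts"] <= end_ts:
                yield entry["doc"]


dm = InMemoryDocManager()
dm.upsert({"_id": 1, "room": "Auditorium"}, "test.talks", 10)
dm.update(1, {"$set": {"room": "Room 1"}}, "test.talks", 20)
found = list(dm.search(15, 25))
```

A graph-backed doc manager would follow the same shape, translating each call into node and relationship changes rather than dict updates.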

We support Python 2 and Python 3

It will run like an auto importer. You just need to provide the database endpoints

We track the auto generated nodes with the label :Document

Sync Mongo with Neo4j

db.talks.insert(  { "session": ...

Document:talks

Root node in Neo4j

{
  "session": {
    "title": "12 Years of Spring: An Open Source Journey"
  },
  "topics":  ["keynote", "spring"],
  "room": "Auditorium",
  "speaker": {
    "name": "Juergen Hoeller"
  }
}

Document:session

Document:speaker

JSON properties become node properties

All the nodes are connected to the root node

Nested documents

"session" : {
    "title" : "12 Years of Spring: An Open Source Journey",
    "abstract" : "Spring emerged as a core open source project in early 2003 and evolved to a broad portfolio of open source projects up until 2015.",
    "conference" : {
      "city" : "London"
    }
  }

Nested documents

"session" : {
    "title" : "12 Years of Spring: An Open Source Journey",
    "abstract" : "Spring emerged as a core open source project in early 2003 and evolved to a broad portfolio of open source projects up until 2015.",
    "conference" : {
      "city" : "Dublin"
    }
  }

Nested documents

Document:session (parent node)

Document:conference (child node)

JSON array

"session" : { 
  "tracks": [{ "main":"Python" },
            { "second":"Data" }]
... }

JSON array

Document:session

Document:track0

talks_track0

talks_track1

Document:track1
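
The mapping rules so far (JSON properties become node properties, nested documents and array elements become child nodes) can be sketched in plain Python. This is an illustration, not the project's code: the function name `document_to_graph`, the `<parent>_<child>` relationship naming, and the `<key><index>` naming for array elements are assumptions loosely modeled on the labels shown on the slides. The sketch links each child only to its immediate parent.

```python
def document_to_graph(root_type, doc):
    """Flatten a document into (nodes, relationships).

    nodes: list of (label, properties)
    relationships: list of (parent_label, relationship_name, child_label)
    """
    nodes, rels = [], []

    def visit(label, value):
        props = {}
        for key, item in value.items():
            if isinstance(item, dict):
                # Nested document: becomes a child node linked to its parent.
                rels.append((label, f"{label}_{key}", key))
                visit(key, item)
            elif isinstance(item, list) and item and all(isinstance(x, dict) for x in item):
                # Array of documents: one child node per element.
                for i, element in enumerate(item):
                    child = f"{key}{i}"
                    rels.append((label, f"{label}_{child}", child))
                    visit(child, element)
            else:
                # Scalars and arrays of scalars stay as node properties.
                props[key] = item
        nodes.append((label, props))

    visit(root_type, doc)
    return nodes, rels


talk = {
    "session": {
        "title": "12 Years of Spring: An Open Source Journey",
        "conference": {"city": "London"},
    },
    "topics": ["keynote", "spring"],
    "room": "Auditorium",
    "speaker": {"name": "Juergen Hoeller"},
}
nodes, rels = document_to_graph("talks", talk)
```

Here `nodes` holds one entry each for `talks`, `session`, `conference` and `speaker`, and `rels` records which node hangs off which parent.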

We also support explicit ids to create a relationship

Explicit ids

{
  "name": "Hanneli",
  "account_id": "32434ab2341192",
  "url": "medium.com/@hannelita"
}

session_account

Document:session

Document:account
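
One way to picture the explicit-id rule is a helper that treats any key ending in `_id` (other than Mongo's own `_id`) as a reference to another node. This is a sketch under assumptions: the function name `explicit_id_links`, the suffix convention, and the decision to drop the id from the node's properties are all hypothetical, and the real doc manager may behave differently.

```python
def explicit_id_links(node_label, doc, suffix="_id"):
    """Split a document into plain properties and explicit-id links.

    Each link is (relationship_name, target_label, target_id).
    """
    props, links = {}, []
    for key, value in doc.items():
        if key != "_id" and key.endswith(suffix):
            target = key[: -len(suffix)]
            links.append((f"{node_label}_{target}", target, value))
        else:
            props[key] = value
    return props, links


doc = {
    "name": "Hanneli",
    "account_id": "32434ab2341192",
    "url": "medium.com/@hannelita",
}
props, links = explicit_id_links("session", doc)
```

With the document from the slide, `links` contains a single `session_account` relationship pointing at the `account` node with that id.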

We also support a configuration file if you don't want to import all your data

We can specify the namespaces that we want to import:

"include": ["test.talks", "docs.info"] (config.json file)

It is also possible to specify the fields and collections via command line:

mongo-connector -m localhost:27017 -t http://localhost:7474/db/data -d neo4j_doc_manager -i room,timeslot,title

 

1. Data model is a challenge.

Different representations (Documents -> Graphs)

2. Avoiding orphan nodes

remove, set and unset commands can generate orphans
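
To see how an orphan appears, consider a small in-memory graph where an unset removes a middle node. The sketch below finds every node that can no longer be reached from a root; the function name `find_orphans` and the tuple-based graph representation are hypothetical and independent of how the doc manager talks to Neo4j.

```python
def find_orphans(nodes, relationships, roots):
    """Return the nodes that cannot be reached from any root node."""
    children = {}
    for parent, child in relationships:
        children.setdefault(parent, []).append(child)

    seen = set()
    stack = [r for r in roots if r in nodes]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(c for c in children.get(node, []) if c in nodes)
    return set(nodes) - seen


nodes = {"talks", "session", "conference", "speaker"}
rels = [("talks", "session"), ("session", "conference"), ("talks", "speaker")]

# Simulate an unset of "session" that forgets to clean up its children.
nodes.discard("session")
rels = [(p, c) for p, c in rels if "session" not in (p, c)]

orphans = find_orphans(nodes, rels, roots=["talks"])
```

After the unset, `conference` is still stored but nothing points to it, so it shows up in `orphans` and would need to be deleted or re-linked.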

3. Batching - maximum of 10k per batch
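
A batch limit like this is usually handled by chunking the incoming operations before sending them. A minimal sketch, assuming the 10k figure from the slide; the helper name `batches` and the constant `MAX_BATCH` are hypothetical.

```python
from itertools import islice

MAX_BATCH = 10_000  # limit quoted on the slide


def batches(items, size=MAX_BATCH):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


sizes = [len(chunk) for chunk in batches(range(25_000))]
```

Because it pulls from an iterator, the helper never holds more than one batch in memory at a time.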

Projects

mongo-connector:

https://github.com/10gen-labs/mongo-connector

neo4j-doc-manager: 

https://github.com/neo4j-contrib/neo4j_doc_manager

 

Next Projects

Neo4j Cassandra connector :) 

https://github.com/neo4j-contrib/neo4j-cassandra-connector

Lessons learned

  • Polyglot persistence is great; be responsible!
  • Graphs can be very useful for simplifying queries
  • Real applications: fraud detection
  • University (UK) is using it :)

Thank you :)

Questions?

 

hannelita@gmail.com

@hannelita

From Documents to Graphs - Buildstuff.lt

By Hanneli Tavante (hannelita)
