From documents to graphs

#BuildStuffUA @hannelita

Slides at http://bit.ly/2gxklxp

 

This talk is about MongoDB and Neo4j :)

Code

http://bit.ly/2g15MPW

If you are bored, check my Assembly talk (BuildStuff LT) - lots of GIFs :)

 

http://bit.ly/2gxoIIF

Hi!

  • Computer Engineer
  • Programming
  • Electronics
  • Math <3 <3
  • Physics
  • Lego
  • Meetups
  • Animals
  • Coffee
  • GIFs
  • Pokémon

Disclaimer

Views are my own

Project from late 2015

Mostly for Neo4j 2.x

Project

https://github.com/neo4j-contrib/neo4j_doc_manager

Agenda

 

  • Quick note about document oriented databases
  • Graph databases can help your data model
  • Creating connectors for MongoDB
  • neo4j_doc_manager general architecture
  • Data mapping
  • Challenges

"We need to restructure our data"

"Relational databases are not enough"

"Polyglot Databases"

"In 2006, Neal Ford coined the term polyglot programming, to express the idea that applications should be written in a mix of languages to take advantage of the fact that different languages are suitable for tackling different problems. Complex applications combine different types of problems, so picking the right language for each job may be more productive than trying to fit all aspects into a single language."

https://en.wikipedia.org/wiki/Polyglot_persistence

"Polyglot Databases"

Document Oriented DB

  • Flexible data model
  • Easy to get started
  • Easy to represent the data

Store data as Documents!

Imagine that we have the talks of a conference

Our Documents

Sometimes we need to get some extra information

Possible questions

  • Which talks have a specific topic (e.g. 'Databases')?
  • Which speakers will also talk about this topic?
  • Which sessions will be held in the Auditorium and are about this topic?

These are common questions

More questions

  • Assuming that I do not want to change rooms, which room should I stay in to catch the highest number of sessions on a specific topic?
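As a rough illustration of the room question, here is a minimal sketch in plain Python over in-memory documents. The sample talks and the helper name best_room_for_topic are made up for this example; only the document shape follows the "session" documents used in this talk.

```python
from collections import Counter

# Hypothetical sample documents, shaped like the "session" documents in this talk.
talks = [
    {"session": {"title": "Graphs 101", "topics": ["databases", "graphs"], "room": "Auditorium"}},
    {"session": {"title": "Scaling Mongo", "topics": ["databases"], "room": "Auditorium"}},
    {"session": {"title": "Spring keynote", "topics": ["keynote", "spring"], "room": "Room A"}},
    {"session": {"title": "Cassandra tips", "topics": ["databases"], "room": "Room A"}},
]


def best_room_for_topic(docs, topic):
    """Return (room, session_count) for the room with most sessions on `topic`, or None."""
    counts = Counter(
        doc["session"]["room"]
        for doc in docs
        if topic in doc["session"].get("topics", [])
    )
    return counts.most_common(1)[0] if counts else None


print(best_room_for_topic(talks, "databases"))
```

Every new question of this kind needs another hand-written scan like this one, which is part of why a graph model becomes attractive.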

Further work

  • Recommendation system for the talks
  • Recommendation system for speakers
  • Build a tool to automatically build the sessions timetable based on topic distribution

It might not be intuitive how to build some of these structures with MongoDB

Looks like we need some graphs!

Graphs are everywhere

TEAM, Neo4j

We can build graphs with information from Mongo

This session is about doing it automatically :)

From Documents to Graphs

Neo4j super quick reference

  • Graph oriented database
  • Pure graph structure that you can persist
  • Benefits of graph theory
  • Large and active community
  • Neo Technology

MongoDB provides an interface to send data to other databases

Mongo Connector

https://github.com/10gen-labs/mongo-connector

Mongo Connector

  • You call Mongo Connector (MC) with a replica set
  • You point it to where your Mongo is
  • You point it to where the other database is, through a Doc Manager (DM), e.g. for Elasticsearch or Solr

Mongo Replica watches Mongo Actions

Call actions on a Doc Manager (custom interface for the Mongo Connector)

We can translate these actions into a Graph Structure

Neo4j Doc Manager

mongo-connector (pip)

py2neo (neo4j)

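from mongo_connector.constants import DEFAULT_COMMIT_INTERVAL, DEFAULT_MAX_BULK
from mongo_connector.doc_managers.doc_manager_base import DocManagerBase


class DocManager(DocManagerBase):
    # Mongo Connector calls these methods for each MongoDB operation (bodies omitted).

    def __init__(self, url, auto_commit_interval=DEFAULT_COMMIT_INTERVAL,
                 unique_key='_id', chunk_size=DEFAULT_MAX_BULK, **kwargs):
        ...

    def upsert(self, doc, namespace, timestamp):
        ...

    def bulk_upsert(self, docs, namespace, timestamp):
        ...

    def update(self, document_id, update_spec, namespace, timestamp):
        ...

    def remove(self, document_id, namespace, timestamp):
        ...

    def search(self, start_ts, end_ts):
        ...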

We can retrieve Mongo commands with this interface class
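To make the interface more concrete, here is a minimal sketch of a stand-in class with the same method names. RecordingDocManager is hypothetical: it does not inherit from DocManagerBase and does not talk to Neo4j, it only records the calls it receives in memory, and it handles only the $set and $unset update operators.

```python
class RecordingDocManager:
    """Dependency-free stand-in that mirrors the doc manager method names."""

    def __init__(self, unique_key="_id"):
        self.unique_key = unique_key
        self.docs = {}   # (namespace, document id) -> stored document
        self.log = []    # ordered (action, namespace, document id) entries

    def upsert(self, doc, namespace, timestamp):
        doc_id = doc[self.unique_key]
        self.docs[(namespace, doc_id)] = dict(doc, _ts=timestamp)
        self.log.append(("upsert", namespace, doc_id))

    def bulk_upsert(self, docs, namespace, timestamp):
        for doc in docs:
            self.upsert(doc, namespace, timestamp)

    def update(self, document_id, update_spec, namespace, timestamp):
        # Only $set and $unset are handled in this sketch.
        doc = self.docs[(namespace, document_id)]
        doc.update(update_spec.get("$set", {}))
        for field in update_spec.get("$unset", {}):
            doc.pop(field, None)
        doc["_ts"] = timestamp
        self.log.append(("update", namespace, document_id))

    def remove(self, document_id, namespace, timestamp):
        self.docs.pop((namespace, document_id), None)
        self.log.append(("remove", namespace, document_id))


dm = RecordingDocManager()
dm.upsert({"_id": 1, "venue": "Olympia Stadium"}, "test.talks", 100)
dm.update(1, {"$set": {"venue": "Main Hall"}}, "test.talks", 101)
venue_after_update = dm.docs[("test.talks", 1)]["venue"]
dm.remove(1, "test.talks", 102)
print(dm.log)
```

A real doc manager would translate each of these calls into writes against the target database instead of appending to a list.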

We support Python 2 and Python 3

It will run like an auto importer. You just need to provide the database endpoints

We track the auto generated nodes with the label :Document

How does it work? 

When you start Neo4j Doc Manager, a first import will happen from MongoDB to Neo4j

After that, insertions, updates and removals in MongoDB will also have an effect on Neo4j.

Sync Mongo with Neo4j

db.talks.insert(  { "session": ...

Document:talks

Root node in Neo4j

{
  "session": {
    "title": "12 Years of Spring: An Open Source Journey",
    "topics": ["keynote", "spring"],
    "room": "Auditorium",
    "speaker": {
      "name": "Juergen Hoeller"
    }
  },
  "venue": "Olympia Stadium"
}

Keys whose values are JSON objects become nodes

{
  "session": {
    "title": "12 Years of Spring: An Open Source Journey",
...
}

Document:session

They get a composite label - :Document + key

The JSON value of that key is translated into node properties

{
  "session": {
    "title": "12 Years of Spring: An Open Source Journey",
    "topics": ["keynote", "spring"],
    "room": "Auditorium"
  }
}

Document:session

title: "12 Years of Spring: An Open Source Journey"
topics:  ["keynote", "spring"]
room: "Auditorium"

The node also gets Mongo Object properties (id and timestamp)

Document:session

title: "12 Years of Spring: An Open Source Journey"
topics:  ["keynote", "spring"]
room: "Auditorium"
_id: 324553ab342c324d7ff
_ts: 621002135233213

All the top level generated nodes will be connected to the root node

Document:session

Document:talks

The relationship name is a concatenation of the keys:

Document:session

Document:talks

talks_session

Top level properties go to the root node:

{
  "session": {
    ...
  },
  "venue": "Olympia Stadium"
}

Document:talks

venue: "Olympia Stadium"
_id: 324553ab342c324d7ff
_ts: 621002135233213

Another example

{
  "session": {
    ...
  },
  "venue": {
    "address": "...",
    "city": "Kiev"
  }
}

How do you transform that into a graph structure, according to Neo4j Doc Manager?

Document:session

Document:talks

talks_session

Document:venue

talks_venue

Nested documents

"session" : {
  "title" : "12 Years of Spring: An Open Source Journey",
  "abstract" : "History of Spring Framework",
  "speaker" : {
    "name" : "Josh Long",
    "company" : "A company"
  }
}

We will keep the node chain:

Document:session (parent node)

session_speaker

Document:speaker (child node)

Don't forget the root node

Document:session

Document:speaker

Document:talks
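The mapping rules so far (one root node per collection, nested JSON objects as their own nodes, relationship names built as parent_child, Mongo id and timestamp copied to each node) can be sketched in plain Python. This is only an illustration under those assumptions: document_to_graph is a hypothetical helper, not the code used by Neo4j Doc Manager, it returns plain tuples instead of writing to Neo4j, and it does not handle arrays of objects.

```python
def document_to_graph(doc, root_key, timestamp=None):
    """Return (nodes, rels) for one MongoDB document.

    nodes: list of (label, properties) pairs
    rels:  list of (parent_label, relationship_name, child_label) triples
    """
    meta = {"_id": doc.get("_id"), "_ts": timestamp}
    nodes, rels = [], []

    def visit(key, value):
        label = "Document:" + key
        props = dict(meta)  # every node carries the Mongo id and timestamp
        nodes.append((label, props))
        for child_key, child_value in value.items():
            if isinstance(child_value, dict):
                # A nested JSON object becomes its own node, linked as parent_child.
                child_label = visit(child_key, child_value)
                rels.append((label, key + "_" + child_key, child_label))
            else:
                # Scalars and plain lists stay on the current node as properties.
                props[child_key] = child_value
        return label

    visit(root_key, doc)
    return nodes, rels


talk = {
    "_id": "324553ab342c324d7ff",
    "session": {
        "title": "12 Years of Spring: An Open Source Journey",
        "topics": ["keynote", "spring"],
        "room": "Auditorium",
        "speaker": {"name": "Juergen Hoeller"},
    },
    "venue": "Olympia Stadium",
}

nodes, rels = document_to_graph(talk, "talks", timestamp=621002135233213)
for rel in rels:
    print(rel)
```

The collection name ("talks") is passed in explicitly because it is not part of the document itself.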

JSON array

"session" : { 
  "tracks": [{ "main":"Python" },
            { "second":"Data" }]
... }

Document:session

Document:track0

talks_track0

talks_track1

Document:track1

We also support explicit ids to create a relationship

Explicit ids

"user": {
  "name": "Hanneli",
  "account_id": "32434ab2341192",
  "url": "medium.com/@hannelita"
}

"account" : {
  "number": "326708",
  "id": "32434ab2341192"
}

user_account

Document:user

Document:account
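One way to picture the explicit id rule is the sketch below. It assumes that a field named <name>_id refers to the "id" of a document stored under the key <name>; the slides do not show how the project actually resolves these references, so explicit_id_relationships is a hypothetical helper for illustration only.

```python
def explicit_id_relationships(doc_key, doc, others):
    """Return (from_label, relationship_name, to_label) triples.

    A field called "<name>_id" in `doc` produces a relationship when `others`
    holds a document under "<name>" whose "id" equals the field value.
    """
    rels = []
    for field, value in doc.items():
        if field == "_id" or not field.endswith("_id"):
            continue  # skip Mongo's own _id and ordinary fields
        target_key = field[: -len("_id")]
        target = others.get(target_key)
        if target is not None and target.get("id") == value:
            rels.append(
                ("Document:" + doc_key, doc_key + "_" + target_key, "Document:" + target_key)
            )
    return rels


user = {
    "name": "Hanneli",
    "account_id": "32434ab2341192",
    "url": "medium.com/@hannelita",
}
account = {"number": "326708", "id": "32434ab2341192"}

print(explicit_id_relationships("user", user, {"account": account}))
```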

We also support a configuration file if you don't want to import all your data

We can specify the namespaces that we want to import:

"include": ["test.talks", "docs.info"]
(config.json file)
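The effect of such an include list can be pictured with a small sketch. It mirrors only the "include" key shown on the slide, not the full configuration file format, and should_import is a hypothetical helper rather than part of Mongo Connector.

```python
import json

# Only the fragment shown on the slide, wrapped in braces to make it valid JSON.
config = json.loads('{"include": ["test.talks", "docs.info"]}')


def should_import(namespace, include):
    """True when there is no include list, or when the namespace is on it."""
    return not include or namespace in include


for namespace in ["test.talks", "test.users", "docs.info"]:
    print(namespace, should_import(namespace, config["include"]))
```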

It is also possible to specify the fields and collections via command line:

mongo-connector -m localhost:27017 -t http://localhost:7474/db/data -d neo4j_doc_manager -i room,timeslot,title
 

1. Data model is a challenge.

Different representations (Documents -> Graphs)

2. Avoiding orphan nodes

remove, set and unset commands can generate orphans
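To see how an orphan appears, consider a root node that loses its link to a nested node after an unset. The sketch below finds nodes that can no longer be reached from the root. The node names and the orphan_nodes helper are hypothetical and use plain Python structures instead of a real Neo4j graph.

```python
from collections import deque


def orphan_nodes(nodes, rels, root):
    """Return the nodes that cannot be reached from `root` via (parent, child) pairs."""
    children = {}
    for parent, child in rels:
        children.setdefault(parent, []).append(child)

    seen, queue = {root}, deque([root])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(set(nodes) - seen)


nodes = ["talks", "session", "speaker"]

# Before: talks -> session -> speaker, so nothing is orphaned.
print(orphan_nodes(nodes, [("talks", "session"), ("session", "speaker")], "talks"))

# After an unset of "session" that only drops the talks -> session link,
# both the session and the speaker nodes are left dangling.
print(orphan_nodes(nodes, [("session", "speaker")], "talks"))
```

Cleaning up therefore means removing the whole subtree below the unset key, not just the relationship.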

3. Batching - maximum of 10k per batch
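A minimal sketch of splitting a stream of operations into batches of at most 10k items. The batches helper and the MAX_BATCH name are made up for this example; only the 10k limit comes from the slide.

```python
from itertools import islice

MAX_BATCH = 10000  # limit quoted on the slide


def batches(items, size=MAX_BATCH):
    """Yield lists of at most `size` items until `items` is exhausted."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


sizes = [len(batch) for batch in batches(range(25000))]
print(sizes)
```

Because it works on any iterable, the same helper also fits a cursor that streams documents during the first import.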

Projects

mongo-connector:

https://github.com/10gen-labs/mongo-connector

neo4j-doc-manager: 

https://github.com/neo4j-contrib/neo4j_doc_manager

 

Next Projects

Neo4j Cassandra connector :) 

https://github.com/neo4j-contrib/neo4j-cassandra-connector

Lessons learned

  • Polyglot persistence is great; be responsible!
  • Graphs can be very useful for simplifying queries
  • Real applications: fraud detection
  • University (UK) is using it :)

Thank you :)

Questions?

 

hannelita@gmail.com

@hannelita

From Documents to Graphs - Buildstuff.ua

By Hanneli Tavante (hannelita)
