From documents to graphs
#BuildStuffUA @hannelita
Slides at http://bit.ly/2gxklxp
This talk is about MongoDB and Neo4j :)
Code
http://bit.ly/2g15MPW
If you are bored, check my Assembly talk (BuildStuff LT) - lots of GIFs :)
http://bit.ly/2gxoIIF
Hi!
- Computer Engineer
- Programming
- Electronics
- Math <3 <3
- Physics
- Lego
- Meetups
- Animals
- Coffee
- GIFs
- Pokémon
Disclaimer
Views are my own
Project from late 2015
Mostly for Neo4j 2.x
Project
https://github.com/neo4j-contrib/neo4j_doc_manager
Agenda
- Quick note about document oriented databases
- Graph databases can help your data model
- Creating connectors for MongoDB
- neo4j_doc_manager general architecture
- Data mapping
- Challenges
"We need to restructure our data"
"Relational databases are not enough"
"Polyglot Databases"
"In 2006, Neal Ford coined the term
polyglot programming, to express the
idea that applications should be written in
a mix of languages to take advantage of
the fact that different languages are
suitable for tackling different problems.
Complex applications combine different
types of problems, so picking the right
language for each job may be more
productive than trying to fit all aspects into
a single language."
https://en.wikipedia.org/wiki/Polyglot_persistence
Document Oriented DB
- Flexible data model
- Easy to get started
- Easy to represent the data
Store data as Documents!
Imagine that we have talks of a conference
Our Documents
Sometimes we need to get some extra information
Possible questions
- Which talks cover a specific topic (e.g. 'Databases')?
- Which speakers will also talk about this topic?
- Which sessions will be held in the Auditorium and are about this topic?
These are common questions
More questions
- Assuming that I do not want to change rooms, which room should I stay in to see the highest number of sessions on a specific topic?
Further work
- Recommendation system for the talks
- Recommendation system for speakers
- Build a tool to automatically build the sessions timetable based on topic distribution
It might not be intuitive how to build some of these structures with MongoDB
Looks like we need some graphs!
Graphs are everywhere
TEAM, Neo4j
We can build graphs with information from Mongo
This session is about doing it automatically :)
From Documents to Graphs
Neo4j super quick reference
- Graph oriented database
- Pure graph structure that you can persist
- Benefits of graph theory
- Large and active community
- Neo Technology
MongoDB provides an interface to send data to other databases
Mongo Connector
https://github.com/10gen-labs/mongo-connector
Mongo Connector
- You call Mongo Connector (MC) with a replica set
- You point it to where your Mongo is
- You point it to where the other database is, through a Doc Manager (DM), e.g. Elasticsearch or Solr
- The Mongo replica watches Mongo actions
- MC calls actions on a Doc Manager (custom interface for the Mongo Connector)
We can translate these actions into a graph structure
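A minimal sketch of that flow, assuming oplog-style entries with the standard MongoDB fields (op, ns, ts, o, o2). RecordingDocManager and dispatch are hypothetical names used only for illustration; the real Mongo Connector does considerably more (threads, checkpoints, rollbacks).

```python
class RecordingDocManager:
    """Hypothetical stand-in for a Doc Manager: it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def upsert(self, doc, namespace, timestamp):
        self.calls.append(("upsert", namespace, doc["_id"]))

    def update(self, document_id, update_spec, namespace, timestamp):
        self.calls.append(("update", namespace, document_id))

    def remove(self, document_id, namespace, timestamp):
        self.calls.append(("remove", namespace, document_id))


def dispatch(entry, manager):
    """Route one oplog-style entry to the matching Doc Manager method."""
    op, ns, ts = entry["op"], entry["ns"], entry["ts"]
    if op == "i":    # insert
        manager.upsert(entry["o"], ns, ts)
    elif op == "u":  # update: "o2" identifies the document, "o" holds the change
        manager.update(entry["o2"]["_id"], entry["o"], ns, ts)
    elif op == "d":  # delete
        manager.remove(entry["o"]["_id"], ns, ts)
    # Other operation types are ignored in this sketch.


# Made-up entries for the "talks" collection of a "test" database.
oplog = [
    {"op": "i", "ns": "test.talks", "ts": 1, "o": {"_id": 1, "venue": "Olympia Stadium"}},
    {"op": "u", "ns": "test.talks", "ts": 2, "o2": {"_id": 1}, "o": {"$set": {"venue": "Kiev"}}},
    {"op": "d", "ns": "test.talks", "ts": 3, "o": {"_id": 1}},
]
manager = RecordingDocManager()
for entry in oplog:
    dispatch(entry, manager)
print(manager.calls)
```

A graph Doc Manager would replace the recording calls with writes to Neo4j.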
Neo4j Doc Manager
mongo-connector (pip)
py2neo (neo4j)
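from mongo_connector.constants import DEFAULT_COMMIT_INTERVAL, DEFAULT_MAX_BULK
from mongo_connector.doc_managers.doc_manager_base import DocManagerBase

class DocManager(DocManagerBase):
    # Method bodies are omitted here; only the interface is shown.
    def __init__(self, url, auto_commit_interval=DEFAULT_COMMIT_INTERVAL,
                 unique_key='_id', chunk_size=DEFAULT_MAX_BULK, **kwargs): ...
    def upsert(self, doc, namespace, timestamp): ...
    def bulk_upsert(self, docs, namespace, timestamp): ...
    def update(self, document_id, update_spec, namespace, timestamp): ...
    def remove(self, document_id, namespace, timestamp): ...
    def search(self, start_ts, end_ts): ...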
We can retrieve Mongo commands with this interface class
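To make the interface concrete, here is a toy version that stores documents in a Python dict instead of a database. It is a sketch under assumptions: InMemoryDocManager is a hypothetical class that does not inherit from DocManagerBase, it handles only the $set and $unset update operators, and "Room B" is made-up data.

```python
class InMemoryDocManager:
    """Toy Doc Manager that keeps documents in a dict keyed by (namespace, id)."""

    def __init__(self, unique_key="_id"):
        self.unique_key = unique_key
        self.docs = {}

    def upsert(self, doc, namespace, timestamp):
        # Insert or overwrite, remembering when the document was last touched.
        self.docs[(namespace, doc[self.unique_key])] = dict(doc, _ts=timestamp)

    def bulk_upsert(self, docs, namespace, timestamp):
        for doc in docs:
            self.upsert(doc, namespace, timestamp)

    def update(self, document_id, update_spec, namespace, timestamp):
        if "$set" in update_spec or "$unset" in update_spec:
            doc = self.docs[(namespace, document_id)]
            doc.update(update_spec.get("$set", {}))
            for field in update_spec.get("$unset", {}):
                doc.pop(field, None)
            doc["_ts"] = timestamp
        else:
            # No operators: treat the spec as a full replacement document.
            replacement = dict(update_spec, **{self.unique_key: document_id})
            self.upsert(replacement, namespace, timestamp)

    def remove(self, document_id, namespace, timestamp):
        self.docs.pop((namespace, document_id), None)

    def search(self, start_ts, end_ts):
        return [d for d in self.docs.values() if start_ts <= d["_ts"] <= end_ts]


dm = InMemoryDocManager()
dm.upsert({"_id": 1, "room": "Auditorium", "venue": "Olympia Stadium"}, "test.talks", 10)
dm.update(1, {"$set": {"room": "Room B"}, "$unset": {"venue": True}}, "test.talks", 20)
print(dm.search(15, 25))
```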
We support Python 2 and Python 3
It will run like an auto importer. You just need to provide the database endpoints
We track the auto generated nodes with the label :Document
How does it work?
When you start Neo4j Doc Manager, a first import will happen from MongoDB to Neo4j
After that, insertions, updates and removals in MongoDB will also have an effect on Neo4j.
Sync Mongo with Neo4j
db.talks.insert( { "session": ...
Document:talks
Root node in Neo4j
{ "session": { "title": "12 Years of Spring: An Open Source Journey", "topics": ["keynote", "spring"], "room": "Auditorium", "speaker": { "name": "Juergen Hoeller" } }, "venue": "Olympia Stadium" }
Keys whose values are nested JSON objects become nodes
{ "session": { "title": "12 Years of Spring: An Open Source Journey", ... }
They get a composite label - :Document + key
{ "session": { "title": "12 Years of Spring: An Open Source Journey", ... }
Document:session
The JSON value of that key is translated into node properties
{ "session": { "title": "12 Years of Spring: An Open Source Journey",
"topics": ["keynote", "spring"], "room": "Auditorium" }
Document:session
title: "12 Years of Spring: An Open Source Journey"
topics: ["keynote", "spring"]
room: "Auditorium"
The node also gets Mongo Object properties (id and timestamp)
Document:session
title: "12 Years of Spring: An Open Source Journey"
topics: ["keynote", "spring"]
room: "Auditorium"
_id: 324553ab342c324d7ff
_ts: 621002135233213
All the top level generated nodes will be connected to the root node
Document:session
Document:talks
The relationship name is a concatenation of the keys:
Document:session
Document:talks
talks_session
Top level properties go to the root node:
{ "session": { ... }, "venue": "Olympia Stadium" }
Document:talks
venue: "Olympia Stadium"
_id: 324553ab342c324d7ff
_ts: 621002135233213
Another example
{ "session": { ... }, "venue": { "address": "...", "city": "Kiev" } }
How does Neo4j Doc Manager transform that into a graph structure?
Document:talks (root node)
Document:session, connected to the root node by talks_session
Document:venue, connected to the root node by talks_venue
Nested documents
"session" : { "title" : "12 Years of Spring: An Open Source Journey", "abstract" : "History of Spring Framework", "speaker" : { "name" : "Josh Long", "company" : "A company" } }
We will keep the node chain:
Document:session (parent node)
Document:speaker (child node)
session_speaker
Don't forget the root node
Document:session
Document:speaker
Document:talks
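The mapping rules so far (root node per collection, nested objects as nodes with a composite label, relationship names built from the keys, id and timestamp copied to every node) can be sketched in plain Python. This is an illustration only, not the neo4j_doc_manager code: document_to_graph is a hypothetical function, it returns tuples instead of writing to Neo4j, and arrays of nested documents are not handled.

```python
def document_to_graph(doc, collection, timestamp):
    """Return (nodes, relationships) for one MongoDB document.

    nodes: list of (labels, properties)
    relationships: list of (parent_key, relationship_name, child_key)
    """
    nodes, relationships = [], []

    def visit(key, value):
        # Every generated node carries the Mongo id and timestamp.
        props = {"_id": doc["_id"], "_ts": timestamp}
        nodes.append((("Document", key), props))
        for child_key, child_value in value.items():
            if child_key == "_id":
                continue
            if isinstance(child_value, dict):
                # Nested JSON becomes its own node, linked to its parent.
                relationships.append((key, f"{key}_{child_key}", child_key))
                visit(child_key, child_value)
            else:
                # Scalars and lists of scalars stay on the node as properties.
                props[child_key] = child_value

    visit(collection, doc)
    return nodes, relationships


talk = {
    "_id": "324553ab342c324d7ff",
    "session": {
        "title": "12 Years of Spring: An Open Source Journey",
        "topics": ["keynote", "spring"],
        "room": "Auditorium",
        "speaker": {"name": "Juergen Hoeller"},
    },
    "venue": "Olympia Stadium",
}
nodes, relationships = document_to_graph(talk, "talks", 621002135233213)
for labels, props in nodes:
    print(":".join(labels), props)
print(relationships)
```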
JSON array
"session" : { "tracks": [{ "main":"Python" }, { "second":"Data" }] ... }
Document:session
Document:track0
talks_track0
talks_track1
Document:track1
We also support explicit ids to create a relationship
Explicit ids
"user": { "name": "Hanneli", "account_id": "32434ab2341192", "url": "medium.com/@hannelita" }
"account" : { "number": "326708", "id": "32434ab2341192" }
user_account
Document:user
Document:account
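One way to read the explicit id example: a key such as account_id points at a node labeled account whose id has the same value. The sketch below follows that reading. The function name, the in-memory node structure and the matching on the "id" property are assumptions drawn from the example above, not the actual neo4j_doc_manager logic.

```python
def explicit_id_relationships(nodes):
    """Find relationships implied by "<label>_id" keys.

    nodes maps a label such as "user" to a list of property dicts.
    Returns a list of (source_label, relationship_name, target_label).
    """
    relationships = []
    for label, items in nodes.items():
        for props in items:
            for key, value in props.items():
                if key == "_id" or not key.endswith("_id"):
                    continue
                target_label = key[: -len("_id")]
                for target in nodes.get(target_label, []):
                    if target.get("id") == value:
                        name = f"{label}_{target_label}"
                        relationships.append((label, name, target_label))
    return relationships


graph_nodes = {
    "user": [{"name": "Hanneli", "account_id": "32434ab2341192",
              "url": "medium.com/@hannelita"}],
    "account": [{"number": "326708", "id": "32434ab2341192"}],
}
print(explicit_id_relationships(graph_nodes))
```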
We also support a configuration file if you don't want to import all your data
We can specify the namespaces that we want to import:
"include": ["test.talks", "docs.info"] (config.json file)
It is also possible to specify the fields and collections via command line:
mongo-connector -m localhost:27017 -t http://localhost:7474/db/data -d neo4j_doc_manager -i room,timeslot,title
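The effect of these two filters can be sketched as a small function. It is only an illustration: filter_document is a hypothetical name, the filter is applied to top-level keys, and keeping _id even when it is not listed is an assumption.

```python
def filter_document(doc, namespace, include_namespaces=None, fields=None):
    """Return the part of doc to import, or None if the namespace is excluded."""
    if include_namespaces is not None and namespace not in include_namespaces:
        return None
    if fields is None:
        return dict(doc)
    # Assumption: the document id is always kept so the node can be tracked.
    keep = set(fields) | {"_id"}
    return {key: value for key, value in doc.items() if key in keep}


include = ["test.talks", "docs.info"]
fields = ["room", "timeslot", "title"]
doc = {"_id": 1, "room": "Auditorium", "title": "Keynote", "abstract": "..."}

print(filter_document(doc, "test.talks", include, fields))
print(filter_document(doc, "test.other", include, fields))
```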
1. Data model is a challenge.
Different representations (Documents -> Graphs)
2. Avoiding orphan nodes
remove, set and unset commands can generate orphans
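To see how an orphan appears, consider a naive removal that deletes a node and its relationships but forgets the node's children. The sketch below detects such leftovers in a small in-memory graph. find_orphans and the data are hypothetical; the real project has to solve this inside Neo4j.

```python
from collections import deque


def find_orphans(nodes, edges, roots):
    """Return the nodes that cannot be reached from any root node.

    nodes: set of node names; edges: list of (parent, child); roots: set of names.
    """
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    seen = {root for root in roots if root in nodes}
    queue = deque(seen)
    while queue:  # breadth-first walk from the roots
        current = queue.popleft()
        for child in children.get(current, []):
            if child in nodes and child not in seen:
                seen.add(child)
                queue.append(child)
    return set(nodes) - seen


# Before: talks -> session -> speaker, nothing is orphaned.
before = find_orphans({"talks", "session", "speaker"},
                      [("talks", "session"), ("session", "speaker")],
                      {"talks"})

# After a naive unset of "session": the speaker node is left behind.
after = find_orphans({"talks", "speaker"}, [], {"talks"})
print(before, after)
```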
3. Batching - maximum of 10k per batch
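A generic way to respect such a limit is to split the work into chunks of at most 10,000 items. The helper below is a sketch with a hypothetical name, not the code used by the project.

```python
def batches(items, size=10_000):
    """Yield lists of at most `size` items, preserving order."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller batch
        yield batch


sizes = [len(batch) for batch in batches(range(25_000))]
print(sizes)
```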
Projects
mongo-connector:
https://github.com/10gen-labs/mongo-connector
neo4j-doc-manager:
https://github.com/neo4j-contrib/neo4j_doc_manager
Next Projects
Neo4j Cassandra connector :)
https://github.com/neo4j-contrib/neo4j-cassandra-connector
Lessons learned
- Polyglot persistence is great; be responsible!
- Graphs can be very useful for simplifying queries
- Real applications: fraud detection
- A university in the UK is using it :)
Thank you :)
Questions?
hannelita@gmail.com
@hannelita
From Documents to Graphs - Buildstuff.ua
By Hanneli Tavante (hannelita)