From documents to graphs
#BuildStuffLT @hannelita
Slides at http://bit.ly/2fYqONz
This talk is about MongoDB and Neo4j :)
Code
http://bit.ly/2g15MPW
Hi!
- Computer Engineer
- Programming
- Electronics
- Math <3 <3
- Physics
- Lego
- Meetups
- Animals
- Coffee
- GIFs
- Pokémon
Disclaimer
Views are my own
Project from late 2015
Mostly for Neo4j 2.x
Project
https://github.com/neo4j-contrib/neo4j_doc_manager
Agenda
- Quick note about document oriented databases
- Graph databases can help your data model
- Creating connectors for MongoDB
- neo4j_doc_manager general architecture
- Data mapping
- Challenges
"We need to restructure our data"
"Relational databases are not enough"
"Polyglot Databases"
Document Oriented DB
- Flexible data model
- Easy to get started
- Easy to represent the data
Store data as Documents!
Imagine that we have talks of a conference
Our Documents
Sometimes we need to get some extra information
Possible questions
- Which talks cover a specific topic (e.g. 'Databases')?
- Which speakers will also talk about this topic?
- Which sessions will be held in the Auditorium and are about this topic?
These are common questions
More questions
- Assuming that I do not want to change rooms, which room should I stay in to see the highest number of sessions on a specific topic?
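To make the room question concrete, here is a minimal Python sketch that answers it in application code over plain documents. The sample talks and the function name `best_room` are made up for illustration; only the document shape (`topics`, `room`, `session.title`) follows the slides. Note that it has to scan every document, which is the kind of work a graph query could express directly.

```python
from collections import Counter

# Hypothetical sample documents, shaped like the talk documents in these slides.
talks = [
    {"session": {"title": "Talk A"}, "topics": ["databases", "graphs"], "room": "Auditorium"},
    {"session": {"title": "Talk B"}, "topics": ["databases"], "room": "Auditorium"},
    {"session": {"title": "Talk C"}, "topics": ["databases"], "room": "Room 2"},
    {"session": {"title": "Talk D"}, "topics": ["python"], "room": "Room 2"},
]


def best_room(talks, topic):
    """Return the room hosting the most sessions tagged with `topic`, or None."""
    counts = Counter(t["room"] for t in talks if topic in t.get("topics", []))
    if not counts:
        return None
    room, _ = counts.most_common(1)[0]
    return room


print(best_room(talks, "databases"))
```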
Further work
- Recommendation system for the talks
- Recommendation system for speakers
- Build a tool to automatically build the sessions timetable based on topic distribution
Looks like we need some graphs!
Graphs are everywhere
TEAM, Neo4j
We can build graphs with information from Mongo
From Documents to Graphs
Neo4j super quick reference
- Graph oriented database
- Pure graph structure that you can persist
- Benefits of graph theory
- Large and active community
- Neo Technology
Mongo Connector
https://github.com/10gen-labs/mongo-connector
Mongo Connector
- You call Mongo Connector (MC), and MC says hi!
- You point it to where your Mongo is
- You point it to where the other database is, through a Doc Manager (DM), e.g. Elasticsearch or Solr
- MC creates a thread to watch Mongo actions (replica set)
- MC calls actions on a Doc Manager
We can translate these actions into a Graph Structure
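A rough Python sketch of that flow: each watched Mongo action is routed to a Doc Manager call. This is not mongo-connector's actual code. `RecordingDocManager` and `dispatch` are hypothetical names, the entries imitate MongoDB oplog fields (`op`, `ns`, `ts`, `o`, `o2`), and plain integers stand in for real timestamps.

```python
class RecordingDocManager:
    """Hypothetical stand-in for a Doc Manager; it only records what it is asked to do."""

    def __init__(self):
        self.calls = []

    def upsert(self, doc, namespace, timestamp):
        self.calls.append(("upsert", namespace, doc["_id"]))

    def update(self, document_id, update_spec, namespace, timestamp):
        self.calls.append(("update", namespace, document_id))

    def remove(self, document_id, namespace, timestamp):
        self.calls.append(("remove", namespace, document_id))


def dispatch(entry, doc_manager):
    """Route one oplog-style entry to the matching Doc Manager action."""
    op, ns, ts = entry["op"], entry["ns"], entry["ts"]
    if op == "i":    # insert: "o" holds the new document
        doc_manager.upsert(entry["o"], ns, ts)
    elif op == "u":  # update: "o2" holds the _id, "o" the update spec
        doc_manager.update(entry["o2"]["_id"], entry["o"], ns, ts)
    elif op == "d":  # delete: "o" holds the _id of the removed document
        doc_manager.remove(entry["o"]["_id"], ns, ts)
    # Any other operation type is ignored in this sketch.


# Made-up entries for illustration.
entries = [
    {"op": "i", "ns": "test.talks", "ts": 1, "o": {"_id": 1, "room": "Auditorium"}},
    {"op": "u", "ns": "test.talks", "ts": 2, "o": {"$set": {"room": "Room 2"}}, "o2": {"_id": 1}},
    {"op": "d", "ns": "test.talks", "ts": 3, "o": {"_id": 1}},
]

manager = RecordingDocManager()
for entry in entries:
    dispatch(entry, manager)
print(manager.calls)
```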
Neo4j Doc Manager
mongo-connector (pip)
py2neo (neo4j)
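from mongo_connector.constants import DEFAULT_COMMIT_INTERVAL, DEFAULT_MAX_BULK
from mongo_connector.doc_managers.doc_manager_base import DocManagerBase

class DocManager(DocManagerBase):
    # Method bodies are omitted here; only the interface signatures are shown.
    def __init__(self, url, auto_commit_interval=DEFAULT_COMMIT_INTERVAL,
                 unique_key='_id', chunk_size=DEFAULT_MAX_BULK, **kwargs): ...
    def upsert(self, doc, namespace, timestamp): ...
    def bulk_upsert(self, docs, namespace, timestamp): ...
    def update(self, document_id, update_spec, namespace, timestamp): ...
    def remove(self, document_id, namespace, timestamp): ...
    def search(self, start_ts, end_ts): ...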
We can retrieve Mongo commands with this interface class
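To show what implementing this interface involves, here is a toy Doc Manager that mirrors documents into a Python dict instead of a database. It is an illustrative sketch, not the neo4j_doc_manager code: `InMemoryDocManager` is a hypothetical name, it does not subclass `DocManagerBase` so that it runs without mongo-connector installed, and it assumes update specs either carry `$set`/`$unset` operators or are full replacement documents.

```python
class InMemoryDocManager:
    """Toy Doc Manager that mirrors documents into a dict keyed by (namespace, id)."""

    def __init__(self, unique_key="_id"):
        self.unique_key = unique_key
        self.docs = {}

    def upsert(self, doc, namespace, timestamp):
        # Insert the document, or overwrite the copy we already hold.
        self.docs[(namespace, doc[self.unique_key])] = dict(doc)

    def bulk_upsert(self, docs, namespace, timestamp):
        for doc in docs:
            self.upsert(doc, namespace, timestamp)

    def update(self, document_id, update_spec, namespace, timestamp):
        key = (namespace, document_id)
        if "$set" in update_spec or "$unset" in update_spec:
            doc = self.docs[key]
            doc.update(update_spec.get("$set", {}))
            for field in update_spec.get("$unset", {}):
                doc.pop(field, None)
        else:
            # No operators: treat the spec as a full replacement document.
            replacement = dict(update_spec)
            replacement[self.unique_key] = document_id
            self.docs[key] = replacement

    def remove(self, document_id, namespace, timestamp):
        self.docs.pop((namespace, document_id), None)


dm = InMemoryDocManager()
dm.upsert({"_id": 1, "room": "Auditorium", "title": "Keynote"}, "test.talks", 0)
dm.update(1, {"$set": {"room": "Room 2"}, "$unset": {"title": True}}, "test.talks", 1)
print(dm.docs)
```

A graph-backed manager would do the same bookkeeping, but create, change, and delete nodes and relationships instead of dict entries.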
We support Python 2 and Python 3
It will run like an auto importer. You just need to provide the database endpoints
We track the auto generated nodes with the label :Document
Sync Mongo with Neo4j
db.talks.insert( { "session": ...
Document:talks
Root node in Neo4j
{ "session": { "title": "12 Years of Spring: An Open Source Journey" }, "topics": ["keynote", "spring"], "room": "Auditorium", "speaker": { "name": "Juergen Hoeller" } }
Document:session
Document:speaker
JSON properties become node properties
All the nodes are connected to the root node
Nested documents
"session" : { "title" : "12 Years of Spring: An Open Source Journey", "abstract" : "Spring emerged as a core open source project in early 2003 and evolved to a broad portfolio of open source projects up until 2015.", "conference" : { "city" : "London" } }
Nested documents
"session" : { "title" : "12 Years of Spring: An Open Source Journey", "abstract" : "Spring emerged as a core open source project in early 2003 and evolved to a broad portfolio of open source projects up until 2015.", "conference" : { "city" : "Dublin" } }
Nested documents
Document:session
Document:conference
Child node
Parent node
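The mapping described so far (scalar JSON properties become node properties, nested documents become child nodes) can be sketched in plain Python without a database. This is an approximation under assumptions, not the project's algorithm: `document_to_graph` is a hypothetical name, the `parent_child` relationship naming is a guess, and each child is linked to its direct parent.

```python
def document_to_graph(doc, root_label):
    """Turn a nested dict into (nodes, relationships).

    nodes: list of (label, properties)
    relationships: list of (parent_label, relationship_name, child_label)
    """
    nodes, rels = [], []

    def visit(label, body, parent):
        # Anything that is not a nested document stays on this node as a property.
        props = {k: v for k, v in body.items() if not isinstance(v, dict)}
        nodes.append((label, props))
        if parent is not None:
            rels.append((parent, f"{parent}_{label}", label))
        # Each nested document becomes a child node of the current node.
        for key, value in body.items():
            if isinstance(value, dict):
                visit(key, value, label)

    visit(root_label, doc, None)
    return nodes, rels


talk = {
    "session": {
        "title": "12 Years of Spring: An Open Source Journey",
        "conference": {"city": "London"},
    },
    "room": "Auditorium",
    "speaker": {"name": "Juergen Hoeller"},
}

nodes, rels = document_to_graph(talk, "talks")
for rel in rels:
    print(rel)
```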
JSON array
"session" : { "tracks": [{ "main":"Python" }, { "second":"Data" }] ... }
JSON array
Document:session
Document:track0
talks_track0
talks_track1
Document:track1
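The array case can be sketched the same way: one node and one relationship per element, numbered by position. The function `expand_array` and its parameters are hypothetical. The label base (`track`) and relationship prefix (`talks`) are passed in by hand to match the names on the slide; how the project derives them is not reproduced here.

```python
def expand_array(parent_label, label_base, rel_prefix, items):
    """Create one (label, properties) node and one relationship per array element."""
    nodes, rels = [], []
    for index, item in enumerate(items):
        label = f"{label_base}{index}"  # track0, track1, ...
        nodes.append((label, dict(item)))
        rels.append((parent_label, f"{rel_prefix}_{label}", label))
    return nodes, rels


tracks = [{"main": "Python"}, {"second": "Data"}]
nodes, rels = expand_array("session", "track", "talks", tracks)
print(nodes)
print(rels)
```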
We also support explicit ids to create a relationship
Explicit ids
{ "name": "Hanneli", "account_id": "32434ab2341192", "url": "medium.com/@hannelita" }
session_account
Document:session
Document:account
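One way to picture the explicit-id rule: a key such as `account_id` names the node type to link to. The sketch below only extracts those links from a document. The `_id` suffix convention and the function name `explicit_id_links` are assumptions for illustration, not the project's documented behaviour.

```python
def explicit_id_links(doc, suffix="_id"):
    """Return {target_label: id_value} for keys such as "account_id"."""
    links = {}
    for key, value in doc.items():
        # Skip Mongo's own "_id" field; it identifies the document itself.
        if key != "_id" and key.endswith(suffix):
            links[key[: -len(suffix)]] = value
    return links


doc = {"name": "Hanneli", "account_id": "32434ab2341192", "url": "medium.com/@hannelita"}
print(explicit_id_links(doc))
```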
We also support a configuration file if you don't want to import all your data
We can specify the namespaces that we want to import:
"include": ["test.talks", "docs.info"] (config.json file)
It is also possible to specify the fields and collections via command line:
mongo-connector -m localhost:27017 -t http://localhost:7474/db/data -d neo4j_doc_manager -i room,timeslot,title
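The effect of these two filters can be imitated in a few lines of Python. This is only an illustration of the idea, not mongo-connector's implementation: `should_sync` and `keep_fields` are hypothetical helpers, and keeping `_id` regardless of the field list is an assumption of this sketch.

```python
def should_sync(namespace, include):
    """Only namespaces ("database.collection") on the include list are imported."""
    return namespace in include


def keep_fields(doc, fields):
    """Keep only the listed fields, plus "_id" so the document stays identifiable."""
    kept = {key: value for key, value in doc.items() if key in fields}
    if "_id" in doc:
        kept["_id"] = doc["_id"]
    return kept


include = ["test.talks", "docs.info"]
fields = ["room", "timeslot", "title"]
doc = {"_id": 1, "room": "Auditorium", "title": "Keynote", "notes": "internal"}

if should_sync("test.talks", include):
    print(keep_fields(doc, fields))
```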
1. Data model is a challenge.
Different representations (Documents -> Graphs)
2. Avoiding orphan nodes
remove, set and unset commands can generate orphans
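A small sketch of why orphans appear, using a made-up in-memory graph. If the `session` node is dropped (for example after an unset of that field), its child `conference` loses its only relationship. The helper `find_orphans` is hypothetical and only detects such nodes; it says nothing about how the project cleans them up.

```python
def find_orphans(nodes, relationships):
    """Return the nodes that take part in no relationship at all, sorted by name."""
    connected = set()
    for start, end in relationships:
        connected.add(start)
        connected.add(end)
    return sorted(node for node in nodes if node not in connected)


# Graph before: talks -> session -> conference, and talks -> speaker.
# After removing the session node, both of its relationships disappear too.
nodes_after = ["talks", "conference", "speaker"]
rels_after = [("talks", "speaker")]

print(find_orphans(nodes_after, rels_after))
```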
3. Batching - maximum of 10k per batch
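Respecting a batch limit usually comes down to chunking the work. Below is a generic Python sketch with the 10k figure from the slide as the default size. The helper name `batches` is hypothetical, and what counts as one item in a batch (documents, statements) is left open.

```python
from itertools import islice


def batches(items, size=10_000):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


sizes = [len(batch) for batch in batches(range(25_000))]
print(sizes)
```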
Projects
mongo-connector:
https://github.com/10gen-labs/mongo-connector
neo4j-doc-manager:
https://github.com/neo4j-contrib/neo4j_doc_manager
Next Projects
Neo4j Cassandra connector :)
https://github.com/neo4j-contrib/neo4j-cassandra-connector
Lessons learned
- Polyglot persistence is great; be responsible!
- Graphs can be very useful for simplifying queries
- Real applications: fraud detection
- University (UK) is using it :)
Thank you :)
Questions?
hannelita@gmail.com
@hannelita
From Documents to Graphs - Buildstuff.lt
By Hanneli Tavante (hannelita)