Bloggo

from 0 to Blog with Spring Boot and MongoDB

Who Am I?

Tonino Catapano

@tonnoz

Software Dev @

Drone racing hobbyist

Agenda

NOSQL and Mongo: a bit of History
Comparison with Relational & Why Mongo
CRUD (REPL vs Java)
Bloggo: a Spring Boot blog backed by MongoDB
Schema Design
Performance
Extra & Goodies

NoSQL?

A NoSQL database is a non (striclty) SQL database.
It provides a mechanism for storage and retrieval of data that is not modeled in tabular form like traditional RDBMS do. [wiki]

#NoSQL

Existed since 1960s, but they were called differently back then.
Only from 2009 they gained the name NoSQL because they needed a short hashtag to tweet about the first meetup in SF. Nobody expected to be the next big db movement!

Big Data

A term for data sets that are so large that traditional methods of storage & processing are inadequate

Massive increase in data volume within the last decade

Social networks, search engines etc..

Challenges in storage, capture, analysis, transfer etc..

Vertical vs Horizontal Scaling

Stuff was getting slower. Upgrading hardware was not enough anymore (Scale-in)

Big players said: «We need to change; I will develop my own data storage!»

Scale-in is expensive and many companies were going in the direction of having "many small clusters"

Situation today

Different NoSQL db for different purposes, they share common traits:

non relational

open-source

cluster-friendly

schema-less

Advantages of NoSQL DBs

If designed correctly, handles Big Data successfully

Extreme flexible Data Models - Schema less => but you need to handle the consistency of your data in the application layer

Cheap(er) to manage

Easy Scaling

Schema Migrations tools are optional - Flyway, fly away!

Advantages of Relational DBs

Relational Databases are better for relational data! (duh!)

Normalization (eg 3NF) => eliminate redundancy, less space required, faster access in some cases

SQL is well known language (it's been around for 30 years)

Data Integrity => using foreign key constraints help reducing inconsistency

A C I D

Mongo and JSON

Mongo is a document based DB: every model, query and data stored in MongoDB is a BSON that is an extension of JSON (Javascript Object Notation)

Can you tell how many tables in a normalized relational database would be needed to store the same amount information displayed on the right (single document)?

« In a nutshell, MongoDB allows developers to design data models that support common data access patterns.»

{
"title":"Game of Thrones",
"producers":["Television 360","Grok! Television",
             "Generator Entertainment"..],
"createdBy":["David Benioff", "D. B. Weiss"],
"firstSeasonReleased": 2009,
"locations":[{"country" : "Northern Ireland", 
              "city":"...", "dateShooted": [
              {..},..]} ],
"website":"https://www.hbo.com/game-of-thrones",
"characters":["...","..."],
"seasons":[{"ordinal": 1, "newCharacters":[],
            "plot": ""}....]
}

tv series

The _id field

All documents in MongoDB must have an _id field that is unique (key).

If you don't specify one when inserting, Mongo will create one for you as ObjectId() that is guaranteed to be unique within that collection.

WHY BSON?

lightweight

traversable

efficient
ISODate() natively
but MAX 16 MB

JSON value types are limited e.g.:

There is a single number type (no floats)
JSON doesn't support Date format natively
JSON does not support binary data

see details at bsonspec.org/spec.html

BSON on the other hand is:

C R U D

Create (DDL):

SQL: 
create table users(user_id varchar2(10),name varchar2(10),age number)

Mongo:
db.createCollection("users"); [optional]

Read:

```
SQL:
Select * from users Where age!=23
```
```
Mongo:
db.users.find({ age: {$ne:23} })
```

Update:

SQL:
UPDATE Customers
SET customerId = 1
WHERE customerID = 0

Mongo:
db.users.update({cutomerId:0},
{ $set: { customerId: 1} }, { multi: true })

Delete:

```
SQL:
DELETE FROM USERS WHERE ID = 6
```
```
Mongo:
db.users.remove({id:6})
```

https://docs.mongodb.com/manual/crud/

https://docs.mongodb.com/manual/reference/mongo-shell/

DDL in Mongo is less strict than SQL. Some DML statements in comparison:

Creating & Reading documents:

REPL!

The query language for mongoDB use a query-by-example strategy, let's give it a look!

Let's get Java: Document obj

Beside
There is also a type safe object called

It requires the explicit field type when declaring a Document => avoid Runtime Exceptions for type mismatch.

Document()

BsonDocument()

Exercise0: Java warm up

you@yourmachine:~$ git checkout exercise0

you@yourmachine:~$ git clone https://github.com/tonycatapano/Bloggo.git
or via ssh
you@yourmachine:~$ git clone git@github.com:tonycatapano/Bloggo.git

Bloggo: architecture

Bloggo: some clarifications

No fancy UI
No Single Page App
No models /pojos => no ODM
(dont try this at home)
Skipping all concerns regarding
Security (but please read: https://docs.mongodb.com/
manual/administration/security-checklist/ )

@MongoCollection(name = "comments")
public class Comment {
    @Id
    public String id;
    public String text;
}

public class BlogPost {
    @Id
    public String id;
    @ObjectId
    public List<DBRef<Comment, String>> comments;
}

BlogPost post = coll.findOneById(someId);
for (DBRef<Comment, String> comment : post.comments) {
    System.out.println(comment.fetch().text);
}

DAO Layer

Demo!

Schema Design

Starting situation:

Goals of normalization:

Free the DB from modification anomalies
Minimize re-design when extending
Avoid bias towards any particular access pattern

Thinking relational

No joins embedded in the kernel of Mongo
No constraint system
Lesson: If you find yourself thinking relational while designin with Mongo, you're most probably wrong

Designing with Mongo

In designing with Mongo you have to think out of the box. This is not relational.

Always tradeoff between linking or embedding.

Decision based on:

Data access pattern
cardinality
read/write frenquency

OneToOne relationship

Either link or Embed!

{
  _id: 20,
  name: "Ragnar",
  resume: 1
}

{
  _id:1,
  jobs:[],
  education:[],
  employee:20
}

Employee

Resume

Considering:

Frequency of access
Writing only a field all the time
Easy atomicity

{
  _id: 20,
  name: "Ragnar",
  resume: {
    jobs:[],
    education:[]
  }
}

{
  _id:1,
  jobs:[],
  education:[],
  employee: {
    name: "Ragnar"
  }
}

OneToMany relationship

{
  _id: 20,
  name: "Ragnar",
  ...
}

{
  _id:1,
  name: "NYC",
  inhabitants:[1,2,3],
...
}

Person

City

Solution: true linking

{
  _id: 20,
  name: "Ragnar",
  city: {name: "NYC", 
         zipCode: 123,
         population:8.538.000},
  ...
}

{
  _id: "1",
  name: "Ragnar",
  city: "NYC",
  ...
}

{
  _id:"NYC",
  zipCode: 123,
  population:8.538.000
}

OneTo"Few" relationship

when the Many are in the order of "few"

{
  _id: 1,
  name: "Mark",
  surname:"Zuckerberg",
  cars:["Ferrari", "Lamborghini"]
}

Person

EMBED!

ManyToMany relationship

{
  _id: 11,
  title: "The Matrix",
  director: "The Wachowski 
             'siblings'",
  producer: "Joel Silver",
  peopleInvolved: [1,2,3...,n]
}

{
  _id: 222,
  name: "Nicola Nardone",
  role: "bullet-time expert"
  workedOnMovies: [1,2,3,...,n]
}

Movies

People

this is ok because data does not change otherwise:
do not embed
two way linking only if necessary

{
  idMovie:11
  idPerson:22
}

PeopleMovies

"FewToFew" relationship

two way linking only if necessary

{
  _id: 20,
  title: "The Sagas of Ragnar Lodbrok",
  author:[1,2]
}

{
  _id:1,
  author_name:"Ben Waggoner",
  education:[],
  books:[11,20,42]
}

Book

Authors

do not embed

Representing trees (e.g. categories)

{
  _id: 10,
  category: 7,
  prod_name: "16 GB USB DRIVE 3.0"
}

{
  _id:7,
  category_name:"usb drives",
  parent:6
}

Product

Living without constraints and transactions

No Foreign keys ⇓ you have to enforce constraints manually in your app layer but embedding helps
No transactions =>

Restructure code: single document FTW

Implement locking, semaphores, critical sections...

Tolerate: e.g. Facebook feed

Bloggo: just 2 collections!

After what we just saw we can design our blog with just 2 (+1) collections:

1 for the posts & comments
1 for users
1 for session management

{
  "_id" : ObjectId("59e336472687c8737a5c2735"),
  "title" : "A RANDOM BLOG POST",
  "author" : "ragnar",
  "body" : "first post tahahhahaha",
  "tags" : [
            "uolo",
            "miesta",
            "asdoas",
            "cocacola"
	],
  "comments" : [ {
    "author" : "Tony",
    "email" : "info@toninocatapano.com",
    "body" : "this blog in Mongo sucks"
  } ],
  "date" : ISODate("2017-10-15T10:19:51.691Z"),
  "permalink" : "a_random_blog_post"
}

{
  "_id" : "ragnar",
  "password" : "HScSZiHvv73vv71kY++/ve+/vQrvv73vv71R77+9,-1847318508",
  "email" : "therealspartan77@vikings.org"
}

posts

users

Exercise1: Sign up and Sign in

you@yourmachine:~$ git checkout exercise1

Exercise2: Posts and comments

you@yourmachine:~$ git checkout exercise2

Performance

Two ways to impact latency & throughput of database queries:

adding indexes

distribute load across multiple servers with sharding

From Mongo 3.0 we have Pluggable Storage engines:

MMAPv1

WiredTiger (default since 3.2)

MMAPv1 vs Wired Tiger

Collection level concurrency (file)
In place updates: Documents are stored in power of 2 size bytes
No memory handling
Multiple readers, single writer

Document level concurrency
Compression (data & indexes)
Memory handling
No in place updates

MMAPv1

Wired Tiger (default since 3.2)

Indexes in theory

{name:"Doom III", genre:"FPS", price:40, extras:"Doom 1990 maps"}

videogames

collection on disk

indexes on name

Amnesia

Zelda

{..}

{..}

{..}

{..}

{..}

{..}

{..}

{..}

{..}

{..}

Indexes in practice

Will my app be available during indexes creation?

fast
block all writers & readers in db (even with W.T.)

slower
don't block readers nor writers
from 2.4+ multiple creation even on the same db

Foreground (default)

Background

alternative: replica set redirection

nb from 2.6+ after index creation on primary, will start on secondary automatically.

Index size

To benefits of the speed of having indexes make sure that the size of your indexes fits into your memory, let's see how to assess that.

Memory

Working set

Indexes

slow

Covered query

A query where all the queried fields AND projected fields are indexed. This kind of query is, as you may expect, the fastest.