Data aggregation
(using MongoDB)

Agenda

(Very) quick tutorial on MongoDB
Very quick tutorial on the agg pipeline
And then the interesting part:
What are the performance implications of various operations
- Selecting
- Projecting
- Sorting
- Grouping
- Unwinding
- ...

Quick presentation on MongoDB

Document store
NoSQL - meaning, avoid joins
You can consider collection == table
Shards for fast writing, replication, etc
All is JavaScript/JSON (washing hands is a healthy habit anyway)
You don't have to define the schema right from the beginning
- In practice you'll want a static schema as much as possible but...
- ... It avoids hours long ALTER TABLE statements

Local install & import sample data

docker run --name mongo -d mongo:latest
docker exec -it mongo mongo

And you're in Mongo shell, where you can execute queries.

Ctrl+D to quit

Download https://github.com/ozlerhakan/mongodb-json-files/blob/master/datasets/country.json
and copy it to your docker container, then import it:

docker cp country.json mongo:.
docker exec -it mongo mongoimport country.json

Look at a doc

use test; // Default DB
db.countries.find({"translations.fra.common":"Samoa américaines"}).pretty()

Aggregation framework -> PIPELINE

Why "Pipeline"?

Each stage output is the input of the next.
Reducing the outputs is key to improve efficiency

Aggregation framework

There are a lot of possible stages.

$match => lets you select what documents you want to work on

$project: select what field you want, potentially changing their format

$unwind: Explodes a document by a specified array field (will create as many duplicated docs as items in the array)

$group: No surprise here

$sort: Same - note the 1/-1 is how you deal with ASC/DESC

$lookup: if you need to join collection (maybe use SQL?)

$graphLookup: Lookup in the same collection, recursively

$geoNear: geographical search

Aggregation framework

So let's play a bit!

Find what region has the most country

db.country.aggregate([

{$group: {

_id: "$region",

count:{"$sum":1}}

}]);

What region + subregion has the most contries, ordered

You will need some reading I guess:
https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline/
https://docs.mongodb.com/manual/reference/operator/aggregation/

db.country.aggregate([

{$group: {

_id: {

continent:"$region",

subregion:"$subregion"

count:{"$sum":1}}

{$sort: {count:-1}}

]);

Example agg

db.country.aggregate([
   {"$project": {
     "name.common":1, 
     "languages_array": {
       "$objectToArray":"$languages"
     }
   }}, 
   {"$unwind":"$languages_array"}, 
   {"$group": {
     "_id": "$languages_array.v", 
     "count": {"$sum": 1}
   }}, 
   {"$match": {count:{"$gt":1}}}, 
   {"$sort": {count:-1}}
]);

Number of countries per language

Example agg

Number of language per country

db.country.aggregate([
   {"$project": {
     "name.common":1, 
     "languages_array": {
       "$objectToArray":"$languages"
     }
   }}, 
   {"$unwind":"$languages_array"}, 
   {"$group": {
     "_id": "$name.common", 
     "count": {"$sum": 1}
   }}, 
   {"$match": {count:{"$gt":1}}}, 
   {"$sort": {count:-1}}
]);

Data aggregation (using MongoDB)

Agenda

Quick presentation on MongoDB

Local install & import sample data

Look at a doc

Aggregation framework -> PIPELINE

Aggregation framework

Aggregation framework

Example agg

Example agg

Example agg

Data aggregation
(using MongoDB)