Data aggregation
(using MongoDB)

Agenda

  • (Very) quick tutorial on MongoDB
  • Very quick tutorial on the agg pipeline
  • And then the interesting part:
  • What are the performance implications of various operations
    • Selecting
    • Projecting
    • Sorting
    • Grouping
    • Unwinding
    • ...

Quick presentation on MongoDB

  • Document store
  • NoSQL - meaning, avoid joins
  • You can consider collection == table
  • Shards for fast writing, replication, etc
  • All is JavaScript/JSON (washing hands is a healthy habit anyway)
  • You don't have to define the schema right from the beginning
    • In practice you'll want a static schema as much as possible but...
    • ... It avoids hours long ALTER TABLE statements

Local install & import sample data

docker run --name mongo -d mongo:latest
docker exec -it mongo mongo

And you're in Mongo shell, where you can execute queries.

Ctrl+D to quit

docker cp country.json mongo:.
docker exec -it mongo mongoimport country.json

Look at a doc

use test; // Default DB
db.countries.find({"translations.fra.common":"Samoa américaines"}).pretty()

Aggregation framework -> PIPELINE

Why "Pipeline"?

Each stage output is the input of the next.
Reducing the outputs is key to improve efficiency

Aggregation framework

There are a lot of possible stages.

 

$match => lets you select what documents you want to work on

$project: select what field you want, potentially changing their format

$unwind: Explodes a document by a specified array field (will create as many duplicated docs as items in the array)

$group: No surprise here

$sort: Same - note the 1/-1 is how you deal with ASC/DESC

 

$lookup: if you need to join collection (maybe use SQL?)

$graphLookup: Lookup in the same collection, recursively

$geoNear: geographical search

Aggregation framework

So let's play a bit!

Find what region has the most country

db.country.aggregate([

{$group: {

  _id: "$region",

  count:{"$sum":1}}

}]);

What region + subregion has the most contries, ordered

db.country.aggregate([

{$group: {

  _id: {

    continent:"$region",

    subregion:"$subregion"

  },

  count:{"$sum":1}}

},

{$sort: {count:-1}}

]);

Example agg

db.country.aggregate([
   {"$project": {
     "name.common":1, 
     "languages_array": {
       "$objectToArray":"$languages"
     }
   }}, 
   {"$unwind":"$languages_array"}, 
   {"$group": {
     "_id": "$languages_array.v", 
     "count": {"$sum": 1}
   }}, 
   {"$match": {count:{"$gt":1}}}, 
   {"$sort": {count:-1}}
]);

Number of countries per language

Example agg

Number of language per country

db.country.aggregate([
   {"$project": {
     "name.common":1, 
     "languages_array": {
       "$objectToArray":"$languages"
     }
   }}, 
   {"$unwind":"$languages_array"}, 
   {"$group": {
     "_id": "$name.common", 
     "count": {"$sum": 1}
   }}, 
   {"$match": {count:{"$gt":1}}}, 
   {"$sort": {count:-1}}
]);

Example agg

Made with Slides.com