Data aggregation
(using MongoDB)
Agenda
- (Very) quick tutorial on MongoDB
- Very quick tutorial on the agg pipeline
- And then the interesting part:
- What are the performance implications of various operations
- Selecting
- Projecting
- Sorting
- Grouping
- Unwinding
- ...
Quick presentation on MongoDB
- Document store
- NoSQL - meaning, avoid joins
- You can consider collection == table
- Shards for fast writing, replication, etc
- All is JavaScript/JSON (washing hands is a healthy habit anyway)
- You don't have to define the schema right from the beginning
- In practice you'll want a static schema as much as possible but...
- ... It avoids hours long ALTER TABLE statements
Local install & import sample data
docker run --name mongo -d mongo:latest docker exec -it mongo mongo
And you're in Mongo shell, where you can execute queries.
Ctrl+D to quit
docker cp country.json mongo:. docker exec -it mongo mongoimport country.json
Look at a doc
use test; // Default DB db.countries.find({"translations.fra.common":"Samoa américaines"}).pretty()
Aggregation framework -> PIPELINE
Why "Pipeline"?
Each stage output is the input of the next.
Reducing the outputs is key to improve efficiency
Aggregation framework
There are a lot of possible stages.
$match => lets you select what documents you want to work on
$project: select what field you want, potentially changing their format
$unwind: Explodes a document by a specified array field (will create as many duplicated docs as items in the array)
$group: No surprise here
$sort: Same - note the 1/-1 is how you deal with ASC/DESC
$lookup: if you need to join collection (maybe use SQL?)
$graphLookup: Lookup in the same collection, recursively
$geoNear: geographical search
Aggregation framework
So let's play a bit!
Find what region has the most country
db.country.aggregate([
{$group: {
_id: "$region",
count:{"$sum":1}}
}]);
What region + subregion has the most contries, ordered
You will need some reading I guess:
https://docs.mongodb.com/manual/reference/operator/aggregation-pipeline/
https://docs.mongodb.com/manual/reference/operator/aggregation/
db.country.aggregate([
{$group: {
_id: {
continent:"$region",
subregion:"$subregion"
},
count:{"$sum":1}}
},
{$sort: {count:-1}}
]);
Example agg
db.country.aggregate([ {"$project": { "name.common":1, "languages_array": { "$objectToArray":"$languages" } }}, {"$unwind":"$languages_array"}, {"$group": { "_id": "$languages_array.v", "count": {"$sum": 1} }}, {"$match": {count:{"$gt":1}}}, {"$sort": {count:-1}} ]);
Number of countries per language
Example agg
Number of language per country
db.country.aggregate([ {"$project": { "name.common":1, "languages_array": { "$objectToArray":"$languages" } }}, {"$unwind":"$languages_array"}, {"$group": { "_id": "$name.common", "count": {"$sum": 1} }}, {"$match": {count:{"$gt":1}}}, {"$sort": {count:-1}} ]);
Example agg
MongoDB aggregations
By fguery
MongoDB aggregations
- 283