Facebook: 20+ TB compressed/day
CERN: 40 TB / day (~15 PB / year)
NYSE: 1 TB / day
R1: 4 TB / day
And growth is accelerating
Horizontal scalability = $$$
Distributed File System
Fault tolerant
Handles replication
Parallel computation
Scalable
Split the data according to one variable (e.g. date) and allocate it to the various nodes (computers)
With date partitioning, the Master allocates the data for each date across the nodes (N1, N2, ...)
(we will come back to that later... But don't do it, really)
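A minimal sketch of how date-based partitioning typically looks in Hive, the tool used in the rest of these slides (the table name, columns, and location below are hypothetical, not from the deck):

-- Each distinct dt value becomes its own directory on the distributed
-- file system, so the data can be spread across nodes and queries that
-- filter on dt only read the matching partitions.
CREATE EXTERNAL TABLE pixel_firings_sketch (
  uuid  STRING,
  pixel INT
)
PARTITIONED BY (dt INT)  -- e.g. 20180101
STORED AS ORC
LOCATION '/data/pixel_firings_sketch';

-- Only the 20180101 partition is read here, not the whole table
SELECT COUNT(*) FROM pixel_firings_sketch WHERE dt = 20180101;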
This looks like SQL, but it is different
Hive is not a language; it is only a wrapper that translates queries into MapReduce jobs
+++ ETL (Extract, Transform, Load)
--- UPDATE, INSERT (row-level writes)
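A hedged sketch of the ETL pattern Hive is built for: recompute and bulk-overwrite a whole partition instead of touching individual rows (the daily_pixel_counts target table is hypothetical; radiumone_json.pixel_firings is the table used later in these slides):

-- Bulk rewrite of one day's aggregate in a single pass (what Hive is good at)
INSERT OVERWRITE TABLE daily_pixel_counts PARTITION (dt = 20180101)
SELECT
  pixel,
  COUNT(uuid) AS nb_people
FROM radiumone_json.pixel_firings
WHERE dt = 20180101
GROUP BY pixel;

-- Row-level changes like the following are what Hive is NOT meant for:
-- UPDATE daily_pixel_counts SET nb_people = 0 WHERE pixel = 45022;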
A simple programming model that applies to many large-scale computing problems
The outline stays the same; only the map and reduce functions change to fit the problem
Map: process each input record independently and emit key/value pairs
Reduce: aggregate all the values that share the same key
Real life example:
SELECT
-- Reduce
pixel,
COUNT(uuid) AS nb_people
FROM radiumone_json.pixel_firings
WHERE
-- Map
dt BETWEEN 20180101 AND 20180115
AND pixel IN (45022, 45023)
GROUP BY pixel -- Reduce

Because when you launch a Hive SQL query, this is the only thing you will see to track it:
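A related sketch: prefixing the same query with Hive's standard EXPLAIN statement prints the MapReduce stages it compiles into, which is one way to check the Map / Reduce split annotated above (only the statement is shown here, not its output):

EXPLAIN
SELECT
  pixel,
  COUNT(uuid) AS nb_people
FROM radiumone_json.pixel_firings
WHERE
  dt BETWEEN 20180101 AND 20180115
  AND pixel IN (45022, 45023)
GROUP BY pixel;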