Hui Hu Ph.D.
Department of Epidemiology
College of Public Health and Health Professions & College of Medicine
Which of these would you consider big data?
Big data is a loosely defined term for data sets so large and complex that they become awkward to work with using standard statistical software.
dat.txt (150 MB) is split into three blocks:
blk_1 | 64 MB |
blk_2 | 64 MB |
blk_3 | 22 MB |
[Figure: blk_1, blk_2, and blk_3 stored on the Datanodes, alongside the Namenode]
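The split of dat.txt into 64 MB, 64 MB, and 22 MB blocks is simple integer arithmetic. A minimal sketch follows, assuming a fixed 64 MB block size as on the slide; the function name `split_into_blocks` is hypothetical and not part of Hadoop.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks needed to hold a file."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    # Full blocks first, then one smaller block for whatever is left over.
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


# The 150 MB file from the slide (64 MB block size is the slide's value).
blocks = split_into_blocks(150)
for i, size in enumerate(blocks, start=1):
    print("blk_{0}: {1} MB".format(i, size))
```

Only the last block is smaller than the block size; every other block is full.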
Hadoop replicates each block three times
[Figure: three copies each of blk_1, blk_2, and blk_3 spread across the Datanodes]
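Why three copies help can be illustrated with a toy placement. The sketch below assigns replicas round-robin to four hypothetical datanodes and checks which blocks survive node failures. Round-robin is an assumption made for illustration only; HDFS's actual placement policy is more involved, and the node names and both function names are invented here.

```python
def place_replicas(block_ids, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive nodes starting at offset i, wrapping around the list.
        placement[block] = [
            datanodes[(i + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


def available_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    failed = set(failed_nodes)
    return sorted(
        block
        for block, nodes in placement.items()
        if any(node not in failed for node in nodes)
    )


datanodes = ["node_1", "node_2", "node_3", "node_4"]  # hypothetical names
placement = place_replicas(["blk_1", "blk_2", "blk_3"], datanodes)
print(placement)
print(available_blocks(placement, failed_nodes=["node_2", "node_3"]))
```

With three replicas on distinct nodes, any two nodes can fail and every block still has a live copy.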
NFS
MapReduce: designed to process data in parallel
2017-01-01 | Miami | 2300.24 |
2017-01-01 | Gainesville | 3600.23 |
2017-01-02 | Jacksonville | 1900.34 |
2017-01-02 | Miami | 8900.45 |
Mappers emit intermediate records (city, sales):
Miami | 2300.24 |
NYC | 9123.45 |
Boston | 8123.45 |
LA | 3123.45 |
NYC | 7123.45 |
Boston | 6123.45 |
Reducers each receive all records for their keys:
NYC, LA
Miami, Boston
Mappers -> Intermediate records (key, value) -> Shuffle and sort -> Reducers -> Results (key, value)
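The map, shuffle-and-sort, and reduce stages can be imitated in memory on the four sample records shown earlier. This is a single-process sketch, not how Hadoop executes a job; the field meanings (date, city, sales) are assumed from the sample rows, and all function names are hypothetical.

```python
from collections import defaultdict

# The four sample records from the slide, in their pipe-delimited form.
RECORDS = [
    "2017-01-01 | Miami | 2300.24 |",
    "2017-01-01 | Gainesville | 3600.23 |",
    "2017-01-02 | Jacksonville | 1900.34 |",
    "2017-01-02 | Miami | 8900.45 |",
]


def map_record(line):
    """Map step: turn one record into an intermediate (key, value) pair."""
    fields = [field.strip() for field in line.split("|")]
    date, city, sales = fields[:3]  # assumed column meanings
    return city, float(sales)


def shuffle_and_sort(pairs):
    """Group values by key and return the groups sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_group(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


intermediate = [map_record(line) for line in RECORDS]
results = [reduce_group(key, values) for key, values in shuffle_and_sort(intermediate)]
for city, total in results:
    print("{0}\t{1:.2f}".format(city, total))
```

In a real job the map calls run in parallel on different blocks, and the shuffle moves every record with the same key to the same reducer.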
[Figure: the replicated blocks blk_1, blk_2, and blk_3 on the Datanodes, with a Job Tracker and Task Trackers]
date | time | store | item | cost | payment |
---|---|---|---|---|---|
2012-01-01 | 12:01 | Orlando | Music | 13.98 | Visa |
How do we find the total sales per store? (What is the key-value pair?)
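import sys

def mapper():
    for line in sys.stdin:
        # Each record is one tab-separated line with six fields.
        data = line.strip().split("\t")
        if len(data) == 6:  # skip malformed lines
            date, time, store, item, cost, payment = data
            # Emit the intermediate record: key = store, value = cost.
            print("{0}\t{1}".format(store, cost))

Output (key, value): store, cost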
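The mapper's per-line logic can be checked without Hadoop by pulling it into a plain function. The sketch below does that for the sample purchase row; `map_line` is a hypothetical helper, and the sample line assumes the fields are tab-separated as the mapper expects.

```python
def map_line(line):
    """Return the 'store<TAB>cost' record for one line, or None if malformed."""
    data = line.strip().split("\t")
    if len(data) != 6:
        return None  # wrong number of fields: ignore the line
    date, time, store, item, cost, payment = data
    return "{0}\t{1}".format(store, cost)


# The sample row from the slide, written with tab separators.
sample = "2012-01-01\t12:01\tOrlando\tMusic\t13.98\tVisa"
print(map_line(sample))
print(map_line("not a valid record"))
```

Keeping the parsing in a function that returns a value, rather than printing directly, makes the malformed-line behavior easy to test.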
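import sys

def reducer():
    # Input arrives sorted by key, so all records for one store are adjacent.
    salesTotal = 0
    oldKey = None
    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue  # skip malformed lines
        thisKey, thisSale = data
        # A new key means the previous store is finished: emit its total.
        if oldKey and oldKey != thisKey:
            print("{0}\t{1}".format(oldKey, salesTotal))
            salesTotal = 0
        oldKey = thisKey
        salesTotal += float(thisSale)
    # Emit the total for the last store.
    if oldKey is not None:
        print("{0}\t{1}".format(oldKey, salesTotal))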
NYC | 12.00 |
NYC | 13.11 |
LA | 11.23 |
LA | 12.34 |
LA | 11.98 |
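The sorted records above are the kind of input a reducer sees: all rows for one key arrive together. As an alternative to tracking the previous key by hand, the same totals can be sketched with `itertools.groupby`, which also relies on the input being sorted by key. The function name `reduce_sorted` is hypothetical, and the sample lines assume tab-separated key and value.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Yield (key, total) for 'key<TAB>value' lines already sorted by key."""
    pairs = (line.strip().split("\t") for line in lines)
    valid = (pair for pair in pairs if len(pair) == 2)  # drop malformed lines
    # groupby only merges adjacent equal keys, hence the sorted-input requirement.
    for key, group in groupby(valid, key=lambda pair: pair[0]):
        yield key, sum(float(value) for _, value in group)


# The five sample records from the slide, as tab-separated lines.
sample = ["NYC\t12.00", "NYC\t13.11", "LA\t11.23", "LA\t12.34", "LA\t11.98"]
for store, total in reduce_sorted(sample):
    print("{0}\t{1:.2f}".format(store, total))
```

If the input were not sorted, a store whose rows were split apart would be reported more than once, which is why the shuffle-and-sort stage comes before the reducers.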