Hadoop on EMR
Barrett Strausser
@FuncBearrito
Navigation
- Main Flow use left and right keys
- Code, Configuration and Miscellany use up and down
- If you want an overview hit 'ESC'
Lots of MATERIAL
- What is MapReduce
- Setup on AWS
- Basics of Mappers and Reducer by example
- Another MapReduce Example
- HIVE
- PIG
Lots of material. Each topic deserves a more thorough treatment.
Some non-Pythonic code. I tried to keep things language agnostic.
Map Reduce
- An algorithm not a framework
- Might better be called emit and aggregate
- It is about selecting or projecting or MAPPING
- And then aggregating or summing or REDUCING
- It is functional.
- You've seen it before. Who writes python...?
- Grok the below nonsense for great learning....
Types:
map (k1,v1) => list((k2,v2))
reduce (k2,list(v2)) => list(v2)
Example - Simple Web Index
Types:
map (url,content) => list(word,url)
reduce (word,list(url)) => list((word,set(url)))
- Input a webpage model of form (url,content)
- Map(url, content) to (word, url) by splitting content into words.
- Reduce (word, list(url)) by emitting (word,set(url))
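The web-index example above can be simulated in a few lines of plain Python. This is a toy, in-memory sketch, not Hadoop code; the page URLs and helper names are made up for illustration:

```python
from collections import defaultdict

def map_phase(url, content):
    # map(url, content) => list((word, url)): split content into words
    return [(word, url) for word in content.split()]

def reduce_phase(word, urls):
    # reduce(word, list(url)) => (word, set(url))
    return (word, set(urls))

pages = {
    "http://a.example": "hadoop runs mapreduce",
    "http://b.example": "mapreduce is functional",
}

# The "shuffle": group every mapped pair by its key (the word).
grouped = defaultdict(list)
for url, content in pages.items():
    for word, u in map_phase(url, content):
        grouped[word].append(u)

index = dict(reduce_phase(w, us) for w, us in grouped.items())
```

Each word now maps to the set of URLs containing it, which is exactly the inverted index the types describe.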
For the visual learners
Hadoop FrameWork
- Wait just a minute.... What was that sort phase
- Hadoop will GUARANTEE that all tuples with the same key will be reduced by the same reducer instance AND in sorted order
- Meaning, you will have everything you need locally to aggregate on a GIVEN KEY
- Hadoop WILL NOT GUARANTEE that a reducer receives only one key.
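What the sort guarantee buys you: once input arrives sorted by key, a single pass can fully aggregate each key before moving on. A minimal sketch (the tuples are made up; `itertools.groupby` stands in for the framework handing you sorted runs):

```python
from itertools import groupby
from operator import itemgetter

pairs = [("Machine-2", 1.0), ("Machine-1", 0.5), ("Machine-1", 1.5), ("Machine-2", 3.0)]
pairs.sort(key=itemgetter(0))  # what Hadoop's sort phase does for you

# One pass: every tuple for a given key is adjacent, so each group
# can be aggregated completely and locally.
sums = {key: sum(v for _, v in grp) for key, grp in groupby(pairs, key=itemgetter(0))}
```

Note that one reducer may still see several such groups back to back, which is why reducer code has to detect key changes.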
More In-Depth
- Input reader - reads from a stable source : HDFS, S3, SomeNoSql
- Mapper - splits input into (key,value)
- Partition Function - Function(key) => Reducer label
- Comparison - Sort Phase
- Reduce - Applies a function to each input ONCE. May or may not output result.
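The partition function is the piece that routes keys to reducers. Hadoop's default hashes the key and takes it modulo the reducer count; the sketch below imitates that with `zlib.crc32` as a stable stand-in hash (illustrative only, not Hadoop's actual hash):

```python
import zlib

def partition(key, num_reducers):
    # Function(key) => reducer label: a stable hash mod the reducer count,
    # so the same key always lands on the same reducer.
    return zlib.crc32(key.encode()) % num_reducers

labels = [partition("Machine-%d" % i, 4) for i in (1, 2, 3)]
```

Determinism is the important property: without it, tuples for one key would scatter across reducers and the local-aggregation guarantee would break.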
More visuals
Hadoop
- MapReduce + HDFS
- HDFS = Hadoop Distributed File System
- I mostly get by ignoring the details of trackers and job nodes, as I'm more concerned with productivity than optimal settings
- Your mileage may vary
Hadoop
- JobTracker to which client applications submit jobs.
- JobTracker is rack-aware
- TaskTracker - Runs jobs in separate JVM
- Data is replicated 3x (the default replication factor)
- Heartbeat between JobTracker and TaskTracker
AWS
- I prefer Amazon EMR (Elastic Map Reduce)
- My data is already there (S3)
- I can use tools I know (boto)
- Integrates well with the other AWS products (dynamodb)
AWS Tool Setup
- Setup is by no means exhaustive
- Just the minimal steps needed
vi ~/.bashrc   # or ~/.bash_profile
export EC2_HOME=...
export EC2_PRIVATE_KEY=pk-foo.pem
export EC2_CERT=cert-foo.pem
export EC2_KEYPAIR=...
export EC2_URL=...
$ sudo apt-get install python python-pip
$ pip install boto
Our TASK
- We have (sensors, logs, whatever) that output timeseries of records
- Each record consists of three different signals (channels)
- Our task is to estimate the statistical distribution of the channels given some belief about their distribution
- WHY ALL THE MATH? Easy to check your answers
Example Schema
- Machine Name = Identifier of Sensor. Ranging from 1-3
- Record_Date = Timestamp of sensor reading.
- Channel_1 = Normally Distributed N(i-1,1). i = machine id (the generator below draws with mean i % 3)
- Channel_2 = Exponentially Distributed with lambda =3
- Channel_3 = Lognormally Distributed LN(0,2)
Machine-1 2012-11-28 22:42:00 0.1722 0.108 6.2504
Machine-2 2012-11-28 22:42:00 0.0185 0.3336 3.316
Machine-3 2012-11-28 22:42:00 1.6843 0.2725 1.3314
Machine-1 2012-11-28 22:42:00 0.1482 0.1422 0.3965
Model Code
import datetime
import random

with open('/home/me/Git/EMR-DEMO/code/resources/mapper_input', 'w+') as f:
    for i in range(10000):
        input = "Machine-" + str((i % 3) + 1) + '\t'
        input += str(datetime.datetime.today().replace(second=0, microsecond=0)) + '\t'
        input += str(round(random.normalvariate((i % 3), 1), 4)) + '\t'
        input += str(round(random.expovariate(3), 4)) + '\t'
        input += str(round(random.lognormvariate(0, 2), 4)) + '\n'
        f.write(input)
Mapper
- Read the files in from STDIN
- Split on '\t' to find the key and values
- Your input format determines how you parse
- Process the values however you wish
- Emit a (key, value) pair by printing to STDOUT
Code
#!/usr/bin/env python
# encoding: utf-8
import sys
import decimal

def some_function(sensor_record):
    numerical_record = []
    for record in sensor_record:
        scaled = decimal.Decimal(record).quantize(decimal.Decimal('.001'), rounding=decimal.ROUND_DOWN)
        numerical_record.append(scaled)
    return ''.join(map(lambda x: str(x) + ' ', numerical_record))

def record_cleaner(record):
    record = record.strip()
    (key, date, c1, c2, c3) = record.split('\t')
    values = [c1, c2, c3]
    return (key, values)

def process(count, record):
    count += 1
    (key, values) = record_cleaner(record)
    transformed_data = some_function(values)
    output = '%s\t%s:%s' % (key, count, transformed_data)
    print output
    return count

count = 0
for record in sys.stdin:
    count = process(count, record)
Reducer
- The reducer is very similar to the mapper
- You read in from StdIn
- You split on '\t' into key and value
- You then process the records how you wish
- Write the results to StdOut
Reducer WalkThrough
- Start reading in records
import sys
import decimal
import math

machine_name = None
print >> sys.stderr, 'Starting'
for record in sys.stdin:
    if record.strip():
        (k, i, v) = record_cleaner(record)
        if machine_name is None:
            record_count = 0
            running_sum = [0, 0, 0]
            machine_name = k
        if machine_name != k:
            # Key changed: flush estimates for the previous machine
            estimates = estimate_parameters(record_count, running_sum)
            print '%s\t%s:%s:%s:%s' % (machine_name, estimates[0], estimates[1], estimates[2], estimates[3])
            machine_name = k
            record_count = 0
            running_sum = [0, 0, 0]
            (record_count, running_sum) = process(record_count, running_sum, record)
        else:
            (record_count, running_sum) = process(record_count, running_sum, record)
estimates = estimate_parameters(record_count, running_sum)
print '%s\t%s:%s:%s:%s' % (machine_name, estimates[0], estimates[1], estimates[2], estimates[3])
Reducer WalkThrough
- Core Logic
def process(record_count, running_sum, record):
    record_count += 1
    record = record_cleaner(record)
    data_tuple = record_to_datatuple(record[2])
    running_sum = compute_sum(running_sum, data_tuple)
    return (record_count, running_sum)
Reducer WalkThrough
- Split into key and value(s) based on file format
def record_cleaner(record):
    record = record.strip()
    split_record = record.split('\t')
    key = split_record[0]
    split_record = split_record[1].split(":")
    record_id = split_record[0]
    value = split_record[1]
    return (key, record_id, value)
Reducer WalkThrough
- Process into a list and do some numerical manipulation
def record_to_datatuple(record):
    split_record = record.split(" ")
    numerical_record = []
    for field in split_record:
        scaled = decimal.Decimal(field).quantize(decimal.Decimal('.001'), rounding=decimal.ROUND_DOWN)
        numerical_record.append(scaled)
    return numerical_record
Reducer WalkThrough
- Keep a running sum as we iterate
def compute_sum(running_sum, data_tuple):
    try:
        running_sum[0] += data_tuple[0]
        running_sum[1] += data_tuple[1]
        running_sum[2] += data_tuple[2]
    except Exception:
        print >> sys.stderr, 'error in running sum with tuple %s' % data_tuple
    return running_sum
Reducer WalkThrough
- Finally compute the estimates
def estimate_parameters(record_count, running_sum):
    normal_estimate = running_sum[0] / record_count
    exponential_estimate = 1 / (running_sum[1] / record_count)
    log_sum = running_sum[2]
    log_u = log_sum / record_count
    log_numerator = log_sum * log_sum - log_sum * log_u + log_u * log_u
    log_sigma = log_numerator / record_count
    return (normal_estimate, exponential_estimate, log_u, log_sigma)
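This is why the math is checkable: the first two estimates are method-of-moments style (the sample mean recovers a normal mean; the reciprocal of the sample mean recovers an exponential rate). A quick simulation under assumed parameters (mean 2 and lambda 3 here are illustrative, not the job's actual data):

```python
import random

random.seed(42)
n = 200000

# Sample mean of N(2, 1) draws should land near 2.
normal_mean = sum(random.normalvariate(2, 1) for _ in range(n)) / n

# 1 / sample-mean of Exp(lambda=3) draws should land near 3,
# since the exponential mean is 1/lambda.
exp_rate = 1 / (sum(random.expovariate(3) for _ in range(n)) / n)
```

The lognormal estimate is left aside here; the point is just that known generators let you verify reducer output by eye.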
Job Context
- Great. Let's burn some dust.
import boto
from boto.s3.key import Key
from boto.emr.step import StreamingStep
#you will need your own bucket
root_path = '/home/me/Git/EMR-DEMO/code/'
s3 = boto.connect_s3()
emr_demo_bucket = s3.create_bucket('bearrito.demos.emr')
emr_demo_bucket.set_acl('private')
Job Context
- Upload all our scripts
json_records = Key(emr_demo_bucket)
json_records.key = "input/hive/mapper_input"
json_records.set_contents_from_filename(root_path + 'resources/mapper_input')
input_key = Key(emr_demo_bucket)
input_key.key = "input/0/mapper_input"
input_key.set_contents_from_filename( root_path + 'resources/mapper_input')
mapper_key = Key(emr_demo_bucket)
mapper_key.key = "scripts/mapper_script.py"
mapper_key.set_contents_from_filename(root_path + 'src/EMRDemoMapper.py')
reducer_key = Key(emr_demo_bucket)
reducer_key.key = "scripts/reducer_script.py"
reducer_key.set_contents_from_filename( root_path + 'src/EMRDemoReducer.py')
Job Context
- Run and Monitor the Job.
demo_step = StreamingStep(name ='EMR Demo Example'
,mapper='s3://bearrito.demos.emr/scripts/mapper_script.py'
,reducer='s3://bearrito.demos.emr/scripts/reducer_script.py'
,input='s3://bearrito.demos.emr/input/0'
,output='s3://bearrito.demos.emr/output')
emr = boto.connect_emr()
jobid = emr.run_jobflow(name="EMR Example",log_uri='s3://bearrito.demos.logs',steps = [demo_step])
status = emr.describe_jobflow(jobid)
#log into AWS to monitor further
More EXAMPLES!
- The last example was focused on aggregation
- What about a search example?
- Given a target record, find the machine that generated the record closest to it
- Our approach is going to work in phases
- Given inputs, emit a (machine_name, distance) pair
- For each machine_name, emit the smallest distance
- Collapse the set of min distances onto a single key
- Loop over and emit the smallest
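The phases above can be sketched in memory before writing any streaming code. A toy run with made-up distances (phase 0 keeps the minimum per machine; phase 1 projects everything onto one key and keeps the overall winner):

```python
phase0_input = [
    ("Machine-1", 0.9), ("Machine-1", 0.4),
    ("Machine-2", 0.7), ("Machine-3", 1.2),
]

# Phase 0 reduce: smallest distance per machine.
per_machine = {}
for machine, dist in phase0_input:
    if machine not in per_machine or dist < per_machine[machine]:
        per_machine[machine] = dist

# Phase 1 map: project every per-machine minimum onto the single key 'agg'.
agg = [("agg", (machine, dist)) for machine, dist in per_machine.items()]

# Phase 1 reduce: one key, so one reducer sees everything; take the global min.
winner = min((pair for _, pair in agg), key=lambda p: p[1])
```

The single-key projection is the trick: it forces all candidates through one reducer, which is cheap here because there are only as many candidates as machines.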
Mapper Phase 0
- Given a target emit a machine name and distance.
#!/usr/bin/env python
# encoding: utf-8
import decimal
import math
import sys

def record_to_decimal(sensor_record):
    numerical_record = []
    for record in sensor_record:
        scaled = decimal.Decimal(record).quantize(decimal.Decimal('.001'), rounding=decimal.ROUND_DOWN)
        numerical_record.append(scaled)
    return numerical_record

def record_cleaner(record):
    record = record.strip()
    (super_key, record_date, c1, c2, c3) = record.split('\t')
    key = super_key[super_key.index('Machine'):]
    values = [c1, c2, c3]
    return (key, values)

def compute_distance(target, data):
    d0 = (target[0] - data[0]) ** 2
    d1 = (target[1] - data[1]) ** 2
    d2 = (target[2] - data[2]) ** 2
    return math.sqrt(d0 + d1 + d2)

def process(target, record):
    (key, values) = record_cleaner(record)
    transformed_data = record_to_decimal(values)
    distance = compute_distance(target, transformed_data)
    output = '%s\t%s' % (key, distance)
    print output

target = (decimal.Decimal('-1.53'), decimal.Decimal('0.144'), decimal.Decimal('1.99'))
for record in sys.stdin:
    process(target, record)
Reducer Phase 0
- For a given key(machine) emit the closest record
#!/usr/bin/env python
# encoding: utf-8
import decimal
import sys

def process(min_machine, min_distance, record):
    (key, value) = record.split("\t")
    data = decimal.Decimal(value).quantize(decimal.Decimal('.001'), rounding=decimal.ROUND_DOWN)
    if min_machine is None:
        min_machine = key
        min_distance = None
    if min_machine != key:
        # Key changed: flush the minimum for the previous machine
        print "%s\t%s" % (min_machine, min_distance)
        min_machine = key
        min_distance = None
    if min_distance is None or data < min_distance:
        min_distance = data
    return (min_machine, min_distance)

min_distance = None
min_machine = None
for record in sys.stdin:
    (min_machine, min_distance) = process(min_machine, min_distance, record)
print "%s\t%s" % (min_machine, min_distance)
Mapper Phase 1
- Given all the min keys, project them onto a common key.
#!/usr/bin/env python
# encoding: utf-8
import sys

for record in sys.stdin:
    (key, value) = record.strip().split('\t')
    print "%s\t%s:%s" % ('agg', key, value)
Reducer Phase 1
- Loop over all min keys and emit the smallest
#!/usr/bin/env python
# encoding: utf-8
import decimal
import sys

def process(min_machine, min_distance, record):
    try:
        (key, value) = record.split('\t')
        (machine, distance) = value.split(':')
        data = decimal.Decimal(distance).quantize(decimal.Decimal('.001'), rounding=decimal.ROUND_DOWN)
        if min_distance is None or data < min_distance:
            min_distance = data
            min_machine = machine
    except Exception:
        pass  # skip malformed records
    return (min_machine, min_distance)

min_distance = None
min_machine = None
for record in sys.stdin:
    if record.strip():
        (min_machine, min_distance) = process(min_machine, min_distance, record)
print "%s\t%s" % (min_machine, min_distance)
Job Setup
- Setup our code
- Notice the bootstrap step
#!/usr/bin/env python
# encoding: utf-8
import boto
from boto.s3.key import Key
from boto.emr.step import StreamingStep
from boto.emr.bootstrap_action import BootstrapAction
root_path = '/home/barrett/Git/EMR-DEMO/code/'
s3 = boto.connect_s3()
emr_demo_bucket = s3.create_bucket('bearrito.demos.emr')
emr_demo_bucket.set_acl('private')
input_key = Key(emr_demo_bucket)
input_key.key = "input/0/mapper_input"
input_key.set_contents_from_filename(root_path + 'resources/mapper_input')
mapper_key = Key(emr_demo_bucket)
mapper_key.key = "scripts/bootstrap.sh"
mapper_key.set_contents_from_filename(root_path + 'src/BootStrap.sh')
bootstrap_step = BootstrapAction("bootstrap.sh",'s3://bearrito.demos.emr/scripts/bootstrap.sh',None)
mapper_key.key = "scripts/mapper_nearest_0.py"
mapper_key.set_contents_from_filename(root_path + 'src/EMRNearestMapper0.py')
mapper_key.key = "scripts/mapper_nearest_1.py"
mapper_key.set_contents_from_filename(root_path + 'src/EMRNearestMapper1.py')
reducer_key = Key(emr_demo_bucket)
reducer_key.key = "scripts/reducer_nearest_0.py"
reducer_key.set_contents_from_filename(root_path + 'src/EMRNearestReducer0.py')
reducer_key.key = "scripts/reducer_nearest_1.py"
reducer_key.set_contents_from_filename(root_path + 'src/EMRNearestReducer1.py')
nearest_0 = StreamingStep(name ='EMR First Phase'
,mapper='s3://bearrito.demos.emr/scripts/mapper_nearest_0.py'
,reducer='s3://bearrito.demos.emr/scripts/reducer_nearest_0.py'
,input='s3://bearrito.demos.emr/input/0'
,output='s3://bearrito.demos.emr/output/0')
nearest_1 = StreamingStep(name ='EMR Second Phase'
,mapper='s3://bearrito.demos.emr/scripts/mapper_nearest_1.py'
,reducer='s3://bearrito.demos.emr/scripts/reducer_nearest_1.py'
,input='s3://bearrito.demos.emr/output/0'
,output='s3://bearrito.demos.emr/output/1')
emr = boto.connect_emr()
jobid = emr.run_jobflow(name="EMR Two Phase"
,log_uri='s3://bearrito.demos.logs'
,steps = [nearest_0,nearest_1]
,bootstrap_actions=[bootstrap_step])
status = emr.describe_jobflow(jobid)
Bootstrap.sh
- Need to ensure that we have correct runtime deps
#!/bin/bash
wget http://python.org/ftp/python/2.7.2/Python-2.7.2.tar.bz2
tar xjf Python-2.7.2.tar.bz2
cd Python-2.7.2
./configure --with-threads --enable-shared
make
sudo make install
sudo ln -s /usr/local/lib/libpython2.7.so.1.0 /usr/lib/
sudo ln -s /usr/local/lib/libpython2.7.so /usr/lib/
Break TIME
- Any questions so far?
- We've put in a lot of work so far for not much to show
- Quirky syntax of mapper and reducer
- Creating input and output locations on S3
- Managing jobs directly
- We are going to look at abstractions that make this easier
HIVE Motivation
- Let's back up
- What does the below do in your SQL version
SELECT sum(channel_1), count(channel_1)
FROM sensor_records
GROUP BY machine_name
- It probably groups records in a recordset by a field
- Then computes the sum of a field and counts the number of records
MapReduce Formulation
- FROM sensor_records => Defining the S3 Bucket or HDFS location.
- GROUP BY machine_name => Becomes the MAP phase, emitting records with machine_name as key
- Example => ('Machine-1', .003)
- SELECT => Becomes the REDUCE phase, aggregating based on keys
- We've already done this....with much code and overhead
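The same GROUP BY / SUM / COUNT can be written as emit-and-aggregate in a few lines of plain Python, which is all Hive is compiling down to. The records below are illustrative (machine_name, channel_1) pairs:

```python
from collections import defaultdict

records = [("Machine-1", 0.2), ("Machine-2", 1.1), ("Machine-1", 0.4)]

# "Map" / GROUP BY: emit channel_1 keyed by machine_name.
groups = defaultdict(list)
for machine_name, channel_1 in records:
    groups[machine_name].append(channel_1)

# "Reduce" / SELECT: SUM and COUNT per key.
result = {k: (sum(v), len(v)) for k, v in groups.items()}
```

Ten lines here versus the scripts, uploads, and job plumbing of the previous sections, which is exactly the pitch for Hive.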
HIVE
- Hive is an abstraction layer on top of Hadoop that allows for SQL-like syntax and semantics.
- Hides much of the gory details
- Very configurable but generally works out of the box.
Getting STARTED
- I'm going to run this in interactive mode.
- Plenty of docs on this. RTFM and start a job.
- Then do:
ssh -i /home/barrett/EC2/MyKey.pem hadoop@ec2-xxx-xxx-xx-xx.compute.amazonaws.com
Hive CONSOLE
- Start a hive console
hadoop@ec2$ hive
hive>
hive> show tables;
OK
Time taken: 12.02 seconds
hive>
You should not have any tables yet
Partitions
- I performed a step I didn't show you
- I partitioned my sensor records by machine name
- I have an S3 bucket that looks like:
- Each directory has a file that only has records for that machine name.
Table Creation
- Why have partitions? Horizontal partitioning, like regular SQL. Now do...
CREATE EXTERNAL TABLE sensor_records (
dq_machine_name string, record_date string, channel_1 float, channel_2 float, channel_3 float)
PARTITIONED BY (machine_name string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://bearrito.demos.emr/input/hive/Sensor/';
Table Creation
- We partition by machine_name. Note that the partition path has the form column=value in our S3 bucket; since our bucket has machine_name=Machine-1 etc., we must use machine_name as the partition column.
- We indicate the fields are separated by the tab character
- And we point the location at the root of the partition folders. Now run...
ALTER TABLE sensor_records RECOVER PARTITIONS;
Queries
SELECT COUNT(*) FROM sensor_records;

-- These won't launch Hadoop jobs:
-- no real computation to perform when partitioned.
SELECT * FROM sensor_records LIMIT 10;

SELECT * FROM sensor_records
WHERE machine_name = 'Machine-1' LIMIT 10;

-- This will launch jobs. Why is that?
SELECT * FROM sensor_records
WHERE machine_name = 'Machine-1' AND channel_1 < 0.0 LIMIT 10;

-- Since we know the distribution of the channels by machine,
-- we can easily check that aggregation is correct.
-- Can anyone guess the result set? Think about the math.
SELECT machine_name, AVG(channel_1) FROM sensor_records GROUP BY machine_name;
ResultSet
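Before trusting the result set, the expected averages can be checked against the generator. Assuming the data was produced as in the model code earlier (channel_1 drawn from a normal with mean i % 3 for Machine-((i % 3) + 1)), the Hive AVG(channel_1) values should come out near 0, 1, and 2. A quick simulation (sample size is illustrative):

```python
import random

random.seed(7)
avgs = {}
for m in range(3):
    # m is the generator's mean for Machine-(m + 1), per the model code.
    samples = [random.normalvariate(m, 1) for _ in range(50000)]
    avgs["Machine-%d" % (m + 1)] = sum(samples) / len(samples)
```

If the Hive output disagrees materially with these means, something in the pipeline (parsing, partitioning, or aggregation) is off.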
Pig Motivation
- Different computation model from HIVE
- Hive is conceptually more SQL like
- PIG is more of a data flow language
- Pig allows for a dataflow pipeline
- Allows for splits in the pipeline
- Reads and writes more like a programming language, which appeals to some
Starting PIG
- Start an interactive job, as with Hive
- Log on to the master node
$ pig
> pwd
> cd s3://bearrito.demos.emr/input/0
> ls
Loading data
- Import our data
- Describe it
- Illustrate it
SENSOR_RECORDS = LOAD 's3://bearrito.demos.emr/input/0'
>as (machine_name:chararray,record_date:chararray,channel_1:float,channel_2:float,channel_3:float);
Queries
- Grouping
GRP_SR = GROUP SENSOR_RECORDS BY machine_name;
AVG_GRP_SR = FOREACH GRP_SR GENERATE group ,AVG(SENSOR_RECORDS.channel_1);
Queries
- Sampling + Filtering
FILTERED_C3_SR = FILTER SENSOR_RECORDS BY channel_3 > 10;
SMPL_FLT_C3_SR = SAMPLE FILTERED_C3_SR .10;
Queries
- Bags
- Collapsing values to tuples and expanding back
- Fields aren't necessarily atomic.
TUPLE_SR = FOREACH SENSOR_RECORDS
>GENERATE machine_name,
>TOTUPLE (channel_1,channel_2,channel_3) as channel_tuple;
FLAT_TUPLE_SR = FOREACH TUPLE_SR
>GENERATE machine_name, FLATTEN(channel_tuple) ;
Syntax
- Showing syntax is boring for me and you
- Read the docs at http://pig.apache.org/docs/r0.11.0/basic.html
- It basically reads and writes like SQL
Diagnostics
- DUMP
- DESCRIBE
- ILLUSTRATE
- EXPLAIN
What's NEXT
- Come to my next talk!
- Setting up Hadoop and PIG Locally
- Optimizing PIG
- I'll cover User Defined Functions in PIG
- Testing UDF's in PIG
- The coolest thing since sliced bread : SCALDING