Barrett Strausser
@FuncBearrito
# Append EC2 credentials to whichever profile file your shell sources:
sudo vi ~/.bashrc || sudo vi ~/.bash_profile

# No spaces around '=' in shell assignments — 'export VAR = x' is an error.
export EC2_HOME=...                 # path to the EC2 API tools
export EC2_PRIVATE_KEY=pk-foo.pem   # X.509 private key downloaded from AWS
export EC2_CERT=cert-foo.pem        # matching X.509 certificate
export EC2_KEYPAIR=...              # name of your EC2 SSH key pair
export EC2_URL=...                  # regional EC2 endpoint URL

# apt-get takes space-separated package names, not commas; pip is packaged
# as 'python-pip' on Debian/Ubuntu.
sudo apt-get install python python-pip
pip install boto
Machine-1 2012-11-28 22:42:00 0.1722 0.108 6.2504
Machine-2 2012-11-28 22:42:00 0.0185 0.3336 3.316
Machine-3 2012-11-28 22:42:00 1.6843 0.2725 1.3314
Machine-1 2012-11-28 22:42:00 0.1482 0.1422 0.3965
import datetime
import random

def make_record(i):
    """Build one tab-separated sensor record for machine (i % 3) + 1.

    Channels: Normal(mu=i % 3, sigma=1), Exponential(rate=3) and
    LogNormal(mu=0, sigma=2), each rounded to 4 decimal places.
    The original prepended a stray space to the machine name; the sample
    data shows no such space, so it is dropped here.
    """
    fields = [
        'Machine-' + str((i % 3) + 1),
        str(datetime.datetime.today().replace(second=0, microsecond=0)),
        str(round(random.normalvariate((i % 3), 1), 4)),
        str(round(random.expovariate(3), 4)),
        str(round(random.lognormvariate(0, 2), 4)),
    ]
    return '\t'.join(fields) + '\n'

if __name__ == '__main__':
    # Write 10,000 synthetic records for the streaming demo.
    with open('/home/me/Git/EMR-DEMO/code/resources/mapper_input', 'w+') as f:
        for i in range(10000):
            f.write(make_record(i))
#!/usr/bin/env python
# encoding: utf-8
import sys
import decimal
def some_function(sensor_record):
    """Quantize each channel reading to 3 decimal places, truncating.

    Args:
        sensor_record: iterable of numeric strings (one per channel).
    Returns:
        Space-separated string of the truncated Decimal values.
    """
    quantum = decimal.Decimal('.001')
    scaled = [decimal.Decimal(r).quantize(quantum, rounding=decimal.ROUND_DOWN)
              for r in sensor_record]
    # Single-space join: the original appended a trailing space that
    # downstream consumers then had to strip off.
    return ' '.join(str(s) for s in scaled)
def record_cleaner(record):
    """Split one tab-separated input line into (machine_key, channel values).

    Expected shape: '<machine>\\t<date>\\t<c1>\\t<c2>\\t<c3>'.
    """
    record = record.strip()
    (key, date, c1, c2, c3) = record.split('\t')
    values = [c1, c2, c3]
    return (key, values)
def process(count, record):
    """Emit one mapper output line and return the updated record counter.

    Output shape: '<machine_key>' TAB '<count>:<quantized channels>'.
    """
    count += 1
    (key, values) = record_cleaner(record)
    transformed_data = some_function(values)
    # print() as a function keeps the script runnable under Python 3;
    # the original Python-2-only statement form fails to parse there.
    print('%s\t%s:%s' % (key, count, transformed_data))
    return count
# Drive the mapper: number every stdin record and emit it transformed.
count = 0
for record in sys.stdin:
    count = process(count, record)
import sys
import decimal
import math
# NOTE(review): this driver calls record_cleaner/process/estimate_parameters,
# which appear later in this listing — in the runnable script the function
# definitions must precede this loop. (Unused 'key' and 'machine' variables
# from the original were dropped.)
machine_name = None
record_count = 0
running_sum = [0, 0, 0]
print('Starting', file=sys.stderr)
for record in sys.stdin:
    if not record:
        continue
    (k, _i, _v) = record_cleaner(record)
    if machine_name is None:
        # First record of the stream: start accumulating for this machine.
        machine_name = k
        record_count = 0
        running_sum = [0, 0, 0]
    if machine_name != k:
        # Hadoop streaming delivers records sorted by key, so a key change
        # means the previous machine's group is complete: emit and reset.
        estimates = estimate_parameters(record_count, running_sum)
        print('%s\t%s:%s:%s:%s' % (machine_name, estimates[0], estimates[1],
                                   estimates[2], estimates[3]))
        machine_name = k
        record_count = 0
        running_sum = [0, 0, 0]
    (record_count, running_sum) = process(record_count, running_sum, record)

# Flush the final group; the guard makes an empty input stream emit nothing
# instead of raising on undefined accumulators.
if machine_name is not None:
    estimates = estimate_parameters(record_count, running_sum)
    print('%s\t%s:%s:%s:%s' % (machine_name, estimates[0], estimates[1],
                               estimates[2], estimates[3]))
def process(record_count, running_sum, record):
    """Fold one mapper output line into the per-machine running sums.

    Returns the updated (record_count, running_sum) pair.
    """
    record_count += 1
    cleaned = record_cleaner(record)
    # cleaned is (key, id, value); only the value payload holds channel data.
    data_tuple = record_to_datatuple(cleaned[2])
    running_sum = compute_sum(running_sum, data_tuple)
    return (record_count, running_sum)
def record_cleaner(record):
    """Parse one mapper output line into (machine_key, record_id, value).

    Expected shape: '<key>\\t<id>:<space-separated channel values>'.
    """
    record = record.strip()
    fields = record.split('\t')
    key = fields[0]
    payload = fields[1].split(":")
    # 'record_id' rather than the original 'id', which shadows the builtin.
    record_id = payload[0]
    value = payload[1]
    return (key, record_id, value)
def record_to_datatuple(record):
    """Convert a space-separated value string into a list of Decimals,
    each truncated (ROUND_DOWN) to three decimal places."""
    quantum = decimal.Decimal('.001')
    # Loop variable renamed: the original reused 'record', shadowing the arg.
    return [decimal.Decimal(field).quantize(quantum, rounding=decimal.ROUND_DOWN)
            for field in record.split(' ')]
def compute_sum(running_sum, data_tuple):
    """Element-wise add the three channel values into running_sum.

    Malformed tuples are logged to stderr and otherwise ignored — this
    preserves the original best-effort behaviour, but with named exception
    types instead of a bare 'except'. A short tuple still leaves the
    elements added before the failure in place, as before.
    """
    try:
        running_sum[0] += data_tuple[0]
        running_sum[1] += data_tuple[1]
        running_sum[2] += data_tuple[2]
    except (IndexError, TypeError, decimal.InvalidOperation):
        print('error in running sum with tuple %s' % data_tuple,
              file=sys.stderr)
    return running_sum
def estimate_parameters(record_count, running_sum):
    """Turn per-machine channel sums into distribution parameter estimates.

    Returns (normal_mean, exponential_rate, lognormal_mu, lognormal_sigma).
    Raises ZeroDivisionError when record_count is 0 (callers guard this).
    """
    # Sample mean estimates the normal channel's mu.
    normal_estimate = running_sum[0] / record_count
    # MLE of an exponential rate is the reciprocal of the sample mean.
    exponential_estimate = 1 / (running_sum[1] / record_count)
    # NOTE(review): these lognormal estimates operate on the raw value sums,
    # not sums of logs — presumably deliberate for the demo, but verify
    # against the standard lognormal MLE before reusing.
    log_sum = running_sum[2]
    log_u = log_sum / record_count
    log_numerator = log_sum * log_sum - log_sum * log_u + log_u * log_u
    log_sigma = log_numerator / record_count
    return (normal_estimate, exponential_estimate, log_u, log_sigma)
import boto
from boto.s3.key import Key
from boto.emr.step import StreamingStep
#you will need your own bucket
# --- Upload demo input and streaming scripts to S3, then launch an EMR job ---
root_path = '/home/me/Git/EMR-DEMO/code/'
s3 = boto.connect_s3()
# create_bucket returns the existing bucket if it is already owned here.
emr_demo_bucket = s3.create_bucket('bearrito.demos.emr')
emr_demo_bucket.set_acl('private')
# Copy of the input under a separate prefix for the later Hive demo.
json_records = Key(emr_demo_bucket)
json_records.key = "input/hive/mapper_input"
json_records.set_contents_from_filename(root_path + 'resources/mapper_input')
# Input consumed by the streaming step below.
input_key = Key(emr_demo_bucket)
input_key.key = "input/0/mapper_input"
input_key.set_contents_from_filename( root_path + 'resources/mapper_input')
# Mapper and reducer scripts for Hadoop streaming.
mapper_key = Key(emr_demo_bucket)
mapper_key.key = "scripts/mapper_script.py"
mapper_key.set_contents_from_filename(root_path + 'src/EMRDemoMapper.py')
reducer_key = Key(emr_demo_bucket)
reducer_key.key = "scripts/reducer_script.py"
reducer_key.set_contents_from_filename( root_path + 'src/EMRDemoReducer.py')
# One streaming step wiring the mapper + reducer to S3 input/output prefixes.
demo_step = StreamingStep(name ='EMR Demo Example'
,mapper='s3://bearrito.demos.emr/scripts/mapper_script.py'
,reducer='s3://bearrito.demos.emr/scripts/reducer_script.py'
,input='s3://bearrito.demos.emr/input/0'
,output='s3://bearrito.demos.emr/output')
emr = boto.connect_emr()
# run_jobflow returns the job-flow id; poll describe_jobflow for progress.
jobid = emr.run_jobflow(name="EMR Example",log_uri='s3://bearrito.demos.logs',steps = [demo_step])
status = emr.describe_jobflow(jobid)
#log into AWS to monitor further
#!/usr/bin/env python
# encoding: utf-8
import decimal
import math
import sys
def record_to_decimal(sensor_record):
    """Truncate each channel string to a 3-decimal-place Decimal.

    Args:
        sensor_record: iterable of numeric strings.
    Returns:
        List of Decimals quantized with ROUND_DOWN.
    """
    quantum = decimal.Decimal('.001')
    return [decimal.Decimal(value).quantize(quantum, rounding=decimal.ROUND_DOWN)
            for value in sensor_record]
def record_cleaner(record):
    """Split a raw sensor line into (machine_key, [c1, c2, c3]).

    The first tab field may carry a prefix before the literal 'Machine'
    substring (presumably added by an upstream step — verify against the
    actual input); everything from 'Machine' onward becomes the key.
    Raises ValueError when 'Machine' is absent from the first field.
    """
    record = record.strip()
    (super_key, record_date, c1, c2, c3) = record.split('\t')
    key = super_key[super_key.index('Machine'):]
    values = [c1, c2, c3]
    return (key, values)
def compute_distance(target, data):
    """Euclidean distance between two 3-channel points (returns a float)."""
    squares = [(target[i] - data[i]) ** 2 for i in range(3)]
    return math.sqrt(sum(squares))
def process(target, record):
    """Emit '<machine>' TAB '<distance from target>' for one input line."""
    (key, values) = record_cleaner(record)
    transformed_data = record_to_decimal(values)
    distance = compute_distance(target, transformed_data)
    # print() form keeps the script runnable under Python 3; the original
    # Python-2-only print statement does not parse there.
    print('%s\t%s' % (key, distance))
# Build the target point from strings: Decimal(-1.53) constructed from a
# float carries binary round-off (-1.5300000000000000266…); the string
# form is exact.
target = (decimal.Decimal('-1.53'), decimal.Decimal('0.144'), decimal.Decimal('1.99'))
for record in sys.stdin:
    process(target, record)
#!/usr/bin/env python
# encoding: utf-8
import decimal
import sys
def process(min_machine, min_distance, record):
    """Track the minimum distance seen for the current machine key.

    Input arrives sorted by key; when the key changes, the finished
    machine's minimum is printed and tracking restarts. Returns the
    updated (min_machine, min_distance) pair.
    """
    (key, value) = record.split("\t")
    # strip() drops the trailing newline before Decimal parsing.
    data = decimal.Decimal(value.strip()).quantize(
        decimal.Decimal('.001'), rounding=decimal.ROUND_DOWN)
    if min_machine is None:
        # Very first record of the stream.
        min_machine = key
        min_distance = None
    if min_machine != key:
        # Key boundary: flush the previous machine's minimum.
        print("%s\t%s" % (min_machine, min_distance))
        min_machine = key
        min_distance = None
    if min_distance is None or data < min_distance:
        min_distance = data
    return (min_machine, min_distance)
min_distance = None
min_machine = None
for record in sys.stdin:
    (min_machine, min_distance) = process(min_machine, min_distance, record)
# Flush the last machine's minimum; the guard avoids printing 'None\tNone'
# when no input arrived at all.
if min_machine is not None:
    print("%s\t%s" % (min_machine, min_distance))
#!/usr/bin/env python
# encoding: utf-8
import sys
for record in sys.stdin:
    # Strip the trailing newline so the re-emitted line is not double-spaced
    # (print adds its own newline).
    (key, value) = record.rstrip('\n').split('\t')
    # Funnel every record onto the single 'agg' key so one reducer sees
    # all machine minima and can pick the global winner.
    print("%s\t%s:%s" % ('agg', key, value))
#!/usr/bin/env python
# encoding: utf-8
import decimal
import sys
def process(min_machine, min, record):
    """Fold one 'agg\\t<machine>:<distance>' line into the running minimum.

    Returns the updated (min_machine, min) pair. Malformed lines are
    skipped, preserving the original best-effort behaviour but with named
    exception types instead of a bare 'except'. NOTE(review): the 'min'
    parameter shadows the builtin; kept to preserve the signature.
    """
    try:
        (key, value) = record.split('\t')
        (machine, distance) = value.split(':')
        data = decimal.Decimal(distance).quantize(
            decimal.Decimal('.001'), rounding=decimal.ROUND_DOWN)
        if min is None or data < min:
            min = data
            min_machine = machine
    except (ValueError, decimal.InvalidOperation):
        # Unparseable record: keep the current minimum and move on.
        pass
    return (min_machine, min)
min = None
min_machine = None
for record in sys.stdin:
    # The original guard was 'record != None or record != ""', which is
    # always true; plain truthiness expresses the intended filter.
    if record:
        (min_machine, min) = process(min_machine, min, record)
print("%s\t%s" % (min_machine, min))
#!/usr/bin/env python
# encoding: utf-8
import boto
from boto.s3.key import Key
from boto.emr.step import StreamingStep
from boto.emr.bootstrap_action import BootstrapAction
# Local checkout root; adjust to your own path.
root_path = '/home/barrett/Git/EMR-DEMO/code/'
s3 = boto.connect_s3()
# create_bucket returns the existing bucket if it is already owned here.
emr_demo_bucket = s3.create_bucket('bearrito.demos.emr')
emr_demo_bucket.set_acl('private')
# Input consumed by the first streaming step.
input_key = Key(emr_demo_bucket)
input_key.key = "input/0/mapper_input"
input_key.set_contents_from_filename(root_path + 'resources/mapper_input')
# Bootstrap script runs on every node before any step starts.
mapper_key = Key(emr_demo_bucket)
mapper_key.key = "scripts/bootstrap.sh"
mapper_key.set_contents_from_filename(root_path + 'src/BootStrap.sh')
bootstrap_step = BootstrapAction("bootstrap.sh",'s3://bearrito.demos.emr/scripts/bootstrap.sh',None)
# NOTE(review): the same Key object is re-pointed for each upload; every
# set_contents_from_filename call uses the .key assigned just above it.
mapper_key.key = "scripts/mapper_nearest_0.py"
mapper_key.set_contents_from_filename(root_path + 'src/EMRNearestMapper0.py')
mapper_key.key = "scripts/mapper_nearest_1.py"
mapper_key.set_contents_from_filename(root_path + 'src/EMRNearestMapper1.py')
reducer_key = Key(emr_demo_bucket)
reducer_key.key = "scripts/reducer_nearest_0.py"
reducer_key.set_contents_from_filename(root_path + 'src/EMRNearestReducer0.py')
reducer_key.key = "scripts/reducer_nearest_1.py"
reducer_key.set_contents_from_filename(root_path + 'src/EMRNearestReducer1.py')
# Phase 1: per-machine minimum distance to the target point.
nearest_0 = StreamingStep(name ='EMR First Phase'
,mapper='s3://bearrito.demos.emr/scripts/mapper_nearest_0.py'
,reducer='s3://bearrito.demos.emr/scripts/reducer_nearest_0.py'
,input='s3://bearrito.demos.emr/input/0'
,output='s3://bearrito.demos.emr/output/0')
# Phase 2: global minimum over phase 1's output (chained via output/0).
nearest_1 = StreamingStep(name ='EMR Second Phase'
,mapper='s3://bearrito.demos.emr/scripts/mapper_nearest_1.py'
,reducer='s3://bearrito.demos.emr/scripts/reducer_nearest_1.py'
,input='s3://bearrito.demos.emr/output/0'
,output='s3://bearrito.demos.emr/output/1')
emr = boto.connect_emr()
# run_jobflow returns the job-flow id; poll describe_jobflow for progress.
jobid = emr.run_jobflow(name="EMR Two Phase"
,log_uri='s3://bearrito.demos.logs'
,steps = [nearest_0,nearest_1]
,bootstrap_actions=[bootstrap_step])
status = emr.describe_jobflow(jobid)
# Bootstrap action: build and install Python 2.7 from source on the node.
wget http://python.org/ftp/python/2.7.2/Python-2.7.2.tar.bz2
tar xjf Python-2.7.2.tar.bz2   # extract (x) bzip2 (j) archive file (f)
cd Python-2.7.2
./configure --with-threads --enable-shared
make
sudo make install
# Expose the shared library where the dynamic linker looks for it.
sudo ln -s /usr/local/lib/libpython2.7.so.1.0 /usr/lib/
# Original linked into '/usr/' — the .so belongs in /usr/lib/ as well.
sudo ln -s /usr/local/lib/libpython2.7.so /usr/lib/
-- Per-machine aggregate of channel_1. Note: GROUP BY is two words;
-- 'GROUPBY' is a syntax error in HiveQL.
SELECT sum(channel_1), count(channel_1)
FROM sensor_records
GROUP BY machine_name
ssh -i /home/barrett/EC2/MyKey.pem hadoop@ec2-xxx.xxx.xx.xx.compute.amazonaws.com
hadoop@ec2$ hive
hive>
hive> show tables;
OK
Time Taken: 12.02 seconds
hive>
You should not have any tables yet
-- External table over the tab-delimited sensor data already in S3;
-- dropping an EXTERNAL table leaves the underlying files intact.
CREATE EXTERNAL TABLE sensor_records (
dq_machine_name string, record_date string, channel_1 float, channel_2 float, channel_3 float)
-- Partitioned by machine so per-machine queries prune to one S3 prefix.
PARTITIONED BY (machine_name string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://bearrito.demos.emr/input/hive/Sensor/';
-- Register the partition directories that already exist under LOCATION
-- (Amazon EMR Hive extension).
ALTER TABLE sensor_records RECOVER PARTITIONS;
SELECT COUNT(*) FROM sensor_records;

-- HiveQL comments use '--', not '//'.
-- These won't launch Hadoop jobs:
-- no real computation to perform when partitioned.
SELECT * FROM sensor_records LIMIT 10;

SELECT * FROM sensor_records
WHERE machine_name = 'Machine-1' LIMIT 10;

-- This will launch jobs. Why is that?
SELECT * FROM sensor_records
WHERE machine_name = 'Machine-1' AND channel_1 < 0.0 LIMIT 10;

-- Since we know the distribution of the channels by machine,
-- we can easily check that aggregation is correct.
-- Can anyone guess the result set? Think about the math.
SELECT machine_name, AVG(channel_1) FROM sensor_records GROUP BY machine_name;
$ pig
> pwd
> cd s3://bearrito.demos.emr/input/0
> ls
-- Load the tab-separated sensor records (PigStorage defaults to '\t').
-- The original listing was truncated mid-schema ('channel_3:floa') and
-- missed several statement-terminating semicolons.
SENSOR_RECORDS = LOAD 's3://bearrito.demos.emr/input/0'
    AS (machine_name:chararray, record_date:chararray,
        channel_1:float, channel_2:float, channel_3:float);

-- Average channel_1 per machine.
GRP_SR = GROUP SENSOR_RECORDS BY machine_name;
AVG_GRP_SR = FOREACH GRP_SR GENERATE group, AVG(SENSOR_RECORDS.channel_1);

-- Keep only large channel_3 readings, then take a 10% sample.
FILTERED_C3_SR = FILTER SENSOR_RECORDS BY channel_3 > 10;
SMPL_FLT_C3_SR = SAMPLE FILTERED_C3_SR .10;

-- Pack the three channels into a tuple, then flatten it back out.
TUPLE_SR = FOREACH SENSOR_RECORDS
    GENERATE machine_name,
             TOTUPLE(channel_1, channel_2, channel_3) AS channel_tuple;
FLAT_TUPLE_SR = FOREACH TUPLE_SR
    GENERATE machine_name, FLATTEN(channel_tuple);