DEVELOPING WITH PIG

Barrett Strausser

@FuncBearrito

Miscellany

Git repo at : https://github.com/bearrito/scala-pig

Hit your esc key to get the talk overview

Make sure to notice you can scroll down

If you want to create your own local hadoop/pig setup you can use my chef recipe at : https://github.com/bearrito/muhdata

Talk Assumptions

Some level of Hadoop.
You understand mappers, reducers and combiners.
Some SQL knowledge, so terms like join, group and aggregation make sense.
Maybe a bit of Java.

A way to run Hadoop jobs
Could be a local Hadoop setup.
Could be up on Elastic Map Reduce.
I can help you with either.
Some way to run PIG code.

Why PIG (for me)?

Abstraction is the one true way.

Too much overhead in writing Java code

I like the DAG capabilities of PIG compared to HIVE.

I wanted to get a baseline understanding of PIG vs something like Scalding in terms of pure productivity.

My Goals for This Talk

Write PIG LATIN

Debug PIG scripts

Write PIG UDFs

Unit Test PIG UDFs

Discuss Scalding

Our Task

We have (sensors, logs, whatever) that output time series of records

Each record consists of three different signals (channels)

Our task is to estimate the statistical distribution of the channels given some belief about their distribution

Why all the math? It makes it easy to check your answers

Pig Motivation

Different computation model than HIVE
Hive is conceptually more SQL like
PIG is more of a data flow language
If you were smart you could talk about PIG as a monad or category or something

Allows for splits in the pipeline
Can do snapshots of data.
Can hook in developer code.
Reads and writes more like a programming language, which has its own appeal

Example Schema

Machine Name = Identifier of Sensor. Ranging from 1-3
Record_Date = Timestamp of sensor reading.
Channel_1 = Normally Distributed N(i,1). i = machine id
Channel_2 = Exponentially Distributed with lambda = 3
Channel_3 = Lognormally Distributed LN(0,2)

  Machine-1    2012-11-28 22:42:00   0.1722  0.108   6.2504
  Machine-2    2012-11-28 22:42:00   0.0185  0.3336  3.316
  Machine-3    2012-11-28 22:42:00   1.6843  0.2725  1.3314
  Machine-1    2012-11-28 22:42:00   0.1482  0.1422  0.3965
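
A minimal sketch, in plain Python rather than Pig, of how records with this shape could be generated for a local test. It assumes tab-separated fields and reads the 2 in LN(0,2) as sigma; `make_record` is a made-up helper, not part of the repo.

```python
import random


def make_record(machine_id, timestamp, rng):
    """Build one tab-separated sensor record for the given machine."""
    channel_1 = rng.gauss(machine_id, 1)      # N(i, 1), i = machine id
    channel_2 = rng.expovariate(3)            # exponential, lambda = 3
    channel_3 = rng.lognormvariate(0, 2)      # LN(0, 2), assuming 2 is sigma
    fields = [
        "Machine-%d" % machine_id,
        timestamp,
        "%.4f" % channel_1,
        "%.4f" % channel_2,
        "%.4f" % channel_3,
    ]
    return "\t".join(fields)


rng = random.Random(42)  # fixed seed so runs are repeatable
records = [make_record(i, "2012-11-28 22:42:00", rng) for i in (1, 2, 3)]
for line in records:
    print(line)
```

Because the parameters are known up front, the estimates coming out of the Pig scripts can be checked against them.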

First Steps

$ /opt/pig/bin/pig
> pwd
> ls
> copyFromLocal /home/hduser/mapper_input /user/hduser/mapper_input
> sensors = load '/user/hduser/mapper_input' AS (machine_name:chararray,record_date:chararray,channel_1:float,channel_2:float,channel_3:float);
> DESCRIBE sensors;
> ILLUSTRATE sensors;

The LOAD command, well... it loads. We will come back to it later.
The AS clause tells Pig to interpret each line according to the given schema. You have most of the Java primitives you would expect.
DESCRIBE gives you the schema representation of the relation (bag of tuples)
ILLUSTRATE performs some computation to give you a sample. Similar to SELECT * FROM sensors LIMIT 0,1
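
To make the AS clause concrete, here is a rough plain-Python model of what "interpret each line according to the schema" means. It assumes tab-delimited input; `parse_line` and `describe` are illustrative names, not Pig APIs.

```python
SCHEMA = [
    ("machine_name", str),
    ("record_date", str),
    ("channel_1", float),
    ("channel_2", float),
    ("channel_3", float),
]


def parse_line(line, schema=SCHEMA, delimiter="\t"):
    """Split one input line and cast each field to its declared type."""
    raw = line.rstrip("\n").split(delimiter)
    return tuple(cast(value) for (_, cast), value in zip(schema, raw))


def describe(schema=SCHEMA):
    """Render the schema roughly the way DESCRIBE prints it."""
    names = {str: "chararray", float: "float"}
    body = ",".join("%s: %s" % (alias, names[t]) for alias, t in schema)
    return "sensors: {%s}" % body


row = parse_line("Machine-1\t2012-11-28 22:42:00\t0.1722\t0.108\t6.2504")
print(describe())
print(row)
```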

LOADING

Currently able to load from :
HDFS (of course)
S3
CASSANDRA
I think dynamodb ...
HBase

Has support for parsing:
PigStorage - Supports different delimiters and schema
TextLoader - Every tuple is a single string
json
avro
Custom UDFs

Queries

> positive_channel_1 = FILTER sensors BY channel_1 > 0;
> ILLUSTRATE positive_channel_1;

> just_channel_2 = FOREACH sensors GENERATE channel_2;
> ILLUSTRATE just_channel_2;

> large_channel_3 = ORDER sensors BY channel_3 DESC;
> top_ten_channel_3 = LIMIT large_channel_3 10;
> subset = SAMPLE sensors .001;
> DUMP subset;

FILTER works by accepting a predicate
ORDER and LIMIT work as expected
SAMPLE works by keeping each tuple independently with probability p, in this case .001. The selected tuples are placed in a new relation
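
The same operators modelled on an in-memory list, as a sketch in plain Python (not how Pig executes them). The rows reuse the example data; `sample` is a hypothetical helper.

```python
import random


def sample(relation, p, rng):
    """Keep each tuple independently with probability p."""
    return [t for t in relation if rng.random() < p]


# (machine_name, channel_1, channel_2, channel_3)
sensors = [
    ("Machine-1", 0.1722, 0.108, 6.2504),
    ("Machine-2", 0.0185, 0.3336, 3.316),
    ("Machine-3", 1.6843, 0.2725, 1.3314),
    ("Machine-1", 0.1482, 0.1422, 0.3965),
]

# FILTER sensors BY channel_1 > 0
positive_channel_1 = [t for t in sensors if t[1] > 0]
# FOREACH sensors GENERATE channel_2
just_channel_2 = [(t[2],) for t in sensors]
# ORDER sensors BY channel_3 DESC
large_channel_3 = sorted(sensors, key=lambda t: t[3], reverse=True)
# LIMIT large_channel_3 2
top_two_channel_3 = large_channel_3[:2]
# SAMPLE sensors 0.5
subset = sample(sensors, 0.5, random.Random(0))
```

Note that the size of a sample is itself random: p = .001 means roughly one tuple in a thousand, not exactly that many.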

WORKFLOW

> SPLIT sensors INTO machine_three IF machine_name==' Machine-3', machine_two IF machine_name==' Machine-2', machine_one IF (machine_name == ' Machine-1'); 
> ... computation on each partition
> biased_sample = UNION ONSCHEMA (UNION ONSCHEMA (SAMPLE machine_one .01) , (SAMPLE machine_two .05)) , (SAMPLE machine_three .10); 
> STORE biased_sample INTO 'biased_sample_checkpoint';

SPLIT works by routing the tuples of a single relation into two or more relations, one per condition.
Don't use SPLITS in place of FILTERS
We now have a graph of relations. We just as easily could have loaded multiple relations.
We merge the multiple relations back into a single relation using UNION. ONSCHEMA performs the union by field name and not position.
We then STORE the data as a checkpoint
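
A sketch of the split-then-union shape in plain Python, with rows as dicts so the "by field name" part of ONSCHEMA is visible. `split` and `union_onschema` are illustrative helpers, and the rows are trimmed-down versions of the example data.

```python
def split(relation, conditions):
    """Route each row to every output whose condition it satisfies."""
    outputs = {name: [] for name in conditions}
    for row in relation:
        for name, predicate in conditions.items():
            if predicate(row):
                outputs[name].append(row)
    return outputs


def union_onschema(*relations):
    """Union rows by field name; fields a row lacks become None."""
    fields = []
    for relation in relations:
        for row in relation:
            for name in row:
                if name not in fields:
                    fields.append(name)
    return [{name: row.get(name) for name in fields}
            for relation in relations for row in relation]


sensors = [
    {"machine_name": "Machine-1", "channel_1": 0.1722},
    {"machine_name": "Machine-2", "channel_1": 0.0185},
    {"machine_name": "Machine-3", "channel_1": 1.6843},
]
parts = split(sensors, {
    "machine_one": lambda r: r["machine_name"] == "Machine-1",
    "machine_two": lambda r: r["machine_name"] == "Machine-2",
    "machine_three": lambda r: r["machine_name"] == "Machine-3",
})
merged = union_onschema(parts["machine_one"], parts["machine_three"])
```

A row that matches no condition lands nowhere, and one that matches several lands in each, which is one reason not to treat SPLIT as a drop-in FILTER.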

Queries

Grouping

> sensor_group = GROUP sensors BY machine_name;
> channel_one_sum_and_avg = FOREACH sensor_group GENERATE group, AVG(sensors.channel_1), SUM(sensors.channel_1);

Fairly straightforward: we GROUP BY a key, here machine_name

We can then use the usual assortment of aggregation operators, such as MIN, MAX, SUM and AVG, on the group

Note that we must "dot in" (sensors.channel_1) because each group tuple holds the key plus an inner bag of the original tuples.
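
The shape GROUP produces, sketched in plain Python: one (group, bag) pair per key, with the aggregates computed over a field of the inner bag. `group_by` is a made-up helper and the numbers are chosen so the sums are exact.

```python
from collections import defaultdict


def group_by(relation, key):
    """Return (group, bag) pairs, one per distinct key value."""
    bags = defaultdict(list)
    for row in relation:
        bags[row[key]].append(row)
    return sorted(bags.items())


sensors = [
    {"machine_name": "Machine-1", "channel_1": 1.0},
    {"machine_name": "Machine-2", "channel_1": 5.0},
    {"machine_name": "Machine-1", "channel_1": 3.0},
]

sensor_group = group_by(sensors, "machine_name")
channel_one_sum_and_avg = [
    (
        group,
        sum(r["channel_1"] for r in bag) / len(bag),  # AVG(sensors.channel_1)
        sum(r["channel_1"] for r in bag),             # SUM(sensors.channel_1)
    )
    for group, bag in sensor_group
]
```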

OPERATORS WE MISSED

DIFF - Computes a BAG (set) difference

COGROUP - Makes Hadoop cry :{

DISTINCT

String operators you would expect - indexOf, substring, split...

Math operators you would expect - logs, trig functions...

UDF

Why? Well, we need to extend Pig

What is a UDF in Pig?

They are Java classes that implement a certain interface

Python is possible but I've never done it.

They receive tuples and output values

They can be arbitrarily complex

Kinds of UDFs

Eval UDF - Takes a tuple, returns a value (which may itself be a tuple).

Filter UDF - Takes a tuple returns a boolean.

Eval UDF's implementing ALGEBRAIC interface

Eval UDF's implementing ACCUMULATOR interface

STORE Funcs

Load Funcs

EVAL Funcs

All evaluation functions extend the Java class org.apache.pig.EvalFunc.

This class uses Java generics.

It is parameterized by the return type of your UDF.

As input it takes a tuple, which contains all of the fields the script passes to your UDF.

It returns the type by which you parameterized EvalFunc.

DATE TO MILLIS

Have a look at the udf barrett.udf.DateToMillis

This class extends EvalFunc and its purpose is to transform some datetime formats into a long value.

Let's look at the code....

The exec method is thin

Tried to push everything into other classes where I don't deal with the interfaces
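
The thin-exec idea, sketched in plain Python rather than the Scala in the repo. This is not the code of barrett.udf.DateToMillis: the format list is a guess, and UTC is assumed for dates without a zone.

```python
import calendar
from datetime import datetime

# Hypothetical format list; the formats the real UDF accepts are not shown here.
FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y%m%d")


def parse_to_millis(text):
    """Return milliseconds since the epoch (UTC), or None if nothing matches."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        return calendar.timegm(parsed.timetuple()) * 1000
    return None


def exec_udf(input_tuple):
    """Thin entry point: unpack the tuple, delegate, return the value."""
    if not input_tuple or input_tuple[0] is None:
        return None
    return parse_to_millis(input_tuple[0])


print(exec_udf(("2012-11-28 22:42:00",)))
```

All of the parsing lives in `parse_to_millis`, which knows nothing about tuples, so it can be tested without touching any Pig types.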

Unit Testing DateToMillis

Do as much as I can in true unit test style

Push much of the testing onto the companion object

In tests like "dateToMilli udf parses date correctly", I pass in a Tuple to the UDF and assert on the return value.

I feel like this is the largest test that still maintains the UDF as the SUT.

Exploit Databags and PigTuples

Outputting Schema

What if we want to output something more complex?

Emitting whole rows for instance?

When your UDF returns a bag or a tuple you will need to implement outputSchema if you want Pig to understand the contents of that bag or tuple.

SCHEMAS

Schema is an array of FieldSchema

FieldSchema is a 3-tuple of (Alias,Type,Schema)

Alias is the field name.

Type is the Pig DataType

Notice Schema/FieldSchema is recursive. This is how non-atomic values are achieved.
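
The recursion is easier to see written out. A plain-Python model of the (Alias, Type, Schema) triple follows; it mirrors the idea only, not the actual Pig classes, and `render` is a made-up helper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FieldSchema:
    alias: str
    type: str  # e.g. "chararray", "float", "tuple", "bag"
    # Nested schema, set only for non-atomic types (tuples and bags).
    schema: Optional[Tuple["FieldSchema", ...]] = None


def render(field):
    """Render a field, descending into nested schemas."""
    if field.schema is None:
        return "%s: %s" % (field.alias, field.type)
    inner = ",".join(render(f) for f in field.schema)
    open_, close = ("{", "}") if field.type == "bag" else ("(", ")")
    return "%s: %s%s%s" % (field.alias, open_, inner, close)


row = FieldSchema("row", "tuple", (
    FieldSchema("machine_name", "chararray"),
    FieldSchema("channel_1", "float"),
))
rows = FieldSchema("rows", "bag", (row,))
print(render(rows))
```

A bag of tuples is just a field whose schema contains a field whose schema contains the atomic fields.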

WEATHER PARSER

Wanted to look at lots of NOAA Data - 14GB uncompressed

Required to take the input data with too many fields and reduce it down to just the required fields

barrett.udf.SimpleWeatherParser...

Specs in 'VerifyWeatherRecordsAreParsed'

ALGEBRAIC

What if we want to implement an aggregative function like MAX, SUM or AVG ?
Implement the ALGEBRAIC interface like in barrett.udf.GeometricMean

Algebraic functions work in three stages
Initial - Called with a Bag with a single Tuple during the map phase
Intermediate - Called with a Bag with multiple Tuples during the combine phase
Final - Called with a Bag with multiple Tuples emitted from the combine phase. Typically wraps all the values up in some way.
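
The three stages for a geometric mean, sketched as plain Python functions. This is one way to decompose it (carry a sum of logs and a count), not necessarily how barrett.udf.GeometricMean does it, and it assumes strictly positive inputs.

```python
import math


def initial(value):
    """Map side: turn one value into a partial (sum of logs, count)."""
    return (math.log(value), 1)


def intermediate(partials):
    """Combine side: merge partials into one partial of the same shape."""
    return (sum(p[0] for p in partials), sum(p[1] for p in partials))


def final(partials):
    """Reduce side: merge what is left and produce the answer."""
    log_sum, count = intermediate(partials)
    return math.exp(log_sum / count) if count else None


values = [1.0, 4.0, 16.0, 2.0, 8.0]
mapped = [initial(v) for v in values]
# Pretend two combiners each saw part of the data.
combined = [intermediate(mapped[:3]), intermediate(mapped[3:])]
result = final(combined)
print(result)
```

The key property is that `intermediate` returns the same shape it consumes, so it can run zero, one or many times without changing the result.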

LOAD FUNC

Pig's load function is built on top of an InputFormat, the class which Hadoop uses to read data.

InputFormat serves two functions: it determines how input will be split between map tasks, and it provides a RecordReader that produces key value pairs as input to those map tasks.

The load function takes these key value pairs and returns a Pig Tuple.

LOAD PROCESS

Get input format - defines how to read data from a file into the Mapper instances.

Find the storage location via the setLocation method - This is where you can expose data sources other than HDFS. We've already seen numerous examples.

Casting Function - ByteArray to Types - Needed if your storage is keeping the PigTuples as ByteArrays.

Reading Records - Use the recordReader to iterate over records. Terminates when null is encountered.
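
The read loop, modelled in plain Python with stand-in classes. `LineRecordReader` and `TabLoader` are toy names for illustration and do not match the Hadoop or Pig class signatures.

```python
class LineRecordReader:
    """Stand-in for a record reader yielding (key, line) pairs."""

    def __init__(self, lines):
        self._pairs = iter(enumerate(lines))

    def next_key_value(self):
        # None once the input is exhausted.
        return next(self._pairs, None)


class TabLoader:
    """Stand-in for a load function: key/value pairs in, tuples out."""

    def __init__(self, reader):
        self.reader = reader

    def get_next(self):
        pair = self.reader.next_key_value()
        if pair is None:
            return None  # null signals the end of the input
        _key, line = pair
        return tuple(line.split("\t"))


loader = TabLoader(LineRecordReader(["Machine-1\t0.1722", "Machine-2\t0.0185"]))
tuples = []
while True:
    t = loader.get_next()
    if t is None:
        break
    tuples.append(t)
```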

BONUS

Add partitions if your storage mechanism supports it. I know S3 does as well as HBase.

In order for Pig to request the relevant partitions, it must know how the data is partitioned. Pig determines this by calling getPartitionKeys.

Add schema meta-data. Supports removing field names from load statements. Also is used in partitioning.

STORE FUNC

Basically the inverse of LOAD

The STORE/LOAD functions are critically important.

Unfortunately the code is depressingly boiler-plate so I didn't implement in Scala.

You can find plenty of examples in the PIG SVN Repo

This is a good reference as well -> http://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat

PIG UNIT

A framework for "unit testing" pig scripts

Let's run over the specs in UDFSpecs to see the options...

If it wasn't clear, asserts are based on text diffs

You can pass in existing pig scripts

You can pass in output and input as a file

Integration Testing DateToMillis

Look in barrett.udf.UDFSpecs
Look at the test titled "VerifyDateToMillisParser"
Ignore the args val for now.
Notice how the script val looks like a pig script!
Our input val is going to be aliased into the datesAsYYYYMMDD relation, basically injecting it into the script.
When this is executed a local hadoop cluster will be initialized and the script run.
The dateToMillis relation will then be compared against our output val
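
What "compared against our output val" amounts to, as a plain-Python sketch: tuples are rendered to text and compared line by line. `render_tuple` and `assert_output` are illustrative and not the PigUnit API.

```python
def render_tuple(t):
    """Render a tuple as text, e.g. (19700102,86400000)."""
    return "(" + ",".join(str(field) for field in t) + ")"


def assert_output(actual_tuples, expected_lines):
    """Compare rendered tuples against the expected lines, as text."""
    actual_lines = [render_tuple(t) for t in actual_tuples]
    if actual_lines != expected_lines:
        raise AssertionError(
            "expected %r but got %r" % (expected_lines, actual_lines))


assert_output([("19700102", 86400000)], ["(19700102,86400000)"])
```

Because the comparison is textual, ordering and number formatting both matter in the expected output.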

Maybe a Workflow for Analysts

I'm a bit overzealous about CI and having all the business logic in source control and testable

But some (a lot?) of Hadoop users aren't developers, so writing JUnit tests might not work.

This is who PIG was originally developed for (analysts at Yahoo)

How can we enable analysts to have their script verified by CI facilities?

You could step outside of JUnit but then you have test results scattered everywhere.

My thought was to check in:

My Thoughts

Check in the Pig script, input and output to source control.

Use code generation to create individual pig unit tests for each pig triple (script,input, output)

Real reason : I want to play with Scala Macros

TIME FOR SOME NEXT LEVEL STUFF

I want to do iteration
I want to solve this problem : clustering
It's a common machine learning task
I don't want to use Mahout

Scalding

Scalding is an API wrapper for Cascading

Cascading is an abstraction of Hadoop details

You write your code in Scala and it is translated into the proper stuff..... maybe

Seems much easier to create complex workflows and associated user defined functions

Word Count Example

First example is the older Fields based version

I didn't play with this much as I'm most interested in static typing

TypedWordCount displays the power a bit more

Try dotting in on the val groups; you'll see that you can express quite a bit in a few lines

Let's see how many lines it takes us to write an EvalFunction...

Clustering

First, I get to create a rather extensive domain model in my native language

I'm transparently creating udfs as I write my object and class methods

I'm able to confirm through the compiler that my iteration will behave correctly. This is through the @tailrec annotation, which makes the compiler reject the method unless it is tail recursive.

The overhead for me to do this in PIG and certainly in Java would probably be too high
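
For a feel of the iteration involved, here is a tiny one-dimensional k-means in plain Python. The talk only says "clustering", so k-means is an assumption here, and this is a loop rather than the tail-recursive Scala version.

```python
def assign(points, centers):
    """Group points by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    return clusters


def update(clusters, centers):
    """Move each center to the mean of its cluster; keep it if empty."""
    return [sum(c) / len(c) if c else old for c, old in zip(clusters, centers)]


def kmeans(points, centers, max_iterations=100):
    """Iterate until the centers stop moving or the budget runs out."""
    for _ in range(max_iterations):
        new_centers = update(assign(points, centers), centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers


centers = kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
print(centers)
```

Each pass depends on the centers from the previous pass, which is exactly the kind of loop that is awkward to express in a single Pig script.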

DONE

That's it!