DEVELOPING WITH PIG


Barrett Strausser

@FuncBearrito

Miscellany

  • Git repo: https://github.com/bearrito/scala-pig


  • Hit your esc key to get the talk overview


  • Make sure to notice you can scroll down


  • If you want to create your own local Hadoop/Pig setup, you can use my Chef recipe at https://github.com/bearrito/muhdata

TALK ASSUMPTIONS

  • Some level of Hadoop knowledge.
  • You understand mappers, reducers, and combiners.
  • Some SQL knowledge, so phrases like joins, groups, and aggregations make sense.
  • Maybe a bit of Java.


  • A way to run Hadoop jobs:
  • Could be a local Hadoop setup.
  • Could be up on Elastic MapReduce.
  • I can help you with either.
  • Some way to run Pig code.

Why PIG (for me)?

  • Abstraction is the one true way. 


  • Too much overhead in writing Java MapReduce code


  • I like the DAG capabilities of PIG compared to HIVE.


  • I wanted to get a baseline understanding of PIG vs something like Scalding in terms of pure productivity.

MY GOALS FOR THIS TALK

  • Write PIG LATIN


  • Debug PIG scripts


  • Write PIG UDFs


  • Unit Test PIG UDFs


  • Discuss Scalding

OUR TASK

  • We have sources (sensors, logs, whatever) that output time series of records


  • Each record consists of three different signals (channels)


  • Our task is to estimate the statistical distribution of the channels given some belief about their distribution


  • WHY ALL THE MATH? Known distributions make it easy to check your answers




Pig Motivation

  • Different computation model than HIVE
  • Hive is conceptually more SQL like
  • PIG is more of a data flow language
  • If you were smart you could talk about PIG as a monad or category or something


  • Allows for splits in the pipeline
  • Can do snapshots of data.
  • Can hook in developer code.
  • Reads and writes more like a programming language, which appeals to developers


Example Schema

  • Machine_Name = Identifier of the sensor (Machine-1 through Machine-3)
  • Record_Date = Timestamp of the sensor reading
  • Channel_1 = Normally distributed N(i, 1), where i is the machine id
  • Channel_2 = Exponentially distributed with lambda = 3
  • Channel_3 = Lognormally distributed LN(0, 2)

  Machine-1    2012-11-28 22:42:00   0.1722  0.108   6.2504
  Machine-2    2012-11-28 22:42:00   0.0185  0.3336  3.316
  Machine-3    2012-11-28 22:42:00   1.6843  0.2725  1.3314
  Machine-1    2012-11-28 22:42:00   0.1482  0.1422  0.3965

FIRST STEPS

$ /opt/pig/bin/pig
> pwd
> ls
> copyFromLocal /home/hduser/mapper_input /user/hduser/mapper_input
> sensors = LOAD '/user/hduser/mapper_input' AS (machine_name:chararray, record_date:chararray, channel_1:float, channel_2:float, channel_3:float);
> DESCRIBE sensors;
> ILLUSTRATE sensors;
  • The LOAD command, well... it loads. We will see it again later.
  • The AS clause interprets each line according to the given schema. Most of the Java primitive types you would expect are available.
  • DESCRIBE gives you the schema of the relation (a bag of tuples).
  • ILLUSTRATE runs a small computation to show you a sample, similar to SELECT * FROM sensors LIMIT 1.

LOADING

  • Currently able to load from:
  • HDFS (of course)
  • S3
  • Cassandra
  • I think DynamoDB...
  • HBase


  • Has support for parsing:
  • PigStorage - supports different delimiters and schemas
  • Text - every tuple is a single string
  • JSON
  • Avro
  • Custom UDFs

Queries

  • DML
> positive_channel_1 = FILTER sensors BY channel_1 > 0;
> ILLUSTRATE positive_channel_1;

> just_channel_2 = FOREACH sensors GENERATE channel_2;
> ILLUSTRATE just_channel_2;

> large_channel_3 = ORDER sensors BY channel_3 DESC;
> top_ten_channel_3 = LIMIT large_channel_3 10;
> subset = SAMPLE sensors .001;
> DUMP subset;
  • FILTER works by accepting a predicate
  • ORDER and LIMIT work as expected
  • SAMPLE selects each tuple independently with probability p, here 0.001, and places the selected tuples in a new relation

WORKFLOW

  • DML
> SPLIT sensors INTO machine_three IF machine_name == 'Machine-3', machine_two IF machine_name == 'Machine-2', machine_one IF machine_name == 'Machine-1';
> ... computation on each partition
> sample_one = SAMPLE machine_one 0.01;
> sample_two = SAMPLE machine_two 0.05;
> sample_three = SAMPLE machine_three 0.10;
> biased_sample = UNION ONSCHEMA sample_one, sample_two, sample_three;
> STORE biased_sample INTO 'biased_sample_checkpoint';
  • SPLIT works by partitioning a single relation into one or more relations.
  • Don't use SPLIT in place of FILTER.
  • We now have a graph of relations. We could just as easily have loaded multiple relations.
  • We combine the multiple relations into a single one using UNION. The ONSCHEMA option performs the union by field name rather than by position.
  • We then STORE the data as a checkpoint.

Queries

  • Grouping 
> sensor_group = GROUP sensors BY machine_name;
> channel_one_sum_and_avg = FOREACH sensor_group GENERATE group, AVG(sensors.channel_1), SUM(sensors.channel_1);
  • Fairly straightforward: we GROUP BY a key, here machine_name


  • We can then use the usual assortment of aggregation operators, such as MIN, MAX, SUM, and AVG, on the group


  • Note that we must "dot in" (sensors.channel_1), because each group is a tuple containing an inner bag.

OPERATORS WE MISSED

  • DIFF - Computes a BAG (set) difference


  • COGROUP - Makes Hadoop cry :{


  • DISTINCT


  • String operators you would expect - indexOf, substring, split...


  • Math operators you would expect - logs, trig functions...

UDF

  • Why? Well, we need to extend Pig


  • What is a UDF  in Pig?


  • They are  Java  classes that implement a certain interface


  • Python is possible but I've never done it.


  • They receive tuples and output values


  • They can be arbitrarily complex

Kinds of UDFs

  • Eval UDF - Takes a tuple and returns a value.


  • Filter UDF - Takes a tuple, returns a boolean.


  • Eval UDFs implementing the ALGEBRAIC interface


  • Eval UDFs implementing the ACCUMULATOR interface


  • STORE Funcs


  • Load Funcs




EVAL Funcs

  • All evaluation functions extend the Java class org.apache.pig.EvalFunc. 


  • This class uses Java generics. 


  • It is parameterized by the return type of your UDF.


  • As input it takes a tuple, which contains all of the fields the script passes to your UDF. 


  • It returns the type by which you parameterized EvalFunc.

DATE TO MILLIS

  • Have a look at the udf barrett.udf.DateToMillis


  • This class extends EvalFunc and its purpose is to transform some datetime formats into a long value.


  • Let's look at the code....


  • The exec method is thin


  • I tried to push as much as possible into other classes that don't deal with the Pig interfaces (a rough sketch of the shape follows)
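  • The repo code isn't reproduced in these slides, so here is a minimal Scala sketch of what an EvalFunc like this can look like. The date pattern, null handling, and error handling are my assumptions, not the repo's code.

import java.io.IOException
import java.text.SimpleDateFormat
import org.apache.pig.EvalFunc
import org.apache.pig.data.Tuple

// Hypothetical sketch, not barrett.udf.DateToMillis itself.
class DateToMillis extends EvalFunc[java.lang.Long] {

  // Assumed input pattern, matching the sample data shown earlier.
  private val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

  override def exec(input: Tuple): java.lang.Long = {
    if (input == null || input.size() == 0 || input.get(0) == null) return null
    try {
      // Keep exec thin: parse and convert to epoch millis.
      format.parse(input.get(0).toString).getTime
    } catch {
      case e: Exception =>
        throw new IOException("Could not parse date: " + input.get(0), e)
    }
  }
}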

UNIT TESTING DateToMillis

  • Do as much as I can in true unit test style


  • Push much of the testing onto the companion object


  • In tests like "dateToMilli udf parses date correctly",  I pass in a Tuple to the UDF and assert on the return value.


  • I feel like this is the largest test that still maintains the UDF as the SUT.


  • Exploit DataBags and Pig Tuples (see the sketch below)
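  • A minimal ScalaTest sketch of that pattern, exercising the DateToMillis sketch above (the spec name and structure are assumptions, not the repo's UDFSpecs):

import java.text.SimpleDateFormat
import org.apache.pig.data.TupleFactory
import org.scalatest.FunSpec

class DateToMillisUnitSketch extends FunSpec {
  describe("DateToMillis") {
    it("parses a yyyy-MM-dd HH:mm:ss date into epoch millis") {
      val udf = new DateToMillis

      // Build a Pig Tuple by hand and pass it straight to exec.
      val tuple = TupleFactory.getInstance().newTuple(1)
      tuple.set(0, "2012-11-28 22:42:00")

      // Compute the expected value with the same pattern to stay timezone-safe.
      val expected = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
        .parse("2012-11-28 22:42:00").getTime

      assert(udf.exec(tuple) == expected)
    }
  }
}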


Outputting Schema

  • What if we want to output something more complex?


  • Emitting whole rows for instance?


  • When your UDF returns a bag or a tuple you will
    need to implement outputSchema if you want Pig to understand the contents of that bag
    or tuple.

SCHEMAS

  • Schema is an array of FieldSchema


  • FieldSchema is a 3-tuple of (Alias,Type,Schema)


  • Alias is the field name.


  • Type is the Pig DataType


  • Notice Schema/FieldSchema is recursive. This is how non-atomic values are achieved (see the sketch below).
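  • A hypothetical Scala sketch of the pattern (not a UDF from the repo): a function that emits a (machine_name, millis) tuple and declares that structure via outputSchema.

import org.apache.pig.EvalFunc
import org.apache.pig.data.{DataType, Tuple, TupleFactory}
import org.apache.pig.impl.logicalLayer.schema.Schema

// Emits a two-field tuple and tells Pig what is inside it.
class NamedMillis extends EvalFunc[Tuple] {

  override def exec(input: Tuple): Tuple = {
    val out = TupleFactory.getInstance().newTuple(2)
    out.set(0, input.get(0))                  // machine name passed through
    out.set(1, java.lang.Long.valueOf(0L))    // placeholder for a parsed millis value
    out
  }

  override def outputSchema(input: Schema): Schema = {
    // FieldSchema is (alias, type) -- or (alias, inner schema, type) for non-atomic values.
    val inner = new Schema()
    inner.add(new Schema.FieldSchema("machine_name", DataType.CHARARRAY))
    inner.add(new Schema.FieldSchema("millis", DataType.LONG))

    // Wrap the field list in a named TUPLE schema; this is the recursion in action.
    new Schema(new Schema.FieldSchema("named_millis", inner, DataType.TUPLE))
  }
}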

WEATHER PARSER

  • Wanted to look at lots of NOAA Data - 14GB uncompressed


  • The task: take input records with far more fields than needed and reduce them down to just the required fields


  • barrett.udf.SimpleWeatherParser...


  • Specs in 'VerifyWeatherRecordsAreParsed'


ALGEBRAIC

  • What if we want to implement an aggregate function like MAX, SUM, or AVG?
  • Implement the ALGEBRAIC interface, as in barrett.udf.GeometricMean


  • Algebraic functions work in three stages (see the sketch after this list):
  • Initial - Called during the map phase with a bag holding a single tuple
  • Intermediate - Called during the combine phase with a bag of multiple tuples
  • Final - Called during the reduce phase with a bag of the tuples emitted by the combine phase; typically wraps the partial values up into the final result
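  • A hypothetical Scala sketch of the three stages for a geometric mean (not the repo's barrett.udf.GeometricMean): each stage passes a (sum of logs, count) pair so partial results combine correctly.

import org.apache.pig.{Algebraic, EvalFunc}
import org.apache.pig.data.{DataBag, Tuple, TupleFactory}
import scala.collection.JavaConverters._

// Shared helpers for building and merging the (sumOfLogs, count) pairs.
object GeoMean {
  def pair(sum: Double, count: Long): Tuple = {
    val t = TupleFactory.getInstance().newTuple(2)
    t.set(0, sum); t.set(1, count); t
  }
  def combine(bag: DataBag): (Double, Long) =
    bag.iterator().asScala.foldLeft((0.0, 0L)) { (acc, t) =>
      (acc._1 + t.get(0).toString.toDouble, acc._2 + t.get(1).toString.toLong)
    }
}

// Initial: map phase, the bag holds a single raw value; emit (log x, 1).
class GeoMeanInitial extends EvalFunc[Tuple] {
  override def exec(input: Tuple): Tuple = {
    val x = input.get(0).asInstanceOf[DataBag].iterator().next().get(0).toString.toDouble
    GeoMean.pair(math.log(x), 1L)
  }
}

// Intermediate: combine phase, merge partial (sumOfLogs, count) pairs.
class GeoMeanIntermed extends EvalFunc[Tuple] {
  override def exec(input: Tuple): Tuple = {
    val (sum, count) = GeoMean.combine(input.get(0).asInstanceOf[DataBag])
    GeoMean.pair(sum, count)
  }
}

// Final: reduce phase, fold the combined pairs into the geometric mean.
class GeoMeanFinal extends EvalFunc[java.lang.Double] {
  override def exec(input: Tuple): java.lang.Double = {
    val (sum, count) = GeoMean.combine(input.get(0).asInstanceOf[DataBag])
    math.exp(sum / count)
  }
}

// The UDF the script references; exec here is the fallback path Pig uses
// when it cannot apply the combiner optimization.
class GeometricMean extends EvalFunc[java.lang.Double] with Algebraic {
  override def exec(input: Tuple): java.lang.Double = {
    val logs = input.get(0).asInstanceOf[DataBag].iterator().asScala
      .map(t => math.log(t.get(0).toString.toDouble)).toSeq
    math.exp(logs.sum / logs.size)
  }
  override def getInitial(): String = classOf[GeoMeanInitial].getName
  override def getIntermed(): String = classOf[GeoMeanIntermed].getName
  override def getFinal(): String = classOf[GeoMeanFinal].getName
}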

LOAD FUNC

  • Pig's load function is built on top of an InputFormat, the class Hadoop uses to read data.


  • InputFormat serves two functions: it determines how input will be split between map tasks, and it provides a RecordReader that produces key value pairs as input to those map tasks. 


  • The load function takes these key value pairs and returns a Pig Tuple.

LOAD PROCESS

  • Get the input format - defines how the data is read from the source into the Mapper instances.


  • Set the storage location via the setLocation method - this is where you can expose data sources other than HDFS. We've already seen numerous examples.


  • Casting function - converts bytearrays to typed values - needed if your loader hands fields to Pig as bytearrays.


  • Reading records - use the RecordReader to iterate over the records; returning null from getNext signals the end of the input. A sketch of the whole shape follows.
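  • A minimal Scala sketch of those steps, built on Hadoop's TextInputFormat (a hypothetical loader that turns each line into a one-field tuple; not code from the repo):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{InputFormat, Job, RecordReader}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.pig.LoadFunc
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit
import org.apache.pig.data.{Tuple, TupleFactory}

class LineLoader extends LoadFunc {

  private var reader: RecordReader[LongWritable, Text] = _

  // 1. The InputFormat decides how the input is split and read.
  override def getInputFormat(): InputFormat[_, _] = new TextInputFormat

  // 2. Where the data lives; this is the hook for exposing non-HDFS sources.
  override def setLocation(location: String, job: Job): Unit =
    FileInputFormat.setInputPaths(job, location)

  // 3. No custom casting is needed here because we emit chararrays directly.

  // Pig hands us the RecordReader before the first getNext call.
  override def prepareToRead(reader: RecordReader[_, _], split: PigSplit): Unit =
    this.reader = reader.asInstanceOf[RecordReader[LongWritable, Text]]

  // 4. Iterate the records; returning null tells Pig the input is exhausted.
  override def getNext(): Tuple = {
    if (!reader.nextKeyValue()) return null
    val t = TupleFactory.getInstance().newTuple(1)
    t.set(0, reader.getCurrentValue.toString)
    t
  }
}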



BONUS

  • Add partitions if your storage mechanism supports it. I know S3 does as well as HBase.


  • In order for Pig to request the relevant partitions, it must know how the data is partitioned. Pig determines this by calling getPartitionKeys.


  • Add schema metadata. This lets you omit field names from LOAD statements, and it is also used in partitioning.


STORE FUNC

  • Basically the inverse of LOAD


  • The STORE/LOAD functions are critically important.


  • Unfortunately the code is depressingly boilerplate, so I didn't implement one in Scala.


  • You can find plenty of examples in the PIG SVN Repo


  • This is a good reference as well -> http://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat




PIG UNIT

  • A framework for "unit testing" pig scripts


  • Let's run over the specs in UDFSpecs to see the options...


  • If it wasn't clear, the asserts are based on text diffs


  • You can pass in existing pig scripts


  • You can pass in output and input as a file


INTEGRATION TESTING DateToMillis

  • Look in barrett.udf.UDFSpecs
  • Look at the test titled "VerifyDateToMillisParser"
  • Ignore the args val for now.
  • Notice how the script val looks like a pig script!
  • Our input val is going to be aliased into the datesAsYYYYMMDD relation, basically injecting it into the script.  
  • When this is executed, a local Hadoop cluster is initialized and the script is run.
  • The dateToMillis relation will then be compared against our output val (a sketch of the pattern follows).
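  • A rough Scala sketch of that pattern against the PigUnit API; the script text, jar path, aliases, and expected value here are illustrative assumptions, not the repo's actual spec.

import org.apache.pig.pigunit.PigTest
import org.scalatest.FunSpec

class DateToMillisPigUnitSketch extends FunSpec {
  it("runs the script on a local cluster and diffs an output alias") {
    val script = Array(
      "REGISTER target/scala-pig.jar;",                          // assumed jar location
      "DEFINE DateToMillis barrett.udf.DateToMillis();",
      "datesAsYYYYMMDD = LOAD 'input' AS (d:chararray);",
      "dateToMillis = FOREACH datesAsYYYYMMDD GENERATE DateToMillis(d);")

    val input = Array("2012-11-28 22:42:00")
    val expected = Array("(1354142520000)")                      // epoch millis, assuming UTC

    // Injects input into the datesAsYYYYMMDD alias, runs the script,
    // and text-diffs the dateToMillis alias against expected.
    new PigTest(script).assertOutput("datesAsYYYYMMDD", input, "dateToMillis", expected)
  }
}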

MAYBE A WORKFLOW FOR ANALYSTS

  • I'm a bit overzealous about CI and having all the business logic in source control and testable


  • But some (a lot?) of Hadoop users aren't developers, so writing JUnit tests might not work.


  • This is what Pig was originally developed for (analysts at Yahoo)


  • How can we enable analysts to have their scripts verified by CI facilities?


  • You could step outside of JUnit but then you have test results scattered everywhere.


  • My thought was the following:

MY THOUGHTS

  • Check in the Pig script, its input, and its expected output into source control.


  • Use code generation to create an individual PigUnit test for each (script, input, output) triple


  • Real reason: I want to play with Scala macros

TIME FOR SOME NEXT LEVEL STUFF

  • I want to do iteration
  • I want to solve this problem: clustering
  • It's a common machine learning task
  • I don't want to use Mahout



Scalding

  • Scalding is an API wrapper for Cascading


  • Cascading is an abstraction of Hadoop details


  • You write your code in Scala and it is translated into the corresponding MapReduce jobs... maybe


  • Seems much easier to create complex workflows and associated user defined functions


WORD COUNT EXAMPLE

  • The first example is the older fields-based API version


  • I didn't play with this much as I'm most interested in static typing


  • TypedWordCount displays the power a bit more


  • Try dotting in on the val groups; you'll see that you can express quite a bit in a few lines (a typed sketch follows)


  • Let's see how many lines it takes us to write an EvalFunction...
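  • A minimal typed word count sketch, roughly the shape of TypedWordCount (the --input/--output args and the lower-casing are my assumptions):

import com.twitter.scalding._

class TypedWordCountSketch(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap(_.toLowerCase.split("\\s+"))   // one token per element
    .filter(_.nonEmpty)
    .groupBy(identity)                      // the grouped pipe is where you "dot in"
    .size                                   // count per word
    .write(TypedTsv[(String, Long)](args("output")))
}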

Clustering

  • First, I get to create a rather extensive domain model in my native language


  • I'm transparently creating UDFs as I write my object and class methods


  • I'm able to confirm through the compiler that my iteration will behave correctly, via the @tailrec annotation (sketched below)


  • The overhead for me to do this in PIG and certainly in Java would probably be too high
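  • A hypothetical sketch of the iteration shape (not the repo's domain model): the compiler rejects @tailrec if the recursive call is not in tail position, which is the compile-time guarantee mentioned above.

import scala.annotation.tailrec

case class Point(x: Double, y: Double)

object KMeansSketch {
  // Assign each point to its nearest centroid.
  def assign(points: Seq[Point], centroids: Seq[Point]): Map[Point, Seq[Point]] =
    points.groupBy(p => centroids.minBy(c => math.hypot(c.x - p.x, c.y - p.y)))

  // Recompute each cluster's centroid as the mean of its members.
  def recenter(clusters: Map[Point, Seq[Point]]): Seq[Point] =
    clusters.values.map(ps => Point(ps.map(_.x).sum / ps.size, ps.map(_.y).sum / ps.size)).toSeq

  // The compiler verifies this loop is tail recursive and compiles it to a loop.
  @tailrec
  def iterate(points: Seq[Point], centroids: Seq[Point], remaining: Int): Seq[Point] = {
    val next = recenter(assign(points, centroids))
    if (remaining == 0 || next.toSet == centroids.toSet) next
    else iterate(points, next, remaining - 1)
  }
}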


DONE

That's it!
