Developing with Pig
Barrett Strausser
@FuncBearrito
Miscellany
- Git repo at : https://github.com/bearrito/scala-pig
- Hit your esc key to get the talk overview
- Make sure to notice you can scroll down
- If you want to create your own local hadoop/pig setup you can use my chef recipe at : https://github.com/bearrito/muhdata
Talk Assumptions
- Some level of Hadoop.
- You understand mappers, reducers, and combiners.
- Some SQL knowledge, so phrases like Join, Group, and Aggregation make sense.
- Maybe a bit of Java.
- A way to run Hadoop jobs
- Could be a local Hadoop setup.
- Could be up on Elastic Map Reduce.
- I can help you with either.
- Some way to run Pig code.
Why Pig (for me)?
- Abstraction is the one true way.
- Too much overhead in writing Java code.
- I like the DAG capabilities of Pig compared to Hive.
- I wanted a baseline understanding of Pig vs. something like Scalding in terms of pure productivity.
My Goals for This Talk
- Write Pig Latin
- Debug Pig scripts
- Write Pig UDFs
- Unit test Pig UDFs
- Discuss Scalding
Our Task
- We have (sensors, logs, whatever) that output time series of records
- Each record consists of three different signals (channels)
- Our task is to estimate the statistical distribution of the channels, given some belief about their distribution
- Why all the math? It makes it easy to check your answers
Pig Motivation
- Different computation model than Hive
- Hive is conceptually more SQL-like
- Pig is more of a data-flow language
- If you were smart you could talk about Pig as a monad or category or something
- Allows for splits in the pipeline
- Can take snapshots of data.
- Can hook in developer code.
- Reads and writes more like a programming language, which appeals to developers
Example Schema
- machine_name = identifier of the sensor, ranging from 1-3
- record_date = timestamp of the sensor reading.
- channel_1 = normally distributed N(i,1), where i = machine id
- channel_2 = exponentially distributed with lambda = 3
- channel_3 = lognormally distributed LN(0,2)
Machine-1 2012-11-28 22:42:00 0.1722 0.108 6.2504
Machine-2 2012-11-28 22:42:00 0.0185 0.3336 3.316
Machine-3 2012-11-28 22:42:00 1.6843 0.2725 1.3314
Machine-1 2012-11-28 22:42:00 0.1482 0.1422 0.3965
First Steps
$ /opt/pig/bin/pig
> pwd
> ls
> copyFromLocal /home/hduser/mapper_input /user/hduser/mapper_input
> sensors = load '/user/hduser/mapper_input' AS (machine_name:chararray,record_date:chararray,channel_1:float,channel_2:float,channel_3:float);
> DESCRIBE sensors;
> ILLUSTRATE sensors;
- The LOAD command, well... it loads. We will see it again later.
- The AS operator means to interpret each line according to the given schema. You have most of the Java primitives you would expect.
- DESCRIBE gives you the schema representation of the relation (a bag of tuples)
- ILLUSTRATE performs some computation to give you a sample. Similar to SELECT * FROM sensors LIMIT 0,1
Loading
- Currently able to load from:
- HDFS (of course)
- S3
- Cassandra
- I think DynamoDB...
- HBase
- Has support for parsing:
- PigStorage - supports different delimiters and schemas
- Text - every tuple is a string
- JSON
- Avro
- Custom UDFs
Queries
- DML
> positive_channel_1 = FILTER sensors BY channel_1 > 0;
> ILLUSTRATE positive_channel_1;
> just_channel_2 = FOREACH sensors GENERATE channel_2;
> ILLUSTRATE just_channel_2;
> large_channel_3 = ORDER sensors BY channel_3 DESC;
> top_ten_channel_3 = LIMIT large_channel_3 10;
> subset = SAMPLE sensors .001;
> DUMP subset;
- FILTER works by accepting a predicate
- ORDER and LIMIT work as expected
- SAMPLE works by selecting each tuple at random with probability p, in this case .001. The selected tuples are placed in a new relation
Workflow
- DML
> SPLIT sensors INTO machine_three IF machine_name == 'Machine-3', machine_two IF machine_name == 'Machine-2', machine_one IF machine_name == 'Machine-1';
> ... computation on each partition
> biased_sample = UNION ONSCHEMA (UNION ONSCHEMA (SAMPLE machine_one .01) , (SAMPLE machine_two .05)) , (SAMPLE machine_three .10);
> STORE biased_sample INTO 'biased_sample_checkpoint';
- SPLIT works by partitioning a single relation into one or more relations.
- Don't use SPLIT in place of FILTER
- We now have a graph of relations. We could just as easily have loaded multiple relations.
- We project the multiple relations onto a single relation using UNION. ONSCHEMA performs the union by field name rather than by position.
- We then STORE the data as a checkpoint
Queries
- Grouping
> sensor_group = GROUP sensors BY machine_name;
> channel_one_sum_and_avg = FOREACH sensor_group GENERATE group, AVG(sensors.channel_1), SUM(sensors.channel_1);
- Fairly straightforward: we GROUP BY an expression
- We can then use the usual assortment of aggregation operators, such as MIN, MAX, SUM, and AVG, on the group
- Note that we must "dot in", as each group is a tuple with an inner bag.
Operators We Missed
- DIFF - computes a bag (set) difference
- COGROUP - a GROUP across multiple relations at once. Makes Hadoop cry :{
- DISTINCT
- String operators you would expect - indexOf, substring, split...
- Math operators you would expect - logs, trig functions...
UDFs
- Why? Well, we need to extend Pig
- What is a UDF in Pig?
- They are Java classes that implement a certain interface
- Python is possible but I've never done it.
- They receive tuples and output values
- They can be arbitrarily complex
Kinds of UDFs
- Eval UDF - takes a tuple, returns a value.
- Filter UDF - takes a tuple, returns a boolean.
- Eval UDFs implementing the Algebraic interface
- Eval UDFs implementing the Accumulator interface
- Store Funcs
- Load Funcs
Eval Funcs
- All evaluation functions extend the Java class org.apache.pig.EvalFunc.
- This class uses Java generics.
- It is parameterized by the return type of your UDF.
- As input it takes a tuple, which contains all of the fields the script passes to your UDF.
- It returns the type by which you parameterized EvalFunc.
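A minimal Eval UDF might look like the following sketch. The class name, behavior, and null handling here are my own illustration, not code from the talk's repo:

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical example: parameterized by String, so exec() must return a String.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Pig hands exec() one tuple per input row; guard against empty/null fields.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}
```

After REGISTERing the jar, you call it like a built-in: `FOREACH sensors GENERATE UpperCase(machine_name);` (use DEFINE or the fully qualified name if the class lives in a package).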
Date to Millis
- Have a look at the UDF barrett.udf.DateToMillis
- This class extends EvalFunc and its purpose is to transform some datetime formats into a long value.
- Let's look at the code...
- The exec method is thin
- I tried to push everything into other classes that don't deal with the Pig interfaces
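The parsing itself can live in a plain helper with no Pig types in sight, so exec only unpacks the tuple and delegates. A sketch of what that core might look like (the class name, method name, and format string are illustrative, not the repo's):

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class DateToMillisCore {
    // One plausible input format — matches the sample data shown earlier.
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Pure function: easy to unit test without constructing Pig tuples.
    public static long toMillis(String date) {
        return LocalDateTime.parse(date, FMT)
                            .toInstant(ZoneOffset.UTC)
                            .toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toMillis("2012-11-28 22:42:00"));
    }
}
```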
Unit Testing DateToMillis
- Do as much as I can in true unit-test style
- Push much of the testing onto the companion object
- In tests like "dateToMilli udf parses date correctly", I pass a Tuple into the UDF and assert on the return value.
- I feel like this is the largest test that still maintains the UDF as the SUT.
- Exploit DataBags and Tuples
Outputting Schema
- What if we want to output something more complex?
- Emitting whole rows, for instance?
- When your UDF returns a bag or a tuple, you will need to implement outputSchema if you want Pig to understand the contents of that bag or tuple.
Schemas
- A Schema is an array of FieldSchemas
- A FieldSchema is a 3-tuple of (alias, type, schema)
- The alias is the field name.
- The type is the Pig DataType
- Notice that Schema/FieldSchema is recursive. This is how non-atomic values are achieved.
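For instance, a UDF that emits a (name, millis) tuple might declare its output schema roughly like this sketch (field names invented for illustration; imports from org.apache.pig.data and org.apache.pig.impl.logicalLayer assumed):

```java
// Inside an EvalFunc subclass — sketch only.
@Override
public Schema outputSchema(Schema input) {
    try {
        Schema tupleSchema = new Schema();
        tupleSchema.add(new Schema.FieldSchema("machine_name", DataType.CHARARRAY));
        tupleSchema.add(new Schema.FieldSchema("millis", DataType.LONG));
        // A FieldSchema can itself carry a Schema — that's the recursion.
        return new Schema(new Schema.FieldSchema("parsed", tupleSchema, DataType.TUPLE));
    } catch (FrontendException e) {
        throw new RuntimeException(e);
    }
}
```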
Weather Parser
- Wanted to look at lots of NOAA data - 14 GB uncompressed
- Needed to take input data with too many fields and reduce it down to just the required fields
- barrett.udf.SimpleWeatherParser...
- Specs in 'VerifyWeatherRecordsAreParsed'
Algebraic
- What if we want to implement an aggregate function like MAX, SUM, or AVG?
- Implement the Algebraic interface, as in barrett.udf.GeometricMean
- Algebraic functions work in three stages
- Initial - called with a bag holding a single tuple, during the map phase
- Intermediate - called with a bag of multiple tuples, during the combine phase
- Final - called with a bag of the tuples emitted from the combine phase. Typically wraps all the values up in some way.
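The trick that makes a geometric mean algebraic is working in log space, where the partial state (sum of logs, count) combines associatively. Here is a self-contained sketch of the math behind the three stages; Pig's actual Algebraic plumbing (getInitial()/getIntermed()/getFinal() returning inner EvalFunc class names) is omitted:

```java
public class GeoMeanStages {
    // Initial: one input value -> partial state {sum of logs, count}.
    public static double[] initial(double x) {
        return new double[] { Math.log(x), 1.0 };
    }

    // Intermediate: combine two partial states (this is what the combiner runs).
    public static double[] intermed(double[] a, double[] b) {
        return new double[] { a[0] + b[0], a[1] + b[1] };
    }

    // Final: collapse the state into the answer (runs in the reducer).
    public static double finalStage(double[] s) {
        return Math.exp(s[0] / s[1]);
    }

    public static void main(String[] args) {
        double[] xs = { 1.0, 2.0, 4.0, 8.0 };
        double[] acc = initial(xs[0]);
        for (int i = 1; i < xs.length; i++) {
            acc = intermed(acc, initial(xs[i]));
        }
        // Prints the geometric mean of 1, 2, 4, 8.
        System.out.println(finalStage(acc));
    }
}
```

Because intermed is associative and commutative, Hadoop is free to run it on any grouping of partial results, which is exactly what the combine phase exploits.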
Load Func
- Pig's load function is built on top of an InputFormat, the class which Hadoop uses to read data.
- InputFormat serves two functions: it determines how input will be split between map tasks, and it provides a RecordReader that produces key-value pairs as input to those map tasks.
- The load function takes these key-value pairs and returns a Pig Tuple.
Load Process
- Get the input format - defines how to read data from a file into the Mapper instances.
- Find the storage location via the setLocation method - this is where you can expose data sources other than HDFS. We've already seen numerous examples.
- Casting function - bytearray to types - needed if your storage keeps the tuple fields as bytearrays.
- Reading records - use the RecordReader to iterate over records. Terminates when null is encountered.
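Put together, a minimal custom loader is a handful of overrides. This is a sketch against the Pig 0.10-era LoadFunc API; the class name and the tab-delimited parsing are my own:

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SkeletonLoader extends LoadFunc {
    private RecordReader<?, ?> reader;

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);  // where the data lives
    }

    @Override
    public InputFormat getInputFormat() {
        return new TextInputFormat();                  // how Hadoop splits and reads it
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        // Iterate via the RecordReader; returning null tells Pig we're done.
        try {
            if (!reader.nextKeyValue()) return null;
            String line = reader.getCurrentValue().toString();
            return TupleFactory.getInstance().newTuple(Arrays.asList(line.split("\t")));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}
```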
Bonus
- Add partitions if your storage mechanism supports it. I know S3 does, as well as HBase.
- In order for Pig to request the relevant partitions, it must know how the data is partitioned. Pig determines this by calling getPartitionKeys.
- Add schema metadata. This supports omitting field names from load statements. It is also used in partitioning.
Store Func
- Basically the inverse of LOAD
- The STORE/LOAD functions are critically important.
- Unfortunately the code is depressingly boilerplate, so I didn't implement it in Scala.
- You can find plenty of examples in the Pig SVN repo
- This is a good reference as well -> http://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat
PigUnit
- A framework for "unit testing" Pig scripts
- Let's run over the specs in UDFSpecs to see the options...
- If it wasn't clear, asserts are based on text diffs
- You can pass in existing Pig scripts
- You can pass in input and output as files
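In Java, a PigUnit test is roughly this shape. This is a sketch: the script and data are invented to match the sensors example, and assertOutput's overloads should be double-checked against the PigUnit javadoc:

```java
import org.apache.pig.pigunit.PigTest;

public class SensorsScriptTest {
    public void testPositiveFilter() throws Exception {
        String[] script = {
            "sensors = LOAD 'input' AS (machine_name:chararray, record_date:chararray,",
            "          channel_1:float, channel_2:float, channel_3:float);",
            "positive_channel_1 = FILTER sensors BY channel_1 > 0;",
        };
        String[] input  = { "Machine-1\t2012-11-28 22:42:00\t0.1722\t0.108\t6.2504",
                            "Machine-2\t2012-11-28 22:42:00\t-0.5\t0.3336\t3.316" };
        String[] output = { "(Machine-1,2012-11-28 22:42:00,0.1722,0.108,6.2504)" };

        PigTest test = new PigTest(script);
        // Substitute in-memory input for the LOAD, run locally, text-diff the alias.
        test.assertOutput("sensors", input, "positive_channel_1", output);
    }
}
```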
Integration Testing DateToMillis
- Look in barrett.udf.UDFSpecs
- Look at the test titled "VerifyDateToMillisParser"
- Ignore the args val for now.
- Notice how the script val looks like a Pig script!
- Our input val is going to be aliased into the datesAsYYYYMMDD relation, basically injecting it into the script.
- When this is executed, a local Hadoop cluster will be initialized and the script run.
- The dateToMillis relation will then be compared against our output val
Maybe a Workflow for Analysts
- I'm a bit overzealous about CI and having all the business logic in source control and testable
- But some (a lot?) of Hadoop users aren't developers, so writing JUnit tests might not work.
- This is what Pig was developed for in the first place (analysts at Yahoo)
- How can we enable analysts to have their scripts verified by CI facilities?
- You could step outside of JUnit, but then you have test results scattered everywhere.
- My thought: check it all in.
My Thoughts
- Check in the Pig script, input, and output to source control.
- Use code generation to create an individual PigUnit test for each (script, input, output) triple
- Real reason: I want to play with Scala macros
Time for Some Next-Level Stuff
- I want to do iteration
- I want to solve this problem: clustering
- It's a common machine-learning task
- I don't want to use Mahout
Scalding
- Scalding is an API wrapper for Cascading
- Cascading is an abstraction over the Hadoop details
- You write your code in Scala and it is translated into the proper stuff... maybe
- It seems much easier to create complex workflows and associated user-defined functions
Word Count Example
- The first example is the older Fields-based version
- I didn't play with this much, as I'm most interested in static typing
- TypedWordCount displays the power a bit more
- Try dotting in on the val groups; you'll see that you can express quite a bit in a few lines
- Let's see how many lines it takes us to write an EvalFunc...
Clustering
- First, I get to create a rather extensive domain model in my native language
- I'm transparently creating UDFs as I write my object and class methods
- I'm able to confirm through the compiler that my iteration will behave correctly, via the @tailrec annotation
- The overhead of doing this in Pig, and certainly in Java, would probably be too high
Done
That's it!