Apache Pig
Setup
- Launch interactive local mode: ./pig -x local
- docker-compose up
- docker exec -it `hash` bash (`hash` is the container ID)
- github.com/michaltomanski/pig-demo
What is Pig
A platform for writing Hadoop programs, most often ETL jobs:
E (Extract)
T (Transform)
L (Load)
Pig Latin
- A simple, procedural language of SQL-like statements for ETL
- Lazy evaluation by default
- Supports branching and parallel processing
- User-defined functions can be included
Pig Latin: Relation
- The most basic entity in Pig Latin
- A classical table with columns and rows
- Rows are unordered
- Immutable
- Data types can be nested (a column's type can be a tuple)
- Referred to by an alias
Load data
input/OlympicAthletes.csv
Athlete, Country, Year, Sport, Gold, Silver, Bronze, Total
Yang Yilin, China, 2008, Gymnastics, 1, 0, 2, 3
Ruolin, China, 2008, Diving, 2, 0, 0, 2
Load data
athletes = LOAD 'input/OlympicAthletes.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage
        (',', 'YES_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER')
    AS (athlete:chararray, country:chararray, year:int,
        sport:chararray, gold:int, silver:int, bronze:int, total:int);
Check relation schema
describe athletes;
Show data
DUMP athletes;
DUMP is the first operation that needs actual output, so this is where the evaluation (here, only the loading) takes place
Alternative: STORE
STORE athletes INTO 'directory';
Store data
STORE athletes INTO 'directory';
See also: STREAM (sends the data through an external program)
GROUP, GENERATE, AGGREGATE
by_country = GROUP athletes BY country;
(run Describe)
last_medal_for_country =
    FOREACH by_country
    GENERATE group AS country, MAX(athletes.year) AS year;
sum_by_country =
    FOREACH by_country
    GENERATE group AS country, SUM(athletes.total) AS total;
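For readers who think in Java, the two FOREACH ... GENERATE statements above can be pictured as a group-then-fold over in-memory rows. This is only a rough analogy and not how Pig executes the job. The class `GroupBySketch`, the `Athlete` row type and `sampleRows()` are hypothetical names introduced for illustration; the two rows are the sample lines from input/OlympicAthletes.csv shown earlier.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupBySketch {
    // Hypothetical row type holding only the columns this sketch needs.
    static final class Athlete {
        final String name;
        final String country;
        final int year;
        final int total;

        Athlete(String name, String country, int year, int total) {
            this.name = name;
            this.country = country;
            this.year = year;
            this.total = total;
        }
    }

    // Analogy for: GENERATE group AS country, SUM(athletes.total) AS total
    static Map<String, Integer> sumByCountry(List<Athlete> athletes) {
        Map<String, Integer> out = new TreeMap<>();
        for (Athlete a : athletes) {
            out.merge(a.country, a.total, Integer::sum);
        }
        return out;
    }

    // Analogy for: GENERATE group AS country, MAX(athletes.year) AS year
    static Map<String, Integer> lastMedalYear(List<Athlete> athletes) {
        Map<String, Integer> out = new TreeMap<>();
        for (Athlete a : athletes) {
            out.merge(a.country, a.year, Integer::max);
        }
        return out;
    }

    // The two sample rows from the CSV excerpt in the deck.
    static List<Athlete> sampleRows() {
        return Arrays.asList(
            new Athlete("Yang Yilin", "China", 2008, 3),
            new Athlete("Ruolin", "China", 2008, 2));
    }

    public static void main(String[] args) {
        System.out.println(sumByCountry(sampleRows()));   // {China=5}
        System.out.println(lastMedalYear(sampleRows()));  // {China=2008}
    }
}
```

Unlike this loop, Pig keeps the grouped rows as a bag per key (visible with DESCRIBE by_country) and applies SUM or MAX to that bag.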
GROUP ALL, ORDER, LIMIT, DISTINCT
grouped_all = GROUP athletes ALL;
(run Describe)
count_all = FOREACH (GROUP athletes ALL) GENERATE COUNT(athletes);
countries = DISTINCT (FOREACH athletes GENERATE country);
countries_count = FOREACH (GROUP countries ALL) GENERATE COUNT(countries);
countries_or = LIMIT (ORDER countries BY country ASC) 10;
FILTER, JOIN
best = FILTER athletes BY gold >= 4;
copy = FOREACH athletes GENERATE *;
joined = JOIN athletes BY athlete, copy BY athlete;
(run Describe)
consecutive = FILTER joined BY athletes::year == copy::year + 4
    AND athletes::gold > 1
    AND copy::gold > 1;
(explain)
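The self-join plus filter above pairs each athlete's rows with each other and keeps the pairs that are four years apart with more than one gold medal in both Games. A minimal plain-Java sketch of that logic, as a nested-loop join, might look as follows. `SelfJoinSketch`, `Row` and the rows in `hypotheticalRows()` are made up for illustration and are not taken from the dataset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SelfJoinSketch {
    // Hypothetical row type with only the columns used by the filter.
    static final class Row {
        final String athlete;
        final int year;
        final int gold;

        Row(String athlete, int year, int gold) {
            this.athlete = athlete;
            this.year = year;
            this.gold = gold;
        }
    }

    static List<String> consecutive(List<Row> athletes) {
        List<String> out = new ArrayList<>();
        for (Row later : athletes) {          // plays the role of athletes::
            for (Row earlier : athletes) {    // plays the role of copy::
                // JOIN ... BY athlete
                boolean sameAthlete = later.athlete.equals(earlier.athlete);
                // FILTER ... BY athletes::year == copy::year + 4 AND both gold > 1
                if (sameAthlete
                        && later.year == earlier.year + 4
                        && later.gold > 1
                        && earlier.gold > 1) {
                    out.add(later.athlete + " " + earlier.year + "/" + later.year);
                }
            }
        }
        return out;
    }

    // Made-up rows, not real results.
    static List<Row> hypotheticalRows() {
        return Arrays.asList(
            new Row("Athlete A", 2004, 2),
            new Row("Athlete A", 2008, 3),
            new Row("Athlete B", 2008, 2));
    }

    public static void main(String[] args) {
        System.out.println(consecutive(hypotheticalRows())); // [Athlete A 2004/2008]
    }
}
```

The year condition is asymmetric, so each qualifying pair appears only once even though every row is compared with every other row.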
User defined functions
- Java
- Python
- JavaScript
Eval function
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Null, empty, or null-valued input yields null instead of failing the job
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing row", e);
        }
    }
}
UPPER.java
User defined functions
REGISTER input/myudfs.jar;
Aggregate function
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class COUNT extends EvalFunc<Long> implements Algebraic {
    public Long exec(Tuple input) throws IOException { return count(input); }

    // Names of the classes Pig runs in each phase of the job
    public String getInitial() { return Initial.class.getName(); }   // MAP
    public String getIntermed() { return Intermed.class.getName(); } // COMBINER
    public String getFinal() { return Final.class.getName(); }       // REDUCER

    static public class Initial extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return TupleFactory.getInstance().newTuple(count(input));
        }
    }
    static public class Intermed extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return TupleFactory.getInstance().newTuple(sum(input));
        }
    }
    static public class Final extends EvalFunc<Long> {
        public Long exec(Tuple input) throws IOException { return sum(input); }
    }

    static protected Long count(Tuple input) throws ExecException {
        Object values = input.get(0);
        if (values instanceof DataBag) return ((DataBag) values).size();
        else if (values instanceof Map) return Long.valueOf(((Map) values).size());
        else throw new ExecException("Cannot count a value that is neither a bag nor a map");
    }

    static protected Long sum(Tuple input) throws ExecException, NumberFormatException {
        DataBag values = (DataBag) input.get(0);
        long sum = 0;
        // Each tuple in the bag carries one partial count in its first field
        for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
            Tuple t = it.next();
            sum += (Long) t.get(0);
        }
        return sum;
    }
}
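The point of the three phases is that a count can be computed on parts of the data and the partial results added up afterwards. A plain-Java sketch of that flow, with no Pig dependency, could look like this. `AlgebraicCountSketch`, its method names and the placeholder rows are assumptions made for illustration; in Pig the splits, the combiner and the reducer are managed by the framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AlgebraicCountSketch {
    // Initial (map side): count the rows of a single input split.
    static long initial(List<String> split) {
        return split.size();
    }

    // Intermed (combiner) and Final (reducer): add up partial counts.
    static long sumPartials(List<Long> partials) {
        long sum = 0;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Runs the three phases one after another over in-memory splits.
    static long countInPhases(List<List<String>> splits) {
        List<Long> partials = new ArrayList<>();
        for (List<String> split : splits) {
            partials.add(initial(split));               // MAP
        }
        long combined = sumPartials(partials);          // COMBINER
        return sumPartials(Arrays.asList(combined));    // REDUCER
    }

    public static void main(String[] args) {
        // Placeholder rows: two splits with three and two rows.
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("row1", "row2", "row3"),
            Arrays.asList("row4", "row5"));
        System.out.println(countInPhases(splits)); // 5
    }
}
```

Because addition is associative, running the combiner step zero, one or several times does not change the final count, which is what makes COUNT a good fit for the Algebraic interface.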
Accumulator
public interface Accumulator<T> {
    /**
     * Process tuples. Each DataBag may contain 0 to many tuples for current key
     */
    public void accumulate(Tuple b) throws IOException;

    /**
     * Called when all tuples from current key have been passed to the accumulator.
     * @return the value for the UDF for this key.
     */
    public T getValue();

    /**
     * Called after getValue() to prepare processing for next key.
     */
    public void cleanup();
}
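The call order described in the comments (accumulate repeatedly, then getValue, then cleanup) can be sketched in plain Java. The interface below, `BatchAccumulator`, is a simplified stand-in for Pig's Accumulator that takes a `List<Integer>` batch instead of a Tuple. It, `MaxYear`, `maxOverBatches` and the example years are all hypothetical and only illustrate the lifecycle for one key.

```java
import java.util.Arrays;
import java.util.List;

class AccumulatorSketch {
    // Simplified stand-in for Pig's Accumulator<T>; batches replace Tuples.
    interface BatchAccumulator<T> {
        void accumulate(List<Integer> batch);
        T getValue();
        void cleanup();
    }

    // Keeps only a running maximum, so no batch has to stay in memory.
    static final class MaxYear implements BatchAccumulator<Integer> {
        private Integer max = null;

        public void accumulate(List<Integer> batch) {
            for (int year : batch) {
                if (max == null || year > max) {
                    max = year;
                }
            }
        }

        public Integer getValue() {
            return max;
        }

        public void cleanup() {
            max = null; // reset state before the next key
        }
    }

    // Drives the lifecycle for a single key.
    static Integer maxOverBatches(List<List<Integer>> batches) {
        MaxYear acc = new MaxYear();
        for (List<Integer> batch : batches) {
            acc.accumulate(batch);
        }
        Integer result = acc.getValue();
        acc.cleanup();
        return result;
    }

    public static void main(String[] args) {
        // Made-up years arriving in two batches for one key.
        List<List<Integer>> batches = Arrays.asList(
            Arrays.asList(2000, 2004),
            Arrays.asList(2008));
        System.out.println(maxOverBatches(batches)); // 2008
    }
}
```

The design choice this illustrates is that the UDF sees the values of a key in chunks and keeps only a small running state between calls.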
BATCH MODE
By Michał Tomański