Apache Pig

Setup

  • Launch interactive local mode: ./pig -x local
  • docker-compose up
  • docker exec -it `hash` bash (replace `hash` with the container ID)
  • github.com/michaltomanski/pig-demo

What is Pig

Software used to create Hadoop programs: E (Extract), T (Transform), L (Load) jobs

Pig Latin

  • Simple, procedural language of SQL-like statements for ETL
  • Lazy evaluation by default
  • Supports branching and parallel processing
  • User-defined functions can be included

Relation

  • Most basic entity in Pig Latin
  • A classical table, with columns and rows
  • Rows are unordered
  • Immutable
  • Data types can be nested (a column's type can be a tuple)
  • Referred to with an alias

Load data

input/OlympicAthletes.csv

Athlete,     Country, Year, Sport,      Gold, Silver, Bronze, Total
Yang Yilin,  China,   2008, Gymnastics, 1,    0,      2,      3
Ruolin,      China,   2008, Diving,     2,    0,      0,      2

Load data

athletes = LOAD 'input/OlympicAthletes.csv' 
  USING org.apache.pig.piggybank.storage.CSVExcelStorage
  (',', 'YES_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER')
  AS (athlete:chararray, country:chararray, year:int, 
  sport:chararray, gold:int, silver:int, bronze:int, total:int);

Check relation schema

DESCRIBE athletes;

Show data

DUMP athletes;

DUMP is the first operation that needs the relation's contents, so this is where the actual evaluation (here, only the loading) takes place.

Alternative: STORE

STORE athletes INTO 'directory';
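
As a loose analogy only (plain Java, not Pig), the deferred evaluation described above behaves much like a Java stream: building the pipeline reads nothing, and only a terminal operation, which plays the role of DUMP or STORE here, pulls rows through it. The class and method names are hypothetical; the two rows come from the sample file above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyEvaluationSketch {
    // Counts how many rows have actually been pulled through the pipeline.
    static final AtomicInteger rowsRead = new AtomicInteger();

    // Plays the role of LOAD + FILTER: describes the work but does not run it.
    static Stream<String> buildPlan(List<String> rows) {
        return rows.stream()
                .peek(row -> rowsRead.incrementAndGet())
                .filter(row -> row.contains("Diving"));
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "Yang Yilin,China,2008,Gymnastics,1,0,2,3",
                "Ruolin,China,2008,Diving,2,0,0,2");

        Stream<String> plan = buildPlan(rows);
        System.out.println("rows read after building the plan: " + rowsRead.get());

        // The terminal operation plays the role of DUMP: only now are rows read.
        List<String> result = plan.collect(Collectors.toList());
        System.out.println("rows read after evaluation: " + rowsRead.get());
        System.out.println(result);
    }
}
```

Pig goes further than this sketch, since it can optimize the whole plan before running it, but the trigger point is the same idea.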

Store data

STORE athletes INTO 'directory';

See also: STREAM (sends a relation through an external program or script)

GROUP, GENERATE,  AGGREGATE

by_country = GROUP athletes BY country;

(run Describe)

last_medal_for_country =
 FOREACH by_country
 GENERATE group AS country, MAX(athletes.year) AS year;
sum_by_country =
 FOREACH by_country
 GENERATE group AS country, SUM(athletes.total) AS total;
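
For readers more at home in Java, here is a rough sketch of what the grouping and the two aggregates compute: GROUP produces one entry per country holding a "bag" of that country's rows, and FOREACH ... GENERATE then reduces each bag to a single value. This is an analogy in plain Java, not how Pig executes the script; the Athlete class and the method names are hypothetical, and the rows are the two from the sample file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class GroupAggregateSketch {
    // Hypothetical row type holding only the columns used below.
    static final class Athlete {
        final String name;
        final String country;
        final int year;
        final int total;

        Athlete(String name, String country, int year, int total) {
            this.name = name;
            this.country = country;
            this.year = year;
            this.total = total;
        }
    }

    // Like GROUP athletes BY country: one entry per country with a bag of its rows.
    static Map<String, List<Athlete>> groupByCountry(List<Athlete> athletes) {
        return athletes.stream()
                .collect(Collectors.groupingBy(a -> a.country, TreeMap::new, Collectors.toList()));
    }

    // Like MAX(athletes.year) over one group's bag (bags are never empty here).
    static int lastMedalYear(List<Athlete> bag) {
        return bag.stream().mapToInt(a -> a.year).max().getAsInt();
    }

    // Like SUM(athletes.total) over one group's bag.
    static int totalMedals(List<Athlete> bag) {
        return bag.stream().mapToInt(a -> a.total).sum();
    }

    public static void main(String[] args) {
        List<Athlete> athletes = Arrays.asList(
                new Athlete("Yang Yilin", "China", 2008, 3),
                new Athlete("Ruolin", "China", 2008, 2));

        // Like FOREACH by_country GENERATE group, MAX(...), SUM(...).
        groupByCountry(athletes).forEach((country, bag) ->
                System.out.println(country + " " + lastMedalYear(bag) + " " + totalMedals(bag)));
        // prints: China 2008 5
    }
}
```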

GROUP ALL, ORDER, LIMIT, DISTINCT

grouped_all = GROUP athletes ALL;

(run Describe)

count_all = FOREACH (GROUP athletes ALL) GENERATE COUNT(athletes);
countries = DISTINCT (FOREACH athletes GENERATE country);
countries_count = FOREACH (GROUP countries ALL) GENERATE COUNT(countries);
countries_or = LIMIT (ORDER countries BY country ASC) 10;

FILTER, JOIN

best = FILTER athletes BY gold >= 4;
copy = FOREACH athletes GENERATE *;
joined = JOIN athletes BY athlete, copy BY athlete;

(run Describe)

consecutive = FILTER joined BY athletes::year == copy::year + 4 
  AND athletes::gold > 1 
  AND copy::gold > 1;

(run Explain)

User defined functions

  • Java
  • Python
  • JavaScript

Eval function

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {
    // Returns the first field upper-cased, or null when there is nothing to convert.
    public String exec(Tuple input) throws IOException {
        if (input == null || input.get(0) == null)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing row", e);
        }
    }
}

UPPER.java

User defined functions

REGISTER input/myudfs.jar; 

Aggregate function

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class COUNT extends EvalFunc<Long> implements Algebraic {
    public Long exec(Tuple input) throws IOException { return count(input); }
    public String getInitial() { return Initial.class.getName(); }   // MAP
    public String getIntermed() { return Intermed.class.getName(); } // COMBINER
    public String getFinal() { return Final.class.getName(); }       // REDUCER

    static public class Initial extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return TupleFactory.getInstance().newTuple(count(input));
        }
    }
    static public class Intermed extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return TupleFactory.getInstance().newTuple(sum(input));
        }
    }
    static public class Final extends EvalFunc<Long> {
        public Long exec(Tuple input) throws IOException { return sum(input); }
    }

    // Counts the entries of the bag or map held in the first field.
    static protected Long count(Tuple input) throws ExecException {
        Object values = input.get(0);
        if (values instanceof DataBag) return ((DataBag) values).size();
        if (values instanceof Map) return (long) ((Map<?, ?>) values).size();
        // Anything else cannot be counted.
        throw new ExecException("Cannot count a value that is neither a bag nor a map");
    }

    // Adds up the partial counts produced by the previous phase.
    static protected Long sum(Tuple input) throws ExecException, NumberFormatException {
        DataBag values = (DataBag) input.get(0);
        long sum = 0;
        for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
            Tuple t = it.next();
            sum += (Long) t.get(0);
        }
        return sum;
    }
}
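
To make the MAP / COMBINER / REDUCER comments above concrete, here is a minimal plain-Java sketch of the same three-phase idea without any Pig classes: each chunk of input is counted on its own, partial counts are added together, and a final step adds whatever partials remain. All names are hypothetical, and the way the partials are split between two combiner calls is an arbitrary choice for illustration; in a real job the framework decides where and how often the combiner runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AlgebraicCountSketch {
    // Initial (map side): count the rows in one chunk of the input.
    static long initial(List<String> chunk) {
        return chunk.size();
    }

    // Intermed (combiner): add up some of the partial counts.
    static long intermed(List<Long> partialCounts) {
        return sum(partialCounts);
    }

    // Final (reduce side): add up the remaining partial counts.
    static long finalCount(List<Long> partialCounts) {
        return sum(partialCounts);
    }

    private static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // Runs the three phases in one process, one after another.
    static long countInPhases(List<List<String>> chunks) {
        List<Long> partials = new ArrayList<>();
        for (List<String> chunk : chunks) {
            partials.add(initial(chunk));
        }
        // Pretend two combiners each saw half of the partial counts.
        int mid = partials.size() / 2;
        List<Long> combined = Arrays.asList(
                intermed(partials.subList(0, mid)),
                intermed(partials.subList(mid, partials.size())));
        return finalCount(combined);
    }

    public static void main(String[] args) {
        List<List<String>> chunks = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e", "f"));
        System.out.println(countInPhases(chunks)); // prints 6
    }
}
```

The result is the same however the partial counts are split, which is the property that lets a function like COUNT use a combiner.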

Accumulator

public interface Accumulator<T> {
    /**
     * Process tuples. Each DataBag may contain 0 to many tuples for current key.
     */
    public void accumulate(Tuple b) throws IOException;

    /**
     * Called when all tuples from current key have been passed to the accumulator.
     * @return the value for the UDF for this key.
     */
    public T getValue();

    /**
     * Called after getValue() to prepare processing for next key.
     */
    public void cleanup();
}
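
The call order described in the comments above (accumulate one or more times, then getValue, then cleanup) can be sketched in plain Java. The BatchAccumulator interface below is a hypothetical, simplified stand-in that takes a list of integers instead of a Tuple wrapping a DataBag, and the years passed in are made-up values for illustration; a real UDF would implement Pig's own interface.

```java
import java.util.Arrays;
import java.util.List;

public class AccumulatorSketch {
    // Hypothetical, simplified stand-in for the interface shown above.
    interface BatchAccumulator<T> {
        void accumulate(List<Integer> batch);
        T getValue();
        void cleanup();
    }

    // Keeps only a running maximum, so a key's values never need to be held all at once.
    static final class MaxYear implements BatchAccumulator<Integer> {
        private Integer max; // null until the first value arrives

        public void accumulate(List<Integer> batch) {
            for (int year : batch) {
                if (max == null || year > max) {
                    max = year;
                }
            }
        }

        public Integer getValue() {
            return max;
        }

        public void cleanup() {
            max = null; // reset state before the next key
        }
    }

    public static void main(String[] args) {
        MaxYear acc = new MaxYear();
        // Values for one key arrive in several batches.
        acc.accumulate(Arrays.asList(2000, 2008));
        acc.accumulate(Arrays.asList(2004));
        System.out.println(acc.getValue()); // prints 2008
        acc.cleanup();
        System.out.println(acc.getValue()); // prints null
    }
}
```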

BATCH MODE


By Michał Tomański