Barrett Strausser
@FuncBearrito
Machine-1 2012-11-28 22:42:00 0.1722 0.108 6.2504
Machine-2 2012-11-28 22:42:00 0.0185 0.3336 3.316
Machine-3 2012-11-28 22:42:00 1.6843 0.2725 1.3314
Machine-1 2012-11-28 22:42:00 0.1482 0.1422 0.3965
$ /opt/pig/bin/pig
> pwd
> ls
> copyFromLocal /home/hduser/mapper_input /user/hduser/mapper_input
> sensors = load '/user/hduser/mapper_input' AS (machine_name:chararray,record_date:chararray,channel_1:float,channel_2:float,channel_3:float);
> DESCRIBE sensors;
> ILLUSTRATE sensors;
> positive_channel_1 = FILTER sensors BY channel_1 > 0;
> ILLUSTRATE positive_channel_1;
> just_channel_2 = FOREACH sensors GENERATE channel_2;
> ILLUSTRATE just_channel_2;
> large_channel_3 = ORDER sensors BY channel_3 DESC;
> top_ten_channel_3 = LIMIT large_channel_3 10;
> subset = SAMPLE sensors .001;
> DUMP subset;
> SPLIT sensors INTO machine_three IF machine_name==' Machine-3', machine_two IF machine_name==' Machine-2', machine_one IF (machine_name == ' Machine-1');
> ... computation on each partition
> biased_sample = UNION ONSCHEMA (UNION ONSCHEMA (SAMPLE machine_one .01) , (SAMPLE machine_two .05)) , (SAMPLE machine_three .10);
> STORE biased_sample INTO 'biased_sample_checkpoint';
> sensor_group = GROUP sensors BY machine_name;
> channel_one_sum_and_avg = FOREACH sensor_group GENERATE group, AVG(sensors.channel_1), SUM(sensors.channel_1);
Fairly straightforward: we GROUP BY a key (here machine_name) and then aggregate within each group.
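To make the GROUP and FOREACH step concrete, here is a minimal plain-Python sketch of what those two statements compute, using the four sample rows from the top of the deck. It is an analogy, not how Pig executes the query: the helper names group_by_machine and channel_one_avg_and_sum are hypothetical, and treating the date and time as a single record_date value is an assumption.

```python
from collections import defaultdict

# The four sample rows shown earlier:
# (machine_name, record_date, channel_1, channel_2, channel_3)
# Assumption: date and time together form the record_date field.
sensors = [
    ("Machine-1", "2012-11-28 22:42:00", 0.1722, 0.108, 6.2504),
    ("Machine-2", "2012-11-28 22:42:00", 0.0185, 0.3336, 3.316),
    ("Machine-3", "2012-11-28 22:42:00", 1.6843, 0.2725, 1.3314),
    ("Machine-1", "2012-11-28 22:42:00", 0.1482, 0.1422, 0.3965),
]


def group_by_machine(rows):
    """Mimic GROUP sensors BY machine_name: key -> bag of full rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return dict(groups)


def channel_one_avg_and_sum(groups):
    """Mimic FOREACH sensor_group GENERATE group, AVG(...), SUM(...)."""
    result = []
    for machine, bag in sorted(groups.items()):
        values = [row[2] for row in bag]  # channel_1 of every row in the bag
        total = sum(values)
        result.append((machine, total / len(values), total))
    return result


sensor_group = group_by_machine(sensors)
channel_one_sum_and_avg = channel_one_avg_and_sum(sensor_group)
for record in channel_one_sum_and_avg:
    print(record)
```

Each output record has the same shape as the Pig result: the group key first, then the average and the sum of channel_1 over that machine's rows.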
InputFormat is the class which Hadoop uses to read data. InputFormat serves two functions: it determines how input will be split between map tasks, and it provides a RecordReader that produces key-value pairs as input to those map tasks.
Tuple
.getPartitionKeys
That's it!
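The two roles of an InputFormat described above can be pictured with a small plain-Python sketch. This is a conceptual analogy only: get_splits and record_reader are hypothetical names, not Hadoop's or Pig's API, and using a line number as the key is a simplifying assumption.

```python
# Conceptual analogy only: these hypothetical functions are not Hadoop's API.

def get_splits(lines, lines_per_split):
    """Role 1: decide how the input is divided between map tasks."""
    return [
        lines[i:i + lines_per_split]
        for i in range(0, len(lines), lines_per_split)
    ]


def record_reader(split, first_line_number):
    """Role 2: turn one split into (key, value) pairs for a map task."""
    for position, line in enumerate(split):
        # Assumption: the key is the line number within the whole input.
        yield first_line_number + position, line


raw_input = [
    "Machine-1 2012-11-28 22:42:00 0.1722 0.108 6.2504",
    "Machine-2 2012-11-28 22:42:00 0.0185 0.3336 3.316",
    "Machine-3 2012-11-28 22:42:00 1.6843 0.2725 1.3314",
    "Machine-1 2012-11-28 22:42:00 0.1482 0.1422 0.3965",
]

splits = get_splits(raw_input, lines_per_split=2)

# Each inner list is what one (simulated) map task would receive.
map_inputs = []
start = 0
for split in splits:
    map_inputs.append(list(record_reader(split, start)))
    start += len(split)

for task_number, pairs in enumerate(map_inputs):
    print(task_number, pairs)
```

The point of the separation is that splitting decides parallelism, while the record reader decides what a single record looks like to the map task.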