org.apache.pig.piggybank.evaluation
Class Over

java.lang.Object
  extended by org.apache.pig.EvalFunc<DataBag>
      extended by org.apache.pig.piggybank.evaluation.Over

public class Over
extends EvalFunc<DataBag>

Given an aggregate function, a bag, and possibly a window definition, produce output that matches SQL OVER. It is the responsibility of the caller to have already ordered the bag as required by their operation. The bag, the aggregate, and the window definition are passed to exec on each call, as shown in the usage below; the constructor optionally takes the return schema.

Usage: Over(bag, function_to_call[, window_start, window_end[, function_specific_args]])

bag - The bag to operate on. Most functions assume this is a bag with tuples of a single field.

function_to_call - Can be one of the following:

window_start - optional - Record to start window on for the function. -1 indicates 'unbounded preceding', i.e. the beginning of the bag. A positive integer indicates that number of records before the current record. 0 indicates the current record. If not specified, -1 is the default.

window_end - optional - Record to end window on for the function. -1 indicates 'unbounded following', i.e. the end of the bag. A positive integer indicates that number of records after the current record. 0 indicates the current record. If not specified, 0 is the default.

function_specific_args - optional for some functions - The following functions accept or require additional arguments:
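
The window_start and window_end conventions described above can be pictured with a small standalone Java sketch. It is not the piggybank source: the class WindowSketch and its methods are hypothetical names, and clamping the window at the edges of the bag is an assumption made for illustration. With the defaults (-1, 0) the windowed sum becomes a cumulative sum.

```java
import java.util.Arrays;

/** Illustrative only: hypothetical helper, not part of Pig or piggybank. */
final class WindowSketch {

    /** First index of the window for row `current`; -1 means 'unbounded preceding'. */
    static int firstIndex(int current, int windowStart) {
        // Assumption: a window reaching before the bag is clamped to index 0.
        return windowStart < 0 ? 0 : Math.max(0, current - windowStart);
    }

    /** Last index (inclusive) of the window for row `current`; -1 means 'unbounded following'. */
    static int lastIndex(int current, int windowEnd, int size) {
        // Assumption: a window reaching past the bag is clamped to the last index.
        return windowEnd < 0 ? size - 1 : Math.min(size - 1, current + windowEnd);
    }

    /** Sum of the values inside each row's window, for an already ordered input. */
    static double[] windowedSum(double[] ordered, int windowStart, int windowEnd) {
        double[] out = new double[ordered.length];
        for (int i = 0; i < ordered.length; i++) {
            int from = firstIndex(i, windowStart);
            int to = lastIndex(i, windowEnd, ordered.length);
            double sum = 0;
            for (int j = from; j <= to; j++) {
                sum += ordered[j];
            }
            out[i] = sum;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] f = {1.0, 2.0, 3.0, 4.0};
        // Defaults (-1, 0): unbounded preceding to current row, i.e. a cumulative sum.
        System.out.println(Arrays.toString(windowedSum(f, -1, 0)));  // [1.0, 3.0, 6.0, 10.0]
        // One record before to one record after the current record.
        System.out.println(Arrays.toString(windowedSum(f, 1, 1)));   // [3.0, 6.0, 9.0, 7.0]
        // Whole bag for every row.
        System.out.println(Arrays.toString(windowedSum(f, -1, -1))); // [10.0, 10.0, 10.0, 10.0]
    }
}
```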

Example Usage:

To do a cumulative sum:

 A = load 'T' AS (si:chararray, i:int, d:long, f:float, s:chararray);
 C = foreach (group A by si) {
     Aord = order A by d;
     generate flatten(Stitch(Aord, Over(Aord.f, 'sum(float)')));
 }
 D = foreach C generate s, $5;

This is equivalent to the SQL statement

select s, sum(f) over (partition by si order by d) from T;

To find the value 3 records ahead of the current record, using a window between the current row and 3 records ahead and a default value of 0:

 A = load 'T' AS (si:chararray, i:int, d:long, f:float, s:chararray);
 C = foreach (group A by si) {
     Aord = order A by i;
     generate flatten(Stitch(Aord, Over(Aord.i, 'lead', 0, 3, 3, 0)));
 }
 D = foreach C generate s, $5;

This is equivalent to the SQL statement

select s, lead(i, 3, 0) over (partition by si order by i rows between current row and 3 following) from T;
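
As a rough picture of what lead with an offset of 3 and a default of 0 computes within one partition, here is a standalone Java sketch. LeadSketch and lead are hypothetical names chosen for illustration, the input is assumed to be already ordered, and the offset is assumed to be non-negative; this is not how the UDF itself is implemented.

```java
import java.util.Arrays;

/** Illustrative only: hypothetical helper, not part of Pig or piggybank. */
final class LeadSketch {

    /**
     * For each position, return the value `offset` records ahead in the ordered
     * input, or `defaultValue` when no such record exists.
     * Assumption: offset is non-negative.
     */
    static int[] lead(int[] ordered, int offset, int defaultValue) {
        int[] out = new int[ordered.length];
        for (int i = 0; i < ordered.length; i++) {
            int target = i + offset;
            out[i] = target < ordered.length ? ordered[target] : defaultValue;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] i = {10, 20, 30, 40, 50};
        // The last three records have nothing 3 ahead, so they get the default 0.
        System.out.println(Arrays.toString(lead(i, 3, 0))); // [40, 50, 0, 0, 0]
    }
}
```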

Over accepts a constructor argument specifying the name and type, colon-separated, of its return schema.

 DEFINE IOver org.apache.pig.piggybank.evaluation.Over('state_rk:int');
 cities = LOAD 'cities' AS (city:chararray, state:chararray, pop:int);
 -- Decorate each city with its population rank within the state it belongs to:
 ranked = FOREACH(GROUP cities BY state) {
   c_ord = ORDER cities BY pop DESC;
   GENERATE FLATTEN(Stitch(c_ord,
     IOver(c_ord, 'rank', -1, -1, 2))); -- beginning (-1) to end (-1) on third field (2)
 };
 DESCRIBE ranked;
 -- ranked: {stitched::city: chararray,stitched::state: chararray,stitched::pop: int,stitched::state_rk: int}
 DUMP ranked;
 -- ...
 -- (Nashville,Tennessee,609644,2)
 -- (Houston,Texas,2145146,1)
 -- (San Antonio,Texas,1359758,2)
 -- (Dallas,Texas,1223229,3)
 -- (Austin,Texas,820611,4)
 -- ...
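
The ranking in the dump above follows the usual SQL rank semantics, which the following standalone Java sketch illustrates for one partition. RankSketch and rank are hypothetical names, the input is assumed to be already ordered, and the treatment of ties (equal values share a rank and the following rank is skipped) is the standard SQL behavior rather than something this page states.

```java
import java.util.Arrays;

/** Illustrative only: hypothetical helper, not part of Pig or piggybank. */
final class RankSketch {

    /**
     * SQL-style rank over an already ordered input: equal neighbours share a
     * rank, and the rank after a tie skips ahead to the row's 1-based position.
     */
    static int[] rank(int[] ordered) {
        int[] out = new int[ordered.length];
        for (int i = 0; i < ordered.length; i++) {
            boolean tiedWithPrevious = i > 0 && ordered[i] == ordered[i - 1];
            out[i] = tiedWithPrevious ? out[i - 1] : i + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        // Texas populations from the dump above, ordered descending.
        int[] texas = {2145146, 1359758, 1223229, 820611};
        System.out.println(Arrays.toString(rank(texas)));                // [1, 2, 3, 4]
        // A made-up input with a tie, to show the skipped rank.
        System.out.println(Arrays.toString(rank(new int[]{5, 5, 3})));   // [1, 1, 3]
    }
}
```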
 


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.pig.EvalFunc
EvalFunc.SchemaType
 
Field Summary
 
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter
 
Constructor Summary
Over()
           
Over(String typespec)
           
 
Method Summary
 DataBag exec(Tuple input)
          This callback method must be implemented by all subclasses.
 Schema outputSchema(Schema inputSch)
          Report the schema of the output of this UDF.
 
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Over

public Over()

Over

public Over(String typespec)
Method Detail

exec

public DataBag exec(Tuple input)
             throws IOException
Description copied from class: EvalFunc
This callback method must be implemented by all subclasses. This is the method that will be invoked on every Tuple of a given dataset. Since the dataset may be divided up in a variety of ways the programmer should not make assumptions about state that is maintained between invocations of this method.

Specified by:
exec in class EvalFunc<DataBag>
Parameters:
input - the Tuple to be processed.
Returns:
result, of type T.
Throws:
IOException

outputSchema

public Schema outputSchema(Schema inputSch)
Description copied from class: EvalFunc
Report the schema of the output of this UDF. Pig will make use of this in error checking, optimization, and planning. The schema of input data to this UDF is provided.

The default implementation interprets the OutputSchema annotation, if one is present. Otherwise, it returns null (no known output schema).

Overrides:
outputSchema in class EvalFunc<DataBag>
Parameters:
inputSch - Schema of the input
Returns:
Schema of the output


Copyright © 2007-2012 The Apache Software Foundation