public class Over extends EvalFunc<DataBag>
Usage: Over(bag, function_to_call[, window_start, window_end[, function specific args]])
bag - The bag to be called. Most functions assume this is a bag with tuples of a single field.
function_to_call - Can be one of the following:
window_start - optional - Record to start window on for the function. -1 indicates 'unbounded preceding', i.e. the beginning of the bag. A positive integer indicates that number of records before the current record. 0 indicates the current record. If not specified -1 is the default.
window_end - optional - Record to end window on for the function. -1 indicates 'unbounded following', i.e. the end of the bag. A positive integer indicates that number of records after the current record. 0 indicates teh current record. If not specified 0 is the default.
function_specific_args - maybe optional - The following functions accept require additional arguments:
Example Usage:
To do a cumulative sum:
A = load 'T' AS (si:chararray, i:int, d:long, f:float, s:chararray); C = foreach (group A by si) { Aord = order A by d; generate flatten(Stitch(Aord, Over(Aord.f, 'sum(float)'))); } D = foreach C generate s, $5;
This is equivalent to the SQL statement
select s, sum(f) over (partition by si order by d) from T;
To find the record 3 ahead of the current record, using a window between the current row and 3 records ahead and a default value of 0.
A = load 'T' AS (si:chararray, i:int, d:long, f:float, s:chararray); C = foreach (group A by si) { Aord = order A by i; generate flatten(Stitch(Aord, Over(Aord.i, 'lead', 0, 3, 3, 0))); } D = foreach C generate s, $9;
This is equivalent to the SQL statement
select s, lead(i, 3, 0) over (partition by si order by i rows between current row and 3 following) over T;
Over accepts a constructor argument specifying the name and type, colon-separated, of its return schema.
DEFINE IOver org.apache.pig.piggybank.evaluation.Over('state_rk:int'); cities = LOAD 'cities' AS (city:chararray, state:chararray, pop:int); -- Decorate each city with its population rank within the state it belongs to: ranked = FOREACH(GROUP cities BY state) { c_ord = ORDER cities BY pop DESC; GENERATE FLATTEN(Stitch(c_ord, IOver(c_ord, 'rank', -1, -1, 2))); -- beginning (-1) to end (-1) on third field (2) }; DESCRIBE ranked; -- ranked: {stitched::city: chararray,stitched::state: chararray,stitched::pop: int,stitched::state_rk: int} DUMP ranked; -- ... -- (Nashville,Tennessee,609644,2) -- (Houston,Texas,2145146,1) -- (San Antonio,Texas,1359758,2) -- (Dallas,Texas,1223229,3) -- (Austin,Texas,820611,4) -- ...
EvalFunc.SchemaType
Modifier and Type | Method and Description |
---|---|
DataBag |
exec(Tuple input)
This callback method must be implemented by all subclasses.
|
Schema |
outputSchema(Schema inputSch)
Report the schema of the output of this UDF.
|
allowCompileTimeCalculation, finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, getShipFiles, isAsynchronous, needEndOfAllInputProcessing, progress, setEndOfAllInput, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
public Over()
public Over(String typespec)
public DataBag exec(Tuple input) throws IOException
EvalFunc
exec
in class EvalFunc<DataBag>
input
- the Tuple to be processed.IOException
public Schema outputSchema(Schema inputSch)
EvalFunc
The default implementation interprets the OutputSchema
annotation,
if one is present. Otherwise, it returns null
(no known output schema).
outputSchema
in class EvalFunc<DataBag>
inputSch
- Schema of the inputCopyright © 2007-2012 The Apache Software Foundation