org.apache.pig.builtin
Class TOP

java.lang.Object
  extended by org.apache.pig.EvalFunc<DataBag>
      extended by org.apache.pig.builtin.TOP
All Implemented Interfaces:
Algebraic

public class TOP
extends EvalFunc<DataBag>
implements Algebraic

The TOP UDF accepts a bag of tuples and returns the top-n tuples, ranked by the value of a tuple field of type long. Both n and the field number must be provided to the UDF. The UDF iterates through the input bag and retains only the top-n tuples by storing them in a priority queue of size n+1, where the priority is the long field. This is efficient because a priority queue provides constant-time, O(1), access to the least element and O(log n) insertion and removal (heap restructuring). The UDF is especially helpful for turning a nested grouping operation inside out and retaining the top-n in a nested group. It assumes that all tuples in the bag contain an element of the same type in the compared column.

Sample usage:

    A = LOAD 'test.tsv' as (first: chararray, second: chararray);
    B = GROUP A BY (first, second);
    C = FOREACH B generate FLATTEN(group), COUNT(A) as count;
    D = GROUP C BY first; -- again group by first
    topResults = FOREACH D {
        result = TOP(10, 2, C); -- and retain top 10 occurrences of 'second' in first
        GENERATE FLATTEN(result);
    }
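The bounded priority queue technique described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the Pig implementation: the class TopNSketch, its methods topN and row, and the use of List<Object> in place of Pig's Tuple are all hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-alone sketch; it does not use or mirror Pig's classes.
class TopNSketch {

    // Builds a three-field row; a plain List<Object> stands in for a Pig Tuple.
    static List<Object> row(String first, String second, long count) {
        return List.of(first, second, count);
    }

    // Returns up to n rows with the largest long value in the given
    // zero-based column, ordered from largest to smallest.
    static List<List<Object>> topN(int n, int column, List<List<Object>> rows) {
        Comparator<List<Object>> byColumn =
                Comparator.comparingLong(r -> (Long) r.get(column));

        // Min-heap: the head is always the smallest row currently retained.
        PriorityQueue<List<Object>> heap = new PriorityQueue<>(n + 1, byColumn);
        for (List<Object> r : rows) {
            heap.offer(r);              // O(log n) insertion
            if (heap.size() > n) {
                heap.poll();            // O(log n) removal of the least element
            }
        }

        List<List<Object>> result = new ArrayList<>(heap);
        result.sort(byColumn.reversed());
        return result;
    }

    public static void main(String[] args) {
        List<List<Object>> rows = List.of(
                row("a", "x", 3L),
                row("a", "y", 7L),
                row("a", "z", 9L),
                row("a", "w", 1L));
        // Keep the two rows with the largest count (column index 2).
        System.out.println(topN(2, 2, rows));
    }
}
```

The heap never holds more than n+1 rows, so memory stays bounded by n regardless of the size of the input.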


Nested Class Summary
static class TOP.Final
           
static class TOP.Initial
           
static class TOP.Intermed
           
 
Field Summary
 
Fields inherited from class org.apache.pig.EvalFunc
pigLogger, reporter, returnType
 
Constructor Summary
TOP()
           
 
Method Summary
 DataBag exec(Tuple tuple)
          This callback method must be implemented by all subclasses.
 List<FuncSpec> getArgToFuncMapping()
          Allow a UDF to specify type specific implementations of itself.
 String getFinal()
          Get the final function.
 String getInitial()
          Get the initial function.
 String getIntermed()
          Get the intermediate function.
 Schema outputSchema(Schema input)
          Report the schema of the output of this UDF.
protected static void updateTop(PriorityQueue<Tuple> store, int limit, DataBag inputBag)
           
 
Methods inherited from class org.apache.pig.EvalFunc
finish, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setPigLogger, setReporter, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TOP

public TOP()

Method Detail

exec

public DataBag exec(Tuple tuple)
             throws IOException
Description copied from class: EvalFunc
This callback method must be implemented by all subclasses. This is the method that will be invoked on every Tuple of a given dataset. Since the dataset may be divided up in a variety of ways, the programmer should not make assumptions about state that is maintained between invocations of this method.

Specified by:
exec in class EvalFunc<DataBag>
Parameters:
tuple - the Tuple to be processed.
Returns:
result, of type T (DataBag for this class).
Throws:
IOException

updateTop

protected static void updateTop(PriorityQueue<Tuple> store,
                                int limit,
                                DataBag inputBag)

getArgToFuncMapping

public List<FuncSpec> getArgToFuncMapping()
                                   throws FrontendException
Description copied from class: EvalFunc
Allow a UDF to specify type-specific implementations of itself. For example, an implementation of arithmetic sum might have int and float implementations, since integer arithmetic performs much better than floating-point arithmetic. Pig's typechecker will call this method and, using the returned list plus the schema of the function's input data, decide which implementation of the UDF to use.

Overrides:
getArgToFuncMapping in class EvalFunc<DataBag>
Returns:
A List containing FuncSpec objects representing the EvalFunc class which can handle the inputs corresponding to the schema in the objects. Each FuncSpec should be constructed with a schema that describes the input for that implementation. For example, the sum function above would return two elements in its list:
  1. FuncSpec(this.getClass().getName(), new Schema(new Schema.FieldSchema(null, DataType.DOUBLE)))
  2. FuncSpec(IntSum.class.getName(), new Schema(new Schema.FieldSchema(null, DataType.INTEGER)))
This would indicate that the main implementation is used for doubles, and the special implementation IntSum is used for ints.
Throws:
FrontendException

outputSchema

public Schema outputSchema(Schema input)
Description copied from class: EvalFunc
Report the schema of the output of this UDF. Pig will make use of this in error checking, optimization, and planning. The schema of input data to this UDF is provided.

Overrides:
outputSchema in class EvalFunc<DataBag>
Parameters:
input - Schema of the input
Returns:
Schema of the output

getInitial

public String getInitial()
Description copied from interface: Algebraic
Get the initial function.

Specified by:
getInitial in interface Algebraic
Returns:
A function name of f_init. f_init should be an eval func.

getIntermed

public String getIntermed()
Description copied from interface: Algebraic
Get the intermediate function.

Specified by:
getIntermed in interface Algebraic
Returns:
A function name of f_intermed. f_intermed should be an eval func.

getFinal

public String getFinal()
Description copied from interface: Algebraic
Get the final function.

Specified by:
getFinal in interface Algebraic
Returns:
A function name of f_final. f_final should be an eval func parametrized by the same datum as the eval func implementing this interface.
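
The three functions named by getInitial, getIntermed, and getFinal suggest a computation that can run in stages. The sketch below illustrates why top-n fits that shape, assuming the usual reading of an algebraic function: partial results are computed per chunk, merged, and then merged once more for the final answer. It is plain Java on long values and does not use Pig's classes; AlgebraicTopSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-alone sketch of staged top-n; not Pig's TOP.Initial,
// TOP.Intermed, or TOP.Final.
class AlgebraicTopSketch {

    // Shared helper: keep the n largest values, returned largest first.
    static List<Long> keepTop(int n, List<Long> values) {
        PriorityQueue<Long> heap = new PriorityQueue<>(); // min-heap
        for (long v : values) {
            heap.offer(v);
            if (heap.size() > n) {
                heap.poll(); // drop the smallest retained value
            }
        }
        List<Long> out = new ArrayList<>(heap);
        out.sort(Collections.reverseOrder());
        return out;
    }

    // Initial stage (assumed role): reduce one chunk of raw input.
    static List<Long> initial(int n, List<Long> chunk) {
        return keepTop(n, chunk);
    }

    // Intermediate stage (assumed role): merge partial results into a
    // smaller partial result.
    static List<Long> intermed(int n, List<List<Long>> partials) {
        List<Long> merged = new ArrayList<>();
        for (List<Long> p : partials) {
            merged.addAll(p);
        }
        return keepTop(n, merged);
    }

    // Final stage (assumed role): merge the remaining partials into the answer.
    static List<Long> finalStage(int n, List<List<Long>> partials) {
        return intermed(n, partials);
    }

    public static void main(String[] args) {
        int n = 3;
        List<Long> p1 = initial(n, List.of(5L, 1L, 9L));
        List<Long> p2 = initial(n, List.of(7L, 3L));
        List<Long> p3 = initial(n, List.of(8L, 2L, 6L));
        List<Long> merged = intermed(n, List.of(p1, p2));
        System.out.println(finalStage(n, List.of(merged, p3)));
    }
}
```

The staged result matches a single pass over all values because any value in the overall top-n must also be in the top-n of its own chunk, so discarding the rest of each chunk early loses nothing.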


Copyright © The Apache Software Foundation