org.apache.pig.piggybank.evaluation
Class ExtremalTupleByNthField

java.lang.Object
  extended by org.apache.pig.EvalFunc<Tuple>
      extended by org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField
All Implemented Interfaces:
Accumulator<Tuple>, Algebraic

public class ExtremalTupleByNthField
extends EvalFunc<Tuple>
implements Algebraic, Accumulator<Tuple>

This class is similar to MaxTupleBy1stField except that it allows you to specify with field to use (instead of just using 1st field) and to specify ascending or descending. The first parameter in the constructor specifies which field to use and the second parameter to the constructor specifies which extremal to retrieve. Strings prefixed by "min", "least", "desc", "small" and "-", irrespective of capitalization and leading white spacing, specifies the computation of the minimum and all other strings means maximum;

Example1: Invoking the UDF

e.g. Using this udf:
define myMin ExtremalTupleByNthField( '4', 'min' ); T = group G ALL; R = foreach T generate myMin(G); is equivalent to:
T = order G by $3 asc; R = limit G 1; Note above 4 indicates the field with index 3 in the tuple. The 4th field can be any comparable type, so you can use float, int, string, or even tuples. By default constructor, this UDF behaves as MaxTupleBy1stField in that it chooses the max tuple by the comparable in the first field.

Example 2: Default behavior

This class also has a one parameter constructor that specifies the index and takes the max tuple from the bag. define myMax ExtremalTupleByNthField( '3' ); T = group G ALL; R = foreach T generate myMax(G); is equivalent to:
T = order G by $2 desc; R = limit G 1;

Example 3: Choosing a Large Bag or Tuple

Another possible use case is the choosing of larger or smaller bags/tuples. In pig, bags and tuples are comparable and the comparison is based on size. define biggestBag ExtremalTupleByNthField('1', max); R = group TABLE by (key1, key2); G = cogroup L by key1, R by group.key1; V = foreach G generate L, biggestBag(R); This results in each L(eft) bag associated with only the largest bag from the R(ight) table. If all bags in R are of equal size, the comparator continues on to perform element-wise comparison. In case of a complete tie in the comparison, which result is returned is nondeterministic. But because this class is able to compare any comparable we are able to specify a secondary key.

Example 4: Secondary Sort Key

define biggestBag ExtremalTupleByNthField('1', max); G = cogroup L by key1, M by key1, R by key1; V = foreach G generate FLATTEN(L), biggestBag(R.($0, $1, $2, $5)) as best_result_by_0, biggestBag(R.($3, $1, $2, $5)) as best_result_by_3, biggestBag(M.($0, $2)) as best_misc_data; this will generate two sets of results and misc data based on two separate criterion. Since all tuples in the bags have the same size (4, 4, 2 respectively), the tuple comparator continues on and compares the members of tuples until it finds one. best_result_by_0 and best_result_by3 are ordered by 1st and 4th member of the tuples. Within each group, ties are broken by second and third field. Finally, note that the udf implements both Algebraic and Accumulator, so it is relatively efficient because it's a one-pass algorithm.


Nested Class Summary
static class ExtremalTupleByNthField.HelperClass
          Utility classes and methods
 
Nested classes/interfaces inherited from class org.apache.pig.EvalFunc
EvalFunc.SchemaType
 
Field Summary
 
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
 
Constructor Summary
ExtremalTupleByNthField()
          Constructors
ExtremalTupleByNthField(String fieldIndexString)
           
ExtremalTupleByNthField(String fieldIndexString, String order)
           
 
Method Summary
 void accumulate(Tuple b)
          Pass tuples to the UDF.
 void cleanup()
          Called after getValue() to prepare processing for next key.
 Tuple exec(Tuple input)
          The EvalFunc interface
protected static Tuple extreme(int pind, int psign, Tuple input, PigProgressable reporter)
           
 String getFinal()
          Get the final function.
 String getInitial()
          Algebraic interface
 String getIntermed()
          Get the intermediate function.
 Type getReturnType()
          Get the Type that this EvalFunc returns.
 Tuple getValue()
          Called when all tuples from current key have been passed to accumulate.
 Schema outputSchema(Schema input)
          Report the schema of the output of this UDF.
protected static int parseFieldIndex(String inputFieldIndex)
           
protected static int parseOrdering(String order)
           
 
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getSchemaName, getSchemaType, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ExtremalTupleByNthField

public ExtremalTupleByNthField()
                        throws ExecException
Constructors

Throws:
ExecException

ExtremalTupleByNthField

public ExtremalTupleByNthField(String fieldIndexString)
                        throws ExecException
Throws:
ExecException

ExtremalTupleByNthField

public ExtremalTupleByNthField(String fieldIndexString,
                               String order)
                        throws ExecException
Throws:
ExecException
Method Detail

exec

public Tuple exec(Tuple input)
           throws IOException
The EvalFunc interface

Specified by:
exec in class EvalFunc<Tuple>
Parameters:
input - the Tuple to be processed.
Returns:
result, of type T.
Throws:
IOException

getReturnType

public Type getReturnType()
Description copied from class: EvalFunc
Get the Type that this EvalFunc returns.

Overrides:
getReturnType in class EvalFunc<Tuple>
Returns:
Type

outputSchema

public Schema outputSchema(Schema input)
Description copied from class: EvalFunc
Report the schema of the output of this UDF. Pig will make use of this in error checking, optimization, and planning. The schema of input data to this UDF is provided.

The default implementation interprets the OutputSchema annotation, if one is present. Otherwise, it returns null (no known output schema).

Overrides:
outputSchema in class EvalFunc<Tuple>
Parameters:
input - Schema of the input
Returns:
Schema of the output

getInitial

public String getInitial()
Algebraic interface

Specified by:
getInitial in interface Algebraic
Returns:
A function name of f_init. f_init should be an eval func. The return type of f_init.exec() has to be Tuple

getIntermed

public String getIntermed()
Description copied from interface: Algebraic
Get the intermediate function.

Specified by:
getIntermed in interface Algebraic
Returns:
A function name of f_intermed. f_intermed should be an eval func. The return type of f_intermed.exec() has to be Tuple

getFinal

public String getFinal()
Description copied from interface: Algebraic
Get the final function.

Specified by:
getFinal in interface Algebraic
Returns:
A function name of f_final. f_final should be an eval func parametrized by the same datum as the eval func implementing this interface.

accumulate

public void accumulate(Tuple b)
                throws IOException
Description copied from interface: Accumulator
Pass tuples to the UDF.

Specified by:
accumulate in interface Accumulator<Tuple>
Parameters:
b - A tuple containing a single field, which is a bag. The bag will contain the set of tuples being passed to the UDF in this iteration.
Throws:
IOException

cleanup

public void cleanup()
Description copied from interface: Accumulator
Called after getValue() to prepare processing for next key.

Specified by:
cleanup in interface Accumulator<Tuple>

getValue

public Tuple getValue()
Description copied from interface: Accumulator
Called when all tuples from current key have been passed to accumulate.

Specified by:
getValue in interface Accumulator<Tuple>
Returns:
the value for the UDF for this key.

extreme

protected static final Tuple extreme(int pind,
                                     int psign,
                                     Tuple input,
                                     PigProgressable reporter)
                              throws ExecException
Throws:
ExecException

parseFieldIndex

protected static int parseFieldIndex(String inputFieldIndex)
                              throws ExecException
Throws:
ExecException

parseOrdering

protected static int parseOrdering(String order)


Copyright © 2007-2012 The Apache Software Foundation