org.apache.pig
Class EvalFunc<T>

java.lang.Object
  extended by org.apache.pig.EvalFunc<T>
Direct Known Subclasses:
ABS, ABS, AccumulatorEvalFunc, AddDuration, AlgebraicByteArrayMathBase, AlgebraicByteArrayMathBase.Initial, AlgebraicDoubleMathBase, AlgebraicFloatMathBase, AlgebraicIntMathBase, AlgebraicLongMathBase, AlgebraicMathBase.Final, AlgebraicMathBase.Intermediate, ARITY, AVG, AVG.Final, AVG.Initial, AVG.Intermediate, BagSize, BagToString, BagToTuple, Base, Base, Bin, BinCond, BuildBloomBase, CONCAT, ConstantSize, copySign, COR, COR, COR.Final, COR.Final, COR.Initial, COR.Initial, COR.Intermed, COR.Intermed, COUNT, COUNT_STAR, COUNT_STAR.Final, COUNT_STAR.Initial, COUNT_STAR.Intermediate, COUNT.Final, COUNT.Initial, COUNT.Intermediate, COV, COV, COV.Final, COV.Final, COV.Initial, COV.Initial, COV.Intermed, COV.Intermed, CubeDimensions, CurrentTime, CustomFormatToISO, DateExtractor, DaysBetween, Decode, DIFF, DiffDate, Distinct, Distinct.Final, Distinct.Initial, Distinct.Intermediate, DoubleAbs, DoubleAvg, DoubleAvg.Final, DoubleAvg.Initial, DoubleAvg.Intermediate, DoubleCopySign, DoubleGetExponent, DoubleMax, DoubleMin, DoubleNextAfter, DoubleNextup, DoubleRound, DoubleRound, DoubleSignum, DoubleUlp, ExtremalTupleByNthField, ExtremalTupleByNthField.HelperClass, FilterFunc, FindQuantiles, FloatAbs, FloatAbs, FloatAvg, FloatAvg.Final, FloatAvg.Initial, FloatAvg.Intermediate, FloatCopySign, FloatGetExponent, FloatMax, FloatMin, FloatNextAfter, FloatNextup, FloatRound, FloatRound, FloatSignum, FloatUlp, GenericInvoker, GetDay, getExponent, GetHour, GetMemNumRows, GetMilliSecond, GetMinute, GetMonth, GetSecond, GetWeek, GetWeekYear, GetYear, GFAny, GFCross, GFReplicate, GroovyEvalFunc, HashFNV, HostExtractor, HoursBetween, IdentityColumn, INDEXOF, IntAbs, IntAbs, IntAvg, IntAvg.Final, IntAvg.Initial, IntAvg.Intermediate, IntMax, IntMin, INVERSEMAP, IsDouble, IsFloat, IsInt, IsLong, IsNumeric, ISODaysBetween, ISOHoursBetween, ISOMinutesBetween, ISOMonthsBetween, ISOSecondsBetween, ISOToDay, ISOToHour, ISOToMinute, ISOToMonth, ISOToSecond, ISOToUnix, ISOToWeek, ISOToYear, ISOYearsBetween, JrubyAlgebraicEvalFunc.AlgebraicFunctionWrapper, JrubyEvalFunc, JsFunction, JythonFunction, KEYSET, LAST_INDEX_OF, LCFIRST, LENGTH, LongAbs, LongAbs, LongAvg, LongAvg.Final, LongAvg.Initial, LongAvg.Intermediate, LongMax, LongMin, LookupInFiles, LOWER, MapSize, MAX, MaxTupleBy1stField, MaxTupleBy1stField.Final, MaxTupleBy1stField.Initial, MaxTupleBy1stField.Intermediate, MilliSecondsBetween, MIN, MinutesBetween, MonthsBetween, nextAfter, NEXTUP, PartitionSkewedKeys, RANDOM, RANDOM, ReadScalars, REGEX_EXTRACT, REGEX_EXTRACT_ALL, RegexExtract, RegexExtractAll, RegexMatch, REPLACE, Reverse, RollupDimensions, ROUND, ROUND, SCALB, SearchEngineExtractor, SearchQuery, SearchTermExtractor, SecondsBetween, SIGNUM, SIZE, STARTSWITH, StringConcat, StringMax, StringMax.Final, StringMax.Initial, StringMax.Intermediate, StringMin, StringMin.Final, StringMin.Initial, StringMin.Intermediate, StringSize, STRSPLIT, SUBSTRING, SubtractDuration, ToBag, TOBAG, ToDate, ToDate2ARGS, ToDate3ARGS, ToDateISO, TOKENIZE, TOMAP, ToMilliSeconds, Top, TOP, Top.Final, TOP.Final, Top.Initial, TOP.Initial, Top.Intermed, TOP.Intermed, ToString, ToTuple, TOTUPLE, ToUnixTime, TRIM, TupleSize, TypedOutputEvalFunc, UCFIRST, ULP, UnixToISO, UPPER, UPPER, VALUELIST, VALUESET, WeeksBetween, YearsBetween

@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class EvalFunc<T>
extends Object

The class is used to implement functions to be applied to fields in a dataset. The function is applied to each Tuple in the set. The programmer should not make assumptions about state maintained between invocations of the exec() method since the Pig runtime will schedule and localize invocations based on information provided at runtime. The programmer also should not make assumptions about when or how many times the class will be instantiated, since it may be instantiated multiple times in both the front and back end.


Field Summary
protected  org.apache.commons.logging.Log log
          Logging object.
protected  PigLogger pigLogger
          Logger for aggregating warnings.
protected  PigProgressable reporter
          Reporter to send heartbeats to Hadoop.
protected  Type returnType
          Return type of this instance of EvalFunc.
 
Constructor Summary
EvalFunc()
           
 
Method Summary
abstract  T exec(Tuple input)
          This callback method must be implemented by all subclasses.
 void finish()
          Placeholder for cleanup to be performed at the end.
 List<FuncSpec> getArgToFuncMapping()
          Allow a UDF to specify type specific implementations of itself.
 List<String> getCacheFiles()
          Allow a UDF to specify a list of files it would like placed in the distributed cache.
 Schema getInputSchema()
          This method is intended to be called by the user in EvalFunc to get the input schema of the EvalFunc
 org.apache.commons.logging.Log getLogger()
           
 PigLogger getPigLogger()
           
 PigProgressable getReporter()
           
 Type getReturnType()
          Get the Type that this EvalFunc returns.
protected  String getSchemaName(String name, Schema input)
           
 boolean isAsynchronous()
          Deprecated. 
 Schema outputSchema(Schema input)
          Report the schema of the output of this UDF.
 void progress()
          Utility method to allow UDF to report progress.
 void setInputSchema(Schema input)
          This method is for internal use.
 void setPigLogger(PigLogger pigLogger)
          Set the PigLogger object.
 void setReporter(PigProgressable reporter)
          Set the reporter.
 void setUDFContextSignature(String signature)
          This method will be called by Pig both in the front end and back end to pass a unique signature to the EvalFunc.
 void warn(String msg, Enum warningEnum)
          Issue a warning.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

reporter

protected PigProgressable reporter
Reporter to send heartbeats to Hadoop. If exec will take more than a a few seconds PigProgressable.progress() should be called occasionally to avoid timeouts. Default Hadoop timeout is 600 seconds.


log

protected org.apache.commons.logging.Log log
Logging object. Log calls made on the front end will be sent to pig's log on the client. Log calls made on the backend will be sent to stdout and can be seen in the Hadoop logs.


pigLogger

protected PigLogger pigLogger
Logger for aggregating warnings. Any warnings to be sent to the user should be logged to this via PigLogger.warn(java.lang.Object, java.lang.String, java.lang.Enum).


returnType

protected Type returnType
Return type of this instance of EvalFunc.

Constructor Detail

EvalFunc

public EvalFunc()
Method Detail

getSchemaName

protected String getSchemaName(String name,
                               Schema input)

getReturnType

public Type getReturnType()
Get the Type that this EvalFunc returns.

Returns:
Type

progress

public final void progress()
Utility method to allow UDF to report progress. If exec will take more than a a few seconds PigProgressable.progress() should be called occasionally to avoid timeouts. Default Hadoop timeout is 600 seconds.


warn

public final void warn(String msg,
                       Enum warningEnum)
Issue a warning. Warning messages are aggregated and reported to the user.

Parameters:
msg - String message of the warning
warningEnum - type of warning

finish

public void finish()
Placeholder for cleanup to be performed at the end. User defined functions can override. Default implementation is a no-op.


exec

public abstract T exec(Tuple input)
                throws IOException
This callback method must be implemented by all subclasses. This is the method that will be invoked on every Tuple of a given dataset. Since the dataset may be divided up in a variety of ways the programmer should not make assumptions about state that is maintained between invocations of this method.

Parameters:
input - the Tuple to be processed.
Returns:
result, of type T.
Throws:
IOException

outputSchema

public Schema outputSchema(Schema input)
Report the schema of the output of this UDF. Pig will make use of this in error checking, optimization, and planning. The schema of input data to this UDF is provided.

The default implementation interprets the OutputSchema annotation, if one is present. Otherwise, it returns null (no known output schema).

Parameters:
input - Schema of the input
Returns:
Schema of the output

isAsynchronous

@Deprecated
public boolean isAsynchronous()
Deprecated. 

This function should be overriden to return true for functions that return their values asynchronously. Currently pig never attempts to execute a function asynchronously.

Returns:
true if the function can be executed asynchronously.

getReporter

public PigProgressable getReporter()

setReporter

public final void setReporter(PigProgressable reporter)
Set the reporter. Called by Pig to provide a reference of the reporter to the UDF.

Parameters:
reporter - Hadoop reporter

getArgToFuncMapping

public List<FuncSpec> getArgToFuncMapping()
                                   throws FrontendException
Allow a UDF to specify type specific implementations of itself. For example, an implementation of arithmetic sum might have int and float implementations, since integer arithmetic performs much better than floating point arithmetic. Pig's typechecker will call this method and using the returned list plus the schema of the function's input data, decide which implementation of the UDF to use.

Returns:
A List containing FuncSpec objects representing the EvalFunc class which can handle the inputs corresponding to the schema in the objects. Each FuncSpec should be constructed with a schema that describes the input for that implementation. For example, the sum function above would return two elements in its list:
  1. FuncSpec(this.getClass().getName(), new Schema(new Schema.FieldSchema(null, DataType.DOUBLE)))
  2. FuncSpec(IntSum.getClass().getName(), new Schema(new Schema.FieldSchema(null, DataType.INTEGER)))
This would indicate that the main implementation is used for doubles, and the special implementation IntSum is used for ints.
Throws:
FrontendException

getCacheFiles

public List<String> getCacheFiles()
Allow a UDF to specify a list of files it would like placed in the distributed cache. These files will be put in the cache for every job the UDF is used in. The default implementation returns null.

Returns:
A list of files

getPigLogger

public PigLogger getPigLogger()

setPigLogger

public final void setPigLogger(PigLogger pigLogger)
Set the PigLogger object. Called by Pig to provide a reference to the UDF.

Parameters:
pigLogger - PigLogger object.

getLogger

public org.apache.commons.logging.Log getLogger()

setUDFContextSignature

public void setUDFContextSignature(String signature)
This method will be called by Pig both in the front end and back end to pass a unique signature to the EvalFunc. The signature can be used to store into the UDFContext any information which the EvalFunc needs to store between various method invocations in the front end and back end.

Parameters:
signature - a unique signature to identify this EvalFunc

setInputSchema

public void setInputSchema(Schema input)
This method is for internal use. It is called by Pig core in both front-end and back-end to setup the right input schema for EvalFunc


getInputSchema

public Schema getInputSchema()
This method is intended to be called by the user in EvalFunc to get the input schema of the EvalFunc



Copyright © 2007-2012 The Apache Software Foundation