org.apache.pig.builtin
Class CubeDimensions

java.lang.Object
  extended by org.apache.pig.EvalFunc<DataBag>
      extended by org.apache.pig.builtin.CubeDimensions

public class CubeDimensions
extends EvalFunc<DataBag>

Produces a DataBag containing every combination of the argument tuple's members, as in a data cube: each member is either kept or replaced by the "all" marker. For example, (a, b, c) produces the following bag:

 { (a, b, c), (null, null, null), (a, b, null), (a, null, c),
   (a, null, null), (null, b, c), (null, null, c), (null, b, null) }
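
The enumeration above can be sketched in plain Java without any Pig types. This is only an illustration of the idea, not the actual implementation of this class: the class name CubeCombinationsSketch and the helper combinations are hypothetical, and the order of the rows is arbitrary, since a bag is unordered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CubeCombinationsSketch {
    // Hypothetical helper, not part of Pig: returns all 2^n variants of dims,
    // where each position is either kept or replaced by allMarker.
    static List<List<String>> combinations(List<String> dims, String allMarker) {
        int n = dims.size();
        List<List<String>> out = new ArrayList<>(1 << n);
        // Bit i of mask set means position i is replaced by the marker.
        for (int mask = 0; mask < (1 << n); mask++) {
            List<String> row = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                row.add((mask & (1 << i)) != 0 ? allMarker : dims.get(i));
            }
            out.add(row);
        }
        return out;
    }

    public static void main(String[] args) {
        // A null marker mirrors the default behaviour described in this page.
        for (List<String> row : combinations(Arrays.asList("a", "b", "c"), null)) {
            System.out.println(row);
        }
    }
}
```

A tuple with n members yields 2^n rows, so three dimensions give the eight tuples listed above.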
 

The "all" marker is null by default, but can be set to an arbitrary string by invoking a constructor (via a DEFINE). The constructor takes a single argument, the string you want to represent "all".

Usage goes something like this:

 -- measure is assumed to be a numeric field produced by the loader
 events = load '/logs/events' using EventLoader() as (lang, event, app_id, measure);
 cubed = foreach events generate
   FLATTEN(CubeDimensions(lang, event, app_id))
     as (lang, event, app_id),
   measure;
 cube = foreach (group cubed
                 by (lang, event, app_id) parallel $P)
        generate
   flatten(group) as (lang, event, app_id),
   COUNT_STAR(cubed),
   SUM(cubed.measure);
 store cube into 'event_cube';
 

Note: doing this with non-algebraic aggregations on large data can result in very slow reducers, because the group in which every dimension is the "all" marker receives every record in your relation.
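
To make the skew concrete, the following sketch counts how many cubed rows land in each group for a tiny in-memory data set. It is a plain-Java illustration under stated assumptions, not Pig code: the class CubeSkewSketch, the helper groupSizes, and the sample records are all hypothetical, and null stands in for the default "all" marker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CubeSkewSketch {
    // Hypothetical helper: expands every record into its 2^n cube rows and
    // counts how many rows fall under each group key.
    static Map<List<String>, Integer> groupSizes(List<List<String>> records) {
        Map<List<String>, Integer> sizes = new HashMap<>();
        for (List<String> rec : records) {
            int n = rec.size();
            for (int mask = 0; mask < (1 << n); mask++) {
                List<String> key = new ArrayList<>(n);
                for (int i = 0; i < n; i++) {
                    // null plays the role of the default "all" marker.
                    key.add((mask & (1 << i)) != 0 ? null : rec.get(i));
                }
                sizes.merge(key, 1, Integer::sum);
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<List<String>> records = Arrays.asList(
            Arrays.asList("en", "click"),
            Arrays.asList("en", "view"),
            Arrays.asList("fr", "click"));
        // The all-null key collects one row per input record.
        groupSizes(records).forEach((key, count) ->
            System.out.println(key + " -> " + count));
    }
}
```

With three input records, the key made up entirely of markers collects three rows, one per record, while more specific keys collect fewer. On a real relation that single group grows with the full input size, which is why a non-algebraic aggregation over it can stall one reducer.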


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.pig.EvalFunc
EvalFunc.SchemaType
 
Field Summary
 
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
 
Constructor Summary
CubeDimensions()
           
CubeDimensions(String allMarker)
           
 
Method Summary
static void convertNullToUnknown(Tuple tuple)
           
 DataBag exec(Tuple tuple)
          This callback method must be implemented by all subclasses.
 Schema outputSchema(Schema input)
          Report the schema of the output of this UDF.
 
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CubeDimensions

public CubeDimensions()

CubeDimensions

public CubeDimensions(String allMarker)
Method Detail

exec

public DataBag exec(Tuple tuple)
             throws IOException
Description copied from class: EvalFunc
This callback method must be implemented by all subclasses. It is invoked on every Tuple of a given dataset. Because the dataset may be divided up in a variety of ways, the programmer should not make assumptions about state maintained between invocations of this method.

Specified by:
exec in class EvalFunc<DataBag>
Parameters:
tuple - the Tuple to be processed.
Returns:
result, of type T (here, DataBag).
Throws:
IOException

convertNullToUnknown

public static void convertNullToUnknown(Tuple tuple)
                                 throws ExecException
Throws:
ExecException

outputSchema

public Schema outputSchema(Schema input)
Description copied from class: EvalFunc
Report the schema of the output of this UDF. Pig will make use of this in error checking, optimization, and planning. The schema of input data to this UDF is provided.

The default implementation interprets the OutputSchema annotation, if one is present. Otherwise, it returns null (no known output schema).

Overrides:
outputSchema in class EvalFunc<DataBag>
Parameters:
input - Schema of the input
Returns:
Schema of the output


Copyright © 2007-2012 The Apache Software Foundation