org.apache.pig.impl.builtin
Class PartitionSkewedKeys

java.lang.Object
  extended by org.apache.pig.EvalFunc<Map<String,Object>>
      extended by org.apache.pig.impl.builtin.PartitionSkewedKeys

public class PartitionSkewedKeys
extends EvalFunc<Map<String,Object>>

Partition reducers for skewed keys. This is used in skewed join during sampling process. It figures out how many reducers required to process a skewed key without causing spill and allocate this number of reducers to this key. This UDF outputs a map which contains 2 keys:

  • "totalreducers": the value is an integer wich indicates the number of total reducers for this join job
  • "partition.list": the value is a bag which contains a list of tuples with each tuple representing partitions for a skewed key. The tuple has format of <join key>,<min index of reducer>, <max index of reducer>
  • For example, a join job configures 10 reducers, and the sampling process finds out 2 skewed keys, "swpv" needs 4 reducers and "swps" needs 2 reducers. The output file would be like following: {totalreducers=10, partition.list={(swpv,0,3), (swps,4,5)}} The name of this file is set into next MR job which does the actual join. That job uses this information to partition skewed keys properly


    Nested Class Summary
     
    Nested classes/interfaces inherited from class org.apache.pig.EvalFunc
    EvalFunc.SchemaType
     
    Field Summary
    static float DEFAULT_PERCENT_MEMUSAGE
               
    static String PARTITION_LIST
               
    static String TOTAL_REDUCERS
               
     
    Fields inherited from class org.apache.pig.EvalFunc
    pigLogger, reporter, returnType
     
    Constructor Summary
    PartitionSkewedKeys()
               
    PartitionSkewedKeys(String[] args)
               
     
    Method Summary
     Map<String,Object> exec(Tuple in)
              first field in the input tuple is the number of reducers second field is the *sorted* bag of samples this should be called only once
     
    Methods inherited from class org.apache.pig.EvalFunc
    finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, isAsynchronous, outputSchema, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
     
    Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
     

    Field Detail

    PARTITION_LIST

    public static final String PARTITION_LIST
    See Also:
    Constant Field Values

    TOTAL_REDUCERS

    public static final String TOTAL_REDUCERS
    See Also:
    Constant Field Values

    DEFAULT_PERCENT_MEMUSAGE

    public static final float DEFAULT_PERCENT_MEMUSAGE
    See Also:
    Constant Field Values
    Constructor Detail

    PartitionSkewedKeys

    public PartitionSkewedKeys()

    PartitionSkewedKeys

    public PartitionSkewedKeys(String[] args)
    Method Detail

    exec

    public Map<String,Object> exec(Tuple in)
                            throws IOException
    first field in the input tuple is the number of reducers second field is the *sorted* bag of samples this should be called only once

    Specified by:
    exec in class EvalFunc<Map<String,Object>>
    Parameters:
    in - the Tuple to be processed.
    Returns:
    result, of type T.
    Throws:
    IOException


    Copyright © 2007-2012 The Apache Software Foundation