org.apache.pig
Class LoadFunc

java.lang.Object
  extended by org.apache.pig.LoadFunc
Direct Known Subclasses:
DefaultIndexableLoader, FileInputLoadFunc, HBaseStorage, MergeJoinIndexer, ReadToEndLoader, RegExLoader, SampleLoader, TableLoader, TextLoader, XMLLoader

public abstract class LoadFunc
extends Object

LoadFunc provides functions directly associated with reading records from a data set.
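
As an illustration only, a minimal loader might be implemented along the following lines. SimpleLineLoader is a hypothetical example, not part of Pig; it assumes line-oriented text input read through Hadoop's TextInputFormat and returns each line as a single-field tuple.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.pig.LoadFunc;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    // Hypothetical loader: returns each line of a text file as a one-field tuple.
    public class SimpleLineLoader extends LoadFunc {

        private RecordReader<LongWritable, Text> reader;
        private final TupleFactory tupleFactory = TupleFactory.getInstance();

        @Override
        public void setLocation(String location, Job job) throws IOException {
            // Hand the (already absolute) location to the underlying InputFormat.
            FileInputFormat.setInputPaths(job, location);
        }

        @Override
        public InputFormat getInputFormat() throws IOException {
            // Stock Hadoop format, returned as an instance as required.
            return new TextInputFormat();
        }

        @SuppressWarnings("unchecked")
        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
            this.reader = reader;   // reader is already bound to this task's split
        }

        @Override
        public Tuple getNext() throws IOException {
            try {
                if (!reader.nextKeyValue()) {
                    return null;                        // no more records in this split
                }
                Tuple t = tupleFactory.newTuple(1);     // a new Tuple on every call
                t.set(0, reader.getCurrentValue().toString());
                return t;
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
        }
    }

With such a class on the classpath, it could be used from Pig Latin as, for example, A = LOAD 'data' USING SimpleLineLoader();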


Constructor Summary
LoadFunc()
           
 
Method Summary
static String getAbsolutePath(String location, org.apache.hadoop.fs.Path curDir)
          Construct the absolute path from the file location and the current directory.
abstract  org.apache.hadoop.mapreduce.InputFormat getInputFormat()
          This will be called during planning on the front end.
 LoadCaster getLoadCaster()
          This will be called on the front end during planning and not on the back end during execution.
abstract  Tuple getNext()
          Retrieves the next tuple to be processed.
static String[] getPathStrings(String commaSeparatedPaths)
          Parse comma separated path strings into a string array.
static String join(AbstractCollection<String> s, String delimiter)
          Join multiple strings into a string delimited by the given delimiter.
abstract  void prepareToRead(org.apache.hadoop.mapreduce.RecordReader reader, PigSplit split)
          Initializes LoadFunc for reading data.
 String relativeToAbsolutePath(String location, org.apache.hadoop.fs.Path curDir)
          This method is called by the Pig runtime in the front end to convert the input location to an absolute path if the location is relative.
abstract  void setLocation(String location, org.apache.hadoop.mapreduce.Job job)
          Communicate to the loader the location of the object(s) being loaded.
 void setUDFContextSignature(String signature)
          This method will be called by Pig both in the front end and back end to pass a unique signature to the LoadFunc.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LoadFunc

public LoadFunc()
Method Detail

relativeToAbsolutePath

public String relativeToAbsolutePath(String location,
                                     org.apache.hadoop.fs.Path curDir)
                              throws IOException
This method is called by the Pig runtime in the front end to convert the input location to an absolute path if the location is relative. The LoadFunc implementation is free to choose how it converts a relative location to an absolute location, since this may depend on what the location string represents (an HDFS path or some other data source).

Parameters:
location - location as provided in the "load" statement of the script
curDir - the current working directory based on any "cd" statements in the script before the "load" statement. If there are no "cd" statements in the script, this would be the home directory - /user/<username>
Returns:
the absolute location based on the arguments passed
Throws:
IOException - if the conversion is not possible
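
For illustration, a loader whose locations are not filesystem paths might override the default behavior as sketched below; the "table:" scheme is invented for the example, and an import of org.apache.hadoop.fs.Path is assumed.

    // Hypothetical: locations using a made-up "table:" scheme are not paths and are
    // returned unchanged; everything else falls back to the default resolution.
    @Override
    public String relativeToAbsolutePath(String location, Path curDir) throws IOException {
        if (location.startsWith("table:")) {
            return location;
        }
        return super.relativeToAbsolutePath(location, curDir);
    }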

setLocation

public abstract void setLocation(String location,
                                 org.apache.hadoop.mapreduce.Job job)
                          throws IOException
Communicate to the loader the location of the object(s) being loaded. The location string passed to the LoadFunc here is the return value of relativeToAbsolutePath(String, Path). Implementations should use this method to communicate the location (and any other information) to its underlying InputFormat through the Job object. This method will be called in the backend multiple times; implementations should ensure there are no inconsistent side effects due to the repeated calls.

Parameters:
location - Location as returned by relativeToAbsolutePath(String, Path)
job - the Job object; implementations may use it to communicate the location to the InputFormat, and may store, or retrieve earlier stored, information in the UDFContext
Throws:
IOException - if the location is not valid.
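
Continuing the hypothetical SimpleLineLoader sketch from the class description, the fragment below is written to be safe under repeated calls.

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // setInputPaths overwrites any previously set value, so repeated calls to
        // setLocation leave the Job in the same state rather than accumulating paths.
        FileInputFormat.setInputPaths(job, location);
    }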

getInputFormat

public abstract org.apache.hadoop.mapreduce.InputFormat getInputFormat()
                                                                throws IOException
This will be called during planning on the front end. The return value is an instance of InputFormat (rather than the class name) because the load function may need to instantiate the InputFormat in order to control how it is constructed.

Returns:
the InputFormat associated with this loader.
Throws:
IOException - if there is an exception during InputFormat construction
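
A sketch for the hypothetical SimpleLineLoader; because an instance (not a class name) is returned, a loader could equally construct and configure a custom InputFormat here before handing it to Pig.

    @Override
    public InputFormat getInputFormat() throws IOException {
        // A stock Hadoop format needs no special construction; a custom format could
        // be instantiated here with whatever constructor arguments it requires.
        return new TextInputFormat();
    }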

getLoadCaster

public LoadCaster getLoadCaster()
                         throws IOException
This will be called on the front end during planning and not on the back end during execution.

Returns:
the LoadCaster associated with this loader. Returning null indicates that casts from byte array are not supported for this loader.
Throws:
IOException - if there is an exception during LoadCaster construction
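
As a sketch, a text-oriented loader could reuse Pig's built-in UTF-8 caster (org.apache.pig.builtin.Utf8StorageConverter, whose import is assumed); returning null instead would disable casts from bytearray for this loader.

    @Override
    public LoadCaster getLoadCaster() throws IOException {
        return new Utf8StorageConverter();   // delegate bytearray casts to the UTF-8 caster
    }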

prepareToRead

public abstract void prepareToRead(org.apache.hadoop.mapreduce.RecordReader reader,
                                   PigSplit split)
                            throws IOException
Initializes LoadFunc for reading data. This will be called during execution before any calls to getNext. The RecordReader needs to be passed here because it has been instantiated for a particular InputSplit.

Parameters:
reader - RecordReader to be used by this instance of the LoadFunc
split - The input PigSplit to process
Throws:
IOException - if there is an exception during initialization
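
In the hypothetical SimpleLineLoader, this amounts to keeping a handle on the RecordReader that Pig has already opened for the task's InputSplit.

    private RecordReader<LongWritable, Text> reader;

    @SuppressWarnings("unchecked")
    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;   // already positioned at the start of this InputSplit
    }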

getNext

public abstract Tuple getNext()
                       throws IOException
Retrieves the next tuple to be processed. Implementations should NOT reuse tuple objects (or inner member objects) they return across calls and should return a different tuple object in each call.

Returns:
the next tuple to be processed or null if there are no more tuples to be processed.
Throws:
IOException - if there is an exception while retrieving the next tuple
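
The SimpleLineLoader fragment below follows the contract above: it returns null at end of input and builds a fresh Tuple on every call instead of recycling one instance.

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;                                     // split exhausted
            }
            Tuple t = TupleFactory.getInstance().newTuple(1);    // new object each call
            t.set(0, reader.getCurrentValue().toString());
            return t;
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }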

join

public static String join(AbstractCollection<String> s,
                          String delimiter)
Join multiple strings into a string delimited by the given delimiter.

Parameters:
s - a collection of strings
delimiter - the delimiter
Returns:
a 'delimiter' separated string
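
An illustrative call (values invented for the example); note that the parameter type is AbstractCollection, so a concrete class such as java.util.ArrayList is used.

    ArrayList<String> parts = new ArrayList<String>(Arrays.asList("a", "b", "c"));
    String joined = LoadFunc.join(parts, ",");   // "a,b,c"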

getPathStrings

public static String[] getPathStrings(String commaSeparatedPaths)
Parse comma separated path strings into a string array. This method escapes commas in the Hadoop glob pattern of the given paths. This method is borrowed from FileInputFormat. A jira (MAPREDUCE-1205) has been opened to make the method of the same name there accessible. We'll use that method directly once the jira is fixed.

Parameters:
commaSeparatedPaths - a comma separated string
Returns:
an array of path strings
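
An illustrative call (paths invented for the example); under the FileInputFormat-style escaping described above, a comma inside a glob pattern is not treated as a path separator.

    String[] paths = LoadFunc.getPathStrings("/data/a,/data/{2009,2010}/part-*");
    // paths[0] -> "/data/a"
    // paths[1] -> "/data/{2009,2010}/part-*"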

getAbsolutePath

public static String getAbsolutePath(String location,
                                     org.apache.hadoop.fs.Path curDir)
                              throws FrontendException
Construct the absolute path from the file location and the current directory. The current directory is either of the form hdfs://<namenode>:<port>/ in Hadoop MapReduce mode, or of the form file:/// in Hadoop local mode.

Parameters:
location - the location string specified in the load statement
curDir - the current file system directory
Returns:
the absolute path of file in the file system
Throws:
FrontendException - if the scheme of the location is incompatible with the scheme of the file system
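
An illustrative call (namenode address and paths invented for the example; the calling code must handle the declared FrontendException). A relative location would be expected to resolve against the current directory.

    Path curDir = new Path("hdfs://namenode:8020/user/alice");
    String abs = LoadFunc.getAbsolutePath("logs/2010", curDir);
    // expected: "hdfs://namenode:8020/user/alice/logs/2010"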

setUDFContextSignature

public void setUDFContextSignature(String signature)
This method will be called by Pig both in the front end and back end to pass a unique signature to the LoadFunc. The signature can be used to store into the UDFContext any information which the LoadFunc needs to store between various method invocations in the front end and back end. A use case is to store the LoadPushDown.RequiredFieldList passed to it in LoadPushDown.pushProjection(RequiredFieldList) for use in the back end before returning tuples in getNext(). This method will be called before other methods in LoadFunc.

Parameters:
signature - a unique signature to identify this LoadFunc
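
A sketch for the hypothetical SimpleLineLoader: the signature is remembered and later used to key a Properties bag in the UDFContext (org.apache.pig.impl.util.UDFContext; imports of it and java.util.Properties are assumed). getContextProperties is an illustrative helper, not a Pig API.

    private String signature;

    @Override
    public void setUDFContextSignature(String signature) {
        this.signature = signature;
    }

    private Properties getContextProperties() {
        // The same signature is passed on the front end and the back end, so both
        // sides retrieve the same Properties object from the UDFContext.
        return UDFContext.getUDFContext()
                         .getUDFProperties(this.getClass(), new String[] { signature });
    }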

