org.apache.pig.backend.hadoop.accumulo
Class AbstractAccumuloStorage

java.lang.Object
  extended by org.apache.pig.LoadFunc
      extended by org.apache.pig.backend.hadoop.accumulo.AbstractAccumuloStorage
All Implemented Interfaces:
StoreFuncInterface
Direct Known Subclasses:
AccumuloStorage

public abstract class AbstractAccumuloStorage
extends LoadFunc
implements StoreFuncInterface

A LoadStoreFunc for retrieving data from and storing data to Accumulo. A Key/Value pair will be returned as the tuple (key, colfam, colqual, colvis, timestamp, value). All fields except timestamp are DataByteArray; timestamp is a long. Tuples can be written in two forms: (key, colfam, colqual, colvis, value) or (key, colfam, colqual, value).
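The two abstract hooks are getTuple(Key, Value) on the load path and getMutations(Tuple) on the store path. As a rough illustration of how a subclass ties them to the tuple forms above, a minimal hypothetical subclass might look like the sketch below; the class name, the assumption that stored fields can be rendered with toString(), and the omission of visibility and timestamp handling on the write path are illustrative only, not part of this API.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.Collections;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical subclass: one Accumulo entry becomes one 6-field tuple on read,
// and one (key, colfam, colqual, value) tuple becomes one Mutation on write.
public class SimpleAccumuloStorage extends AbstractAccumuloStorage {

    public SimpleAccumuloStorage(String columns, String args)
            throws org.apache.commons.cli.ParseException, IOException {
        super(columns, args);
    }

    @Override
    protected Tuple getTuple(Key key, Value value) throws IOException {
        Tuple t = TupleFactory.getInstance().newTuple(6);
        t.set(0, new DataByteArray(key.getRow().copyBytes()));
        t.set(1, new DataByteArray(key.getColumnFamily().copyBytes()));
        t.set(2, new DataByteArray(key.getColumnQualifier().copyBytes()));
        t.set(3, new DataByteArray(key.getColumnVisibility().copyBytes()));
        t.set(4, key.getTimestamp());                    // long, per the class contract
        t.set(5, new DataByteArray(value.get()));
        return t;
    }

    @Override
    protected Collection<Mutation> getMutations(Tuple tuple) throws IOException {
        // Assumes the (key, colfam, colqual, value) form with string-renderable fields.
        Mutation m = new Mutation(new Text(tuple.get(0).toString()));
        m.put(new Text(tuple.get(1).toString()),
              new Text(tuple.get(2).toString()),
              new Value(tuple.get(3).toString().getBytes(StandardCharsets.UTF_8)));
        return Collections.singletonList(m);
    }
}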


Field Summary
protected static String ASTERISK
           
protected  org.apache.accumulo.core.security.Authorizations authorizations
           
protected  LoadStoreCaster caster
           
protected static char COLON
           
protected  List<Column> columns
           
protected  String columnSeparator
           
protected static char COMMA
           
protected  org.apache.commons.cli.CommandLine commandLine
           
protected  String contextSignature
           
protected  String end
           
protected  boolean ignoreWhitespace
           
protected  String inst
           
protected  long maxLatency
           
protected  long maxMutationBufferSize
           
protected  int maxWriteThreads
           
protected  String password
           
protected  ResourceSchema schema
           
protected  String start
           
protected  AccumuloStorageOptions storageOptions
           
protected  String table
           
protected  org.apache.hadoop.io.Text tableName
           
protected  String user
           
protected  String zookeepers
           
 
Constructor Summary
AbstractAccumuloStorage(String columns, String args)
           
 
Method Summary
 void checkSchema(ResourceSchema s)
          Set the schema for data to be stored.
 void cleanupOnFailure(String failure, org.apache.hadoop.mapreduce.Job job)
          This method will be called by Pig if the job which contains this store fails.
 void cleanupOnSuccess(String location, org.apache.hadoop.mapreduce.Job job)
          This method will be called by Pig if the job which contains this store is successful, and some cleanup of intermediate resources is required.
protected  void clearUnset(org.apache.hadoop.conf.Configuration conf, Map<String,String> entriesToUnset)
          Replaces the given entries in the configuration by clearing the Configuration and re-adding the elements that aren't in the Map of entries to unset
protected  void configureInputFormat(org.apache.hadoop.mapreduce.Job job)
          Method to allow specific implementations to add more elements to the Job for reading data from Accumulo.
protected  void configureOutputFormat(org.apache.hadoop.mapreduce.Job job)
          Method to allow specific implementations to add more elements to the Job for writing data to Accumulo.
protected  void extractArgs(org.apache.commons.cli.CommandLine cli, AccumuloStorageOptions opts)
          Extract arguments passed into the constructor to avoid the URI
protected  org.apache.commons.cli.CommandLine getCommandLine()
           
protected  Map<String,String> getEntries(org.apache.hadoop.conf.Configuration conf, String prefix)
          Extract elements from the Configuration whose keys match the given prefix
 org.apache.hadoop.mapreduce.InputFormat getInputFormat()
          This will be called during planning on the front end.
protected  Map<String,String> getInputFormatEntries(org.apache.hadoop.conf.Configuration conf)
           
 LoadCaster getLoadCaster()
          This will be called on the front end during planning and not on the back end during execution.
protected abstract  Collection<org.apache.accumulo.core.data.Mutation> getMutations(Tuple tuple)
           
 Tuple getNext()
          Retrieves the next tuple to be processed.
 org.apache.hadoop.mapreduce.OutputFormat getOutputFormat()
          Return the OutputFormat associated with StoreFuncInterface.
protected  Map<String,String> getOutputFormatEntries(org.apache.hadoop.conf.Configuration conf)
           
protected abstract  Tuple getTuple(org.apache.accumulo.core.data.Key key, org.apache.accumulo.core.data.Value value)
           
protected  Properties getUDFProperties()
          Returns UDFProperties based on contextSignature.
protected  org.apache.hadoop.mapreduce.RecordWriter<org.apache.hadoop.io.Text,org.apache.accumulo.core.data.Mutation> getWriter()
           
protected  void loadDependentJars(org.apache.hadoop.conf.Configuration conf)
          Ensure that Accumulo's dependent jars are added to the Configuration to alleviate the need for clients to REGISTER dependency jars.
protected  org.apache.accumulo.core.util.Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text> makePair(String first, String second)
           
protected  org.apache.hadoop.io.Text objectToText(Object o, ResourceSchema.ResourceFieldSchema fieldSchema)
           
protected  byte[] objToBytes(Object o, byte type)
           
protected  org.apache.hadoop.io.Text objToText(Object o, byte type)
           
 void prepareToRead(org.apache.hadoop.mapreduce.RecordReader reader, PigSplit split)
          Initializes LoadFunc for reading data.
 void prepareToWrite(org.apache.hadoop.mapreduce.RecordWriter writer)
          Initialize StoreFuncInterface to write data.
 void putNext(Tuple tuple)
          Write a tuple to the data store.
 String relativeToAbsolutePath(String location, org.apache.hadoop.fs.Path curDir)
          This method is called by the Pig runtime in the front end to convert the input location to an absolute path if the location is relative.
 String relToAbsPathForStoreLocation(String location, org.apache.hadoop.fs.Path curDir)
          This method is called by the Pig runtime in the front end to convert the output location to an absolute path if the location is relative.
protected  byte schemaToType(Object o, int i, ResourceSchema.ResourceFieldSchema[] fieldSchemas)
           
protected  byte schemaToType(Object o, ResourceSchema.ResourceFieldSchema fieldSchema)
           
 void setLocation(String location, org.apache.hadoop.mapreduce.Job job)
          Communicate to the loader the location of the object(s) being loaded.
 void setStoreFuncUDFContextSignature(String signature)
          This method will be called by Pig both in the front end and back end to pass a unique signature to the StoreFuncInterface which it can use to store information in the UDFContext which it needs to store between various method invocations in the front end and back end.
 void setStoreLocation(String location, org.apache.hadoop.mapreduce.Job job)
          Communicate to the storer the location where the data needs to be stored.
 void setUDFContextSignature(String signature)
          This method will be called by Pig both in the front end and back end to pass a unique signature to the LoadFunc.
protected  void simpleUnset(org.apache.hadoop.conf.Configuration conf, Map<String,String> entriesToUnset)
          Unsets elements in the Configuration using the unset method
protected  byte[] tupleToBytes(Tuple tuple, int i, ResourceSchema.ResourceFieldSchema[] fieldSchemas)
           
protected  org.apache.hadoop.io.Text tupleToText(Tuple tuple, int i, ResourceSchema.ResourceFieldSchema[] fieldSchemas)
           
protected  void unsetEntriesFromConfiguration(org.apache.hadoop.conf.Configuration conf, Map<String,String> entriesToUnset)
          Removes the given values from the configuration, accounting for changes in the Configuration API given the version of Hadoop being used.
 
Methods inherited from class org.apache.pig.LoadFunc
getAbsolutePath, getPathStrings, join, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

COLON

protected static final char COLON
See Also:
Constant Field Values

COMMA

protected static final char COMMA
See Also:
Constant Field Values

ASTERISK

protected static final String ASTERISK
See Also:
Constant Field Values

storageOptions

protected final AccumuloStorageOptions storageOptions

commandLine

protected final org.apache.commons.cli.CommandLine commandLine

inst

protected String inst

zookeepers

protected String zookeepers

user

protected String user

password

protected String password

table

protected String table

tableName

protected org.apache.hadoop.io.Text tableName

authorizations

protected org.apache.accumulo.core.security.Authorizations authorizations

columns

protected List<Column> columns

start

protected String start

end

protected String end

maxWriteThreads

protected int maxWriteThreads

maxMutationBufferSize

protected long maxMutationBufferSize

maxLatency

protected long maxLatency

columnSeparator

protected String columnSeparator

ignoreWhitespace

protected boolean ignoreWhitespace

caster

protected LoadStoreCaster caster

schema

protected ResourceSchema schema

contextSignature

protected String contextSignature
Constructor Detail

AbstractAccumuloStorage

public AbstractAccumuloStorage(String columns,
                               String args)
                        throws org.apache.commons.cli.ParseException,
                               IOException
Throws:
org.apache.commons.cli.ParseException
IOException
Method Detail

extractArgs

protected void extractArgs(org.apache.commons.cli.CommandLine cli,
                           AccumuloStorageOptions opts)
                    throws IOException
Extract arguments passed into the constructor to avoid the URI

Parameters:
cli -
opts -
Throws:
IOException

getCommandLine

protected org.apache.commons.cli.CommandLine getCommandLine()

getInputFormatEntries

protected Map<String,String> getInputFormatEntries(org.apache.hadoop.conf.Configuration conf)

getOutputFormatEntries

protected Map<String,String> getOutputFormatEntries(org.apache.hadoop.conf.Configuration conf)

unsetEntriesFromConfiguration

protected void unsetEntriesFromConfiguration(org.apache.hadoop.conf.Configuration conf,
                                             Map<String,String> entriesToUnset)
Removes the given values from the configuration, accounting for changes in the Configuration API given the version of Hadoop being used.

Parameters:
conf -
entriesToUnset -

simpleUnset

protected void simpleUnset(org.apache.hadoop.conf.Configuration conf,
                           Map<String,String> entriesToUnset)
Unsets elements in the Configuration using the unset method

Parameters:
conf -
entriesToUnset -

clearUnset

protected void clearUnset(org.apache.hadoop.conf.Configuration conf,
                          Map<String,String> entriesToUnset)
Replaces the given entries in the configuration by clearing the Configuration and re-adding the elements that aren't in the Map of entries to unset

Parameters:
conf -
entriesToUnset -
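As a hedged sketch of the two strategies just described (not the class's verbatim code): simpleUnset relies on Configuration.unset(String), which is available in Hadoop 2.x and later, while clearUnset works on older Hadoop by clearing the Configuration and re-adding everything that should survive. The helper class name below is hypothetical.

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;

// Hypothetical helper illustrating the two unset strategies documented above.
final class ConfigurationUnsetSketch {

    // Mirrors simpleUnset: relies on Configuration.unset(String) (Hadoop 2.x+).
    static void simpleUnset(Configuration conf, Map<String, String> entriesToUnset) {
        for (String key : entriesToUnset.keySet()) {
            conf.unset(key);
        }
    }

    // Mirrors clearUnset: clear the Configuration and re-add every entry that
    // is not in the map of entries to unset.
    static void clearUnset(Configuration conf, Map<String, String> entriesToUnset) {
        Map<String, String> keep = new HashMap<String, String>();
        for (Map.Entry<String, String> entry : conf) {   // Configuration is Iterable
            if (!entriesToUnset.containsKey(entry.getKey())) {
                keep.put(entry.getKey(), entry.getValue());
            }
        }
        conf.clear();
        for (Map.Entry<String, String> entry : keep.entrySet()) {
            conf.set(entry.getKey(), entry.getValue());
        }
    }
}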

getNext

public Tuple getNext()
              throws IOException
Description copied from class: LoadFunc
Retrieves the next tuple to be processed. Implementations should NOT reuse tuple objects (or inner member objects) they return across calls and should return a different tuple object in each call.

Specified by:
getNext in class LoadFunc
Returns:
the next tuple to be processed or null if there are no more tuples to be processed.
Throws:
IOException - if there is an exception while retrieving the next tuple
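A typical implementation of this contract drains the RecordReader handed to prepareToRead and delegates conversion to getTuple. The sketch below assumes a field named reader holding that RecordReader; it illustrates the flow rather than reproducing this class's exact code.

// Inside a loader in the AbstractAccumuloStorage style; `reader` was saved in prepareToRead.
@Override
public Tuple getNext() throws IOException {
    try {
        if (!reader.nextKeyValue()) {
            return null;                              // no more entries for this split
        }
        Key key = (Key) reader.getCurrentKey();
        Value value = (Value) reader.getCurrentValue();
        return getTuple(key, value);                  // subclass-specific conversion
    } catch (InterruptedException e) {
        throw new IOException("Interrupted while reading from Accumulo", e);
    }
}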

getTuple

protected abstract Tuple getTuple(org.apache.accumulo.core.data.Key key,
                                  org.apache.accumulo.core.data.Value value)
                           throws IOException
Throws:
IOException

getInputFormat

public org.apache.hadoop.mapreduce.InputFormat getInputFormat()
Description copied from class: LoadFunc
This will be called during planning on the front end. This is the instance of InputFormat (rather than the class name) because the load function may need to instantiate the InputFormat in order to control how it is constructed.

Specified by:
getInputFormat in class LoadFunc
Returns:
the InputFormat associated with this loader.

prepareToRead

public void prepareToRead(org.apache.hadoop.mapreduce.RecordReader reader,
                          PigSplit split)
Description copied from class: LoadFunc
Initializes LoadFunc for reading data. This will be called during execution before any calls to getNext. The RecordReader needs to be passed here because it has been instantiated for a particular InputSplit.

Specified by:
prepareToRead in class LoadFunc
Parameters:
reader - RecordReader to be used by this instance of the LoadFunc
split - The input PigSplit to process

getWriter

protected org.apache.hadoop.mapreduce.RecordWriter<org.apache.hadoop.io.Text,org.apache.accumulo.core.data.Mutation> getWriter()

getEntries

protected Map<String,String> getEntries(org.apache.hadoop.conf.Configuration conf,
                                        String prefix)
Extract elements from the Configuration whose keys match the given prefix

Parameters:
conf -
prefix -
Returns:
the entries from the Configuration whose keys match the given prefix
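Because Configuration is iterable over its key/value entries, the extraction can be sketched as a simple prefix filter; the method body below is illustrative only.

// Sketch: collect Configuration entries whose keys start with the given prefix.
protected Map<String, String> getEntries(Configuration conf, String prefix) {
    Map<String, String> matches = new HashMap<String, String>();
    for (Map.Entry<String, String> entry : conf) {    // Configuration is Iterable
        if (entry.getKey().startsWith(prefix)) {
            matches.put(entry.getKey(), entry.getValue());
        }
    }
    return matches;
}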

setLocation

public void setLocation(String location,
                        org.apache.hadoop.mapreduce.Job job)
                 throws IOException
Description copied from class: LoadFunc
Communicate to the loader the location of the object(s) being loaded. The location string passed to the LoadFunc here is the return value of LoadFunc.relativeToAbsolutePath(String, Path). Implementations should use this method to communicate the location (and any other information) to its underlying InputFormat through the Job object. This method will be called in the frontend and backend multiple times. Implementations should bear in mind that this method is called multiple times and should ensure there are no inconsistent side effects due to the multiple calls.

Specified by:
setLocation in class LoadFunc
Parameters:
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, Path)
job - the Job object; used to store or retrieve earlier stored information from the UDFContext
Throws:
IOException - if the location is not valid.

makePair

protected org.apache.accumulo.core.util.Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text> makePair(String first,
                                                                                                           String second)

loadDependentJars

protected void loadDependentJars(org.apache.hadoop.conf.Configuration conf)
                          throws IOException
Ensure that Accumulo's dependent jars are added to the Configuration to alleviate the need for clients to REGISTER dependency jars.

Parameters:
conf - the Configuration to which Accumulo's dependent jars are added
Throws:
IOException

configureInputFormat

protected void configureInputFormat(org.apache.hadoop.mapreduce.Job job)
Method to allow specific implementations to add more elements to the Job for reading data from Accumulo.

Parameters:
job -

configureOutputFormat

protected void configureOutputFormat(org.apache.hadoop.mapreduce.Job job)
Method to allow specific implementations to add more elements to the Job for writing data to Accumulo.

Parameters:
job -

relativeToAbsolutePath

public String relativeToAbsolutePath(String location,
                                     org.apache.hadoop.fs.Path curDir)
                              throws IOException
Description copied from class: LoadFunc
This method is called by the Pig runtime in the front end to convert the input location to an absolute path if the location is relative. The LoadFunc implementation is free to choose how it converts a relative location to an absolute location since this may depend on what the location string represents (an HDFS path or some other data source).

Overrides:
relativeToAbsolutePath in class LoadFunc
Parameters:
location - location as provided in the "load" statement of the script
curDir - the current working directory based on any "cd" statements in the script before the "load" statement. If there are no "cd" statements in the script, this would be the home directory -
/user/ 
Returns:
the absolute location based on the arguments passed
Throws:
IOException - if the conversion is not possible

setUDFContextSignature

public void setUDFContextSignature(String signature)
Description copied from class: LoadFunc
This method will be called by Pig both in the front end and back end to pass a unique signature to the LoadFunc. The signature can be used to store into the UDFContext any information which the LoadFunc needs to store between various method invocations in the front end and back end. A use case is to store LoadPushDown.RequiredFieldList passed to it in LoadPushDown.pushProjection(RequiredFieldList) for use in the back end before returning tuples in LoadFunc.getNext(). This method will be called before other methods in LoadFunc.

Overrides:
setUDFContextSignature in class LoadFunc
Parameters:
signature - a unique signature to identify this LoadFunc

setStoreFuncUDFContextSignature

public void setStoreFuncUDFContextSignature(String signature)
Description copied from interface: StoreFuncInterface
This method will be called by Pig both in the front end and back end to pass a unique signature to the StoreFuncInterface which it can use to store information in the UDFContext which it needs to store between various method invocations in the front end and back end. This is necessary because in a Pig Latin script with multiple stores, the different instances of store functions need to be able to find their (and only their) data in the UDFContext object.

Specified by:
setStoreFuncUDFContextSignature in interface StoreFuncInterface
Parameters:
signature - a unique signature to identify this StoreFuncInterface

getUDFProperties

protected Properties getUDFProperties()
Returns UDFProperties based on contextSignature.
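UDFContext scopes properties by class and signature, which is how multiple load/store instances in one script keep their data separate. A sketch of this helper, assuming the standard Pig UDFContext API rather than quoting this class's body:

// Sketch: fetch the Properties bucket keyed by this class and its context signature.
protected Properties getUDFProperties() {
    return UDFContext.getUDFContext()
            .getUDFProperties(this.getClass(), new String[] { contextSignature });
}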


relToAbsPathForStoreLocation

public String relToAbsPathForStoreLocation(String location,
                                           org.apache.hadoop.fs.Path curDir)
                                    throws IOException
Description copied from interface: StoreFuncInterface
This method is called by the Pig runtime in the front end to convert the output location to an absolute path if the location is relative. The StoreFuncInterface implementation is free to choose how it converts a relative location to an absolute location since this may depend on what the location string represents (an HDFS path or some other data source). The static method LoadFunc.getAbsolutePath(java.lang.String, org.apache.hadoop.fs.Path) provides a default implementation for HDFS and the Hadoop local file system, and it can be used to implement this method.

Specified by:
relToAbsPathForStoreLocation in interface StoreFuncInterface
Parameters:
location - location as provided in the "store" statement of the script
curDir - the current working directory based on any "cd" statements in the script before the "store" statement. If there are no "cd" statements in the script, this would be the home directory -
/user/ 
Returns:
the absolute location based on the arguments passed
Throws:
IOException - if the conversion is not possible
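Since an Accumulo location names a table rather than a filesystem path, a storer of this kind typically returns the location unchanged instead of resolving it with LoadFunc.getAbsolutePath. A sketch under that assumption:

// Sketch: Accumulo "locations" are table references, so no path resolution is needed;
// filesystem-backed storers would use LoadFunc.getAbsolutePath(location, curDir) instead.
@Override
public String relToAbsPathForStoreLocation(String location, Path curDir) throws IOException {
    return location;
}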

setStoreLocation

public void setStoreLocation(String location,
                             org.apache.hadoop.mapreduce.Job job)
                      throws IOException
Description copied from interface: StoreFuncInterface
Communicate to the storer the location where the data needs to be stored. The location string passed to the StoreFuncInterface here is the return value of StoreFuncInterface.relToAbsPathForStoreLocation(String, Path). This method will be called in the frontend and backend multiple times. Implementations should bear in mind that this method is called multiple times and should ensure there are no inconsistent side effects due to the multiple calls. StoreFuncInterface.checkSchema(ResourceSchema) will be called before any call to StoreFuncInterface.setStoreLocation(String, Job).

Specified by:
setStoreLocation in interface StoreFuncInterface
Parameters:
location - Location returned by StoreFuncInterface.relToAbsPathForStoreLocation(String, Path)
job - The Job object
Throws:
IOException - if the location is not valid.

getOutputFormat

public org.apache.hadoop.mapreduce.OutputFormat getOutputFormat()
Description copied from interface: StoreFuncInterface
Return the OutputFormat associated with StoreFuncInterface. This will be called on the front end during planning and on the backend during execution.

Specified by:
getOutputFormat in interface StoreFuncInterface
Returns:
the OutputFormat associated with StoreFuncInterface

prepareToWrite

public void prepareToWrite(org.apache.hadoop.mapreduce.RecordWriter writer)
Description copied from interface: StoreFuncInterface
Initialize StoreFuncInterface to write data. This will be called during execution before the call to putNext.

Specified by:
prepareToWrite in interface StoreFuncInterface
Parameters:
writer - RecordWriter to use.

getMutations

protected abstract Collection<org.apache.accumulo.core.data.Mutation> getMutations(Tuple tuple)
                                                                            throws ExecException,
                                                                                   IOException
Throws:
ExecException
IOException

putNext

public void putNext(Tuple tuple)
             throws ExecException,
                    IOException
Description copied from interface: StoreFuncInterface
Write a tuple to the data store.

Specified by:
putNext in interface StoreFuncInterface
Parameters:
tuple - the tuple to store.
Throws:
IOException - if an exception occurs during the write
ExecException
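The write path couples this method to the abstract getMutations hook and to the RecordWriter captured in prepareToWrite. The sketch below assumes a field named writer and uses the tableName field documented above; it illustrates the flow rather than reproducing the class's code.

// Sketch: turn the tuple into Mutations and hand each one to the
// RecordWriter<Text, Mutation> obtained in prepareToWrite.
@Override
public void putNext(Tuple tuple) throws ExecException, IOException {
    Collection<Mutation> mutations = getMutations(tuple);    // subclass-specific mapping
    for (Mutation mutation : mutations) {
        try {
            writer.write(tableName, mutation);                // tableName is the Text field above
        } catch (InterruptedException e) {
            throw new IOException("Interrupted while writing to Accumulo", e);
        }
    }
}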

cleanupOnFailure

public void cleanupOnFailure(String failure,
                             org.apache.hadoop.mapreduce.Job job)
Description copied from interface: StoreFuncInterface
This method will be called by Pig if the job which contains this store fails. Implementations can clean up output locations in this method to ensure that no incorrect/incomplete results are left in the output location

Specified by:
cleanupOnFailure in interface StoreFuncInterface
Parameters:
failure - Location returned by StoreFuncInterface.relToAbsPathForStoreLocation(String, Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.

cleanupOnSuccess

public void cleanupOnSuccess(String location,
                             org.apache.hadoop.mapreduce.Job job)
Description copied from interface: StoreFuncInterface
This method will be called by Pig if the job which contains this store is successful, and some cleanup of intermediate resources is required. Implementations can clean up output locations in this method to ensure that no incorrect/incomplete results are left in the output location

Specified by:
cleanupOnSuccess in interface StoreFuncInterface
Parameters:
location - Location returned by StoreFuncInterface.relToAbsPathForStoreLocation(String, Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.

checkSchema

public void checkSchema(ResourceSchema s)
                 throws IOException
Description copied from interface: StoreFuncInterface
Set the schema for data to be stored. This will be called on the front end during planning if the store is associated with a schema. A Store function should implement this function to check that a given schema is acceptable to it. For example, it can check that the correct partition keys are included; a store function that writes directly to an OutputFormat can make sure the schema will translate in a well-defined way.

Specified by:
checkSchema in interface StoreFuncInterface
Parameters:
s - to be checked
Throws:
IOException - if this schema is not acceptable. It should include a detailed error message indicating what is wrong with the schema.
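A common StoreFunc pattern is to remember the schema here and serialize it into the signature-scoped UDF properties so backend tasks can recover it before putNext. The sketch below follows that pattern; the property key and the use of ObjectSerializer are assumptions, not necessarily what this class does.

// Sketch: keep the schema on the frontend and stash a serialized copy in the
// UDF properties scoped by contextSignature (property key is hypothetical).
@Override
public void checkSchema(ResourceSchema s) throws IOException {
    this.schema = s;
    getUDFProperties().setProperty(contextSignature + ".schema",
            ObjectSerializer.serialize(schema));
}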

tupleToText

protected org.apache.hadoop.io.Text tupleToText(Tuple tuple,
                                                int i,
                                                ResourceSchema.ResourceFieldSchema[] fieldSchemas)
                                         throws IOException
Throws:
IOException

objectToText

protected org.apache.hadoop.io.Text objectToText(Object o,
                                                 ResourceSchema.ResourceFieldSchema fieldSchema)
                                          throws IOException
Throws:
IOException

schemaToType

protected byte schemaToType(Object o,
                            ResourceSchema.ResourceFieldSchema fieldSchema)

schemaToType

protected byte schemaToType(Object o,
                            int i,
                            ResourceSchema.ResourceFieldSchema[] fieldSchemas)

tupleToBytes

protected byte[] tupleToBytes(Tuple tuple,
                              int i,
                              ResourceSchema.ResourceFieldSchema[] fieldSchemas)
                       throws IOException
Throws:
IOException

objToText

protected org.apache.hadoop.io.Text objToText(Object o,
                                              byte type)
                                       throws IOException
Throws:
IOException

objToBytes

protected byte[] objToBytes(Object o,
                            byte type)
                     throws IOException
Throws:
IOException

getLoadCaster

public LoadCaster getLoadCaster()
                         throws IOException
Description copied from class: LoadFunc
This will be called on the front end during planning and not on the back end during execution.

Overrides:
getLoadCaster in class LoadFunc
Returns:
the LoadCaster associated with this loader. Returning null indicates that casts from byte array are not supported for this loader.
Throws:
IOException - if there is an exception during LoadCaster construction


Copyright © 2007-2012 The Apache Software Foundation