org.apache.pig.piggybank.storage
Class PigStorageSchema

java.lang.Object
  extended by org.apache.pig.LoadFunc
      extended by org.apache.pig.FileInputLoadFunc
          extended by org.apache.pig.builtin.PigStorage
              extended by org.apache.pig.piggybank.storage.PigStorageSchema
All Implemented Interfaces:
LoadMetadata, LoadPushDown, OrderedLoadFunc, StoreFuncInterface, StoreMetadata

public class PigStorageSchema
extends PigStorage
implements LoadMetadata, StoreMetadata

This Load/Store Func reads and writes metafiles that allow the schema and aliases to be determined at load time, so schemas for Pig-generated datasets do not have to be entered by hand. It also creates a ".pig_headers" file that simply lists the delimited aliases, which eases export to tools that read files with header lines (just cat the header onto your data). Due to StoreFunc limitations, the metafiles can only be written in MapReduce mode; they can be read in either Local or MapReduce mode.
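
For context, a minimal embedded-Pig sketch of the round trip described above, written against PigServer. The paths, aliases, field names, and the piggybank jar location are hypothetical, and MapReduce mode is chosen only because the metafiles can be written in that mode.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class SchemaRoundTrip {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerJar("piggybank.jar");   // hypothetical path to the piggybank jar

        // Store a dataset together with its schema metafiles and ".pig_headers".
        pig.registerQuery("a = LOAD '/data/in' USING PigStorage(',')"
                + " AS (id:int, name:chararray);");
        pig.store("a", "/data/out",
                "org.apache.pig.piggybank.storage.PigStorageSchema(',')");

        // A later script can reload the data without an AS clause; the schema
        // comes from the metafiles written next to /data/out.
        pig.registerQuery("b = LOAD '/data/out'"
                + " USING org.apache.pig.piggybank.storage.PigStorageSchema(',');");
        System.out.println(pig.dumpSchema("b"));
    }
}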


Nested Class Summary
 
Nested classes/interfaces inherited from interface org.apache.pig.LoadPushDown
LoadPushDown.OperatorSet, LoadPushDown.RequiredField, LoadPushDown.RequiredFieldList, LoadPushDown.RequiredFieldResponse
 
Field Summary
 
Fields inherited from class org.apache.pig.builtin.PigStorage
in, mLog, mRequiredColumns, signature, writer
 
Constructor Summary
PigStorageSchema()

PigStorageSchema(String delim)
 
Method Summary
 Tuple getNext()
          Retrieves the next tuple to be processed.
 String[] getPartitionKeys(String location, org.apache.hadoop.mapreduce.Job job)
          Find what columns are partition keys for this input.
 ResourceSchema getSchema(String location, org.apache.hadoop.mapreduce.Job job)
          Get a schema for the data to be loaded.
 ResourceStatistics getStatistics(String location, org.apache.hadoop.mapreduce.Job job)
          Get statistics about the data to be loaded.
 void setPartitionFilter(Expression partitionFilter)
          Set the filter for partitioning.
 void storeSchema(ResourceSchema schema, String location, org.apache.hadoop.mapreduce.Job job)
          Store the schema of the data being written.
 void storeStatistics(ResourceStatistics stats, String location, org.apache.hadoop.mapreduce.Job job)
          Store statistics about the data being written.
 
Methods inherited from class org.apache.pig.builtin.PigStorage
checkSchema, cleanupOnFailure, equals, equals, getFeatures, getInputFormat, getOutputFormat, hashCode, prepareToRead, prepareToWrite, pushProjection, putNext, relToAbsPathForStoreLocation, setLocation, setStoreFuncUDFContextSignature, setStoreLocation, setUDFContextSignature
 
Methods inherited from class org.apache.pig.FileInputLoadFunc
getSplitComparable
 
Methods inherited from class org.apache.pig.LoadFunc
getAbsolutePath, getLoadCaster, getPathStrings, join, relativeToAbsolutePath
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PigStorageSchema

public PigStorageSchema()

PigStorageSchema

public PigStorageSchema(String delim)

Method Detail

getNext

public Tuple getNext()
              throws IOException
Description copied from class: LoadFunc
Retrieves the next tuple to be processed. Implementations should NOT reuse tuple objects (or inner member objects) they return across calls and should return a different tuple object in each call.

Overrides:
getNext in class PigStorage
Returns:
the next tuple to be processed or null if there are no more tuples to be processed.
Throws:
IOException - if there is an exception while retrieving the next tuple
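
As an illustration of the contract above (not of PigStorage's internal implementation), a hedged sketch of building a fresh Tuple per record. The helper name and delimiter handling are assumptions.

import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

class FreshTupleSketch {
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    // Hypothetical helper: a new Tuple is created on every call, so callers
    // never see an earlier result mutated out from under them.
    static Tuple toTuple(String line, String delimiter) {
        Tuple t = tupleFactory.newTuple();
        for (String field : line.split(delimiter, -1)) {
            t.append(field);
        }
        return t;
    }
}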

getSchema

public ResourceSchema getSchema(String location,
                                org.apache.hadoop.mapreduce.Job job)
                         throws IOException
Description copied from interface: LoadMetadata
Get a schema for the data to be loaded.

Specified by:
getSchema in interface LoadMetadata
Parameters:
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
Returns:
schema for the data to be loaded. This schema should represent all tuples of the returned data. If the schema is unknown or it is not possible to return a schema that represents all returned data, then null should be returned. The schema should not be affected by pushProjection; i.e., getSchema should always return the original schema, even after pushProjection.
Throws:
IOException - if an exception occurs while determining the schema
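
A hedged sketch of querying the stored schema directly. The HDFS path is hypothetical, and constructing a Job purely to supply cluster configuration is an assumption.

import org.apache.hadoop.mapreduce.Job;
import org.apache.pig.ResourceSchema;
import org.apache.pig.piggybank.storage.PigStorageSchema;

public class PeekSchema {
    public static void main(String[] args) throws Exception {
        PigStorageSchema loader = new PigStorageSchema(",");
        Job job = new Job();   // consulted only for cluster configuration
        ResourceSchema schema = loader.getSchema("hdfs:///data/out", job);
        // Per the contract above, null means no usable schema could be determined.
        System.out.println(schema == null ? "no schema available" : schema);
    }
}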

getStatistics

public ResourceStatistics getStatistics(String location,
                                        org.apache.hadoop.mapreduce.Job job)
                                 throws IOException
Description copied from interface: LoadMetadata
Get statistics about the data to be loaded. If no statistics are available, then null should be returned.

Specified by:
getStatistics in interface LoadMetadata
Parameters:
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
Returns:
statistics about the data to be loaded. If no statistics are available, then null should be returned.
Throws:
IOException - if an exception occurs while retrieving statistics

setPartitionFilter

public void setPartitionFilter(Expression partitionFilter)
                        throws IOException
Description copied from interface: LoadMetadata
Set the filter for partitioning. It is assumed that this filter will only contain references to fields given as partition keys in getPartitionKeys. So if the implementation returns null in LoadMetadata.getPartitionKeys(String, Job), then this method is not called by the Pig runtime. This method is also not called by the Pig runtime if there are no partition filter conditions.

Specified by:
setPartitionFilter in interface LoadMetadata
Parameters:
partitionFilter - expression that describes the filter for partitioning
Throws:
IOException - if the filter is not compatible with the storage mechanism or contains non-partition fields.

getPartitionKeys

public String[] getPartitionKeys(String location,
                                 org.apache.hadoop.mapreduce.Job job)
                          throws IOException
Description copied from interface: LoadMetadata
Find what columns are partition keys for this input.

Specified by:
getPartitionKeys in interface LoadMetadata
Parameters:
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
Returns:
array of field names of the partition keys. Implementations should return null to indicate that there are no partition keys.
Throws:
IOException - if an exception occurs while retrieving partition keys
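
The two partition-related methods above follow a fixed calling order. The skeleton below is a hedged illustration of that contract for a hypothetical loader partitioned by a "date" field; it is not a description of how PigStorageSchema itself behaves.

import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.pig.Expression;

class PartitionContractSketch {
    private Expression partitionFilter;

    // Pig calls this first; returning null would mean setPartitionFilter()
    // is never invoked for this loader.
    public String[] getPartitionKeys(String location, Job job) throws IOException {
        return new String[] { "date" };
    }

    // Pig then passes a filter that references only the keys returned above,
    // to be used when pruning the input.
    public void setPartitionFilter(Expression filter) throws IOException {
        this.partitionFilter = filter;
    }
}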

storeSchema

public void storeSchema(ResourceSchema schema,
                        String location,
                        org.apache.hadoop.mapreduce.Job job)
                 throws IOException
Description copied from interface: StoreMetadata
Store the schema of the data being written.

Specified by:
storeSchema in interface StoreMetadata
Parameters:
schema - Schema to be recorded
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
Throws:
IOException
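
A hedged sketch of recording a schema next to data that already exists, by calling storeSchema() directly. The schema string, output path, and the use of Utils.getSchemaFromString to build a ResourceSchema are assumptions for illustration.

import org.apache.hadoop.mapreduce.Job;
import org.apache.pig.ResourceSchema;
import org.apache.pig.impl.util.Utils;
import org.apache.pig.piggybank.storage.PigStorageSchema;

public class WriteSchemaMetafile {
    public static void main(String[] args) throws Exception {
        // Assumed field layout of the existing comma-delimited data.
        ResourceSchema schema =
                new ResourceSchema(Utils.getSchemaFromString("id:int, name:chararray"));
        Job job = new Job();   // consulted only for cluster configuration
        new PigStorageSchema(",").storeSchema(schema, "hdfs:///data/out", job);
    }
}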

storeStatistics

public void storeStatistics(ResourceStatistics stats,
                            String location,
                            org.apache.hadoop.mapreduce.Job job)
                     throws IOException
Description copied from interface: StoreMetadata
Store statistics about the data being written.

Specified by:
storeStatistics in interface StoreMetadata
Parameters:
stats - statistics to be recorded
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
Throws:
IOException

