org.apache.pig
Interface LoadMetadata

All Known Implementing Classes:
AllLoader, AvroStorage, AvroStorage, BinStorage, CSVExcelStorage, FixedWidthLoader, HiveColumnarLoader, HiveColumnarStorage, IndexedStorage, InterStorage, JsonLoader, JsonMetadata, JsonMetadata, LoadFuncMetadataWrapper, ParquetLoader, PigStorage, PigStorageSchema, ReadToEndLoader, SequenceFileInterStorage, Storage, TFileStorage, TrevniStorage

@InterfaceAudience.Public
@InterfaceStability.Evolving
public interface LoadMetadata

This interface defines how to retrieve metadata related to data to be loaded. If a given loader does not implement this interface, it will be assumed that it is unable to provide metadata about the associated data.

Since:
Pig 0.7

Method Summary
 String[] getPartitionKeys(String location, org.apache.hadoop.mapreduce.Job job)
          Find what columns are partition keys for this input.
 ResourceSchema getSchema(String location, org.apache.hadoop.mapreduce.Job job)
          Get a schema for the data to be loaded.
 ResourceStatistics getStatistics(String location, org.apache.hadoop.mapreduce.Job job)
          Get statistics about the data to be loaded.
 void setPartitionFilter(Expression partitionFilter)
          Set the filter for partitioning.
 

Method Detail

getSchema

ResourceSchema getSchema(String location,
                         org.apache.hadoop.mapreduce.Job job)
                         throws IOException
Get a schema for the data to be loaded.

Parameters:
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
Returns:
schema for the data to be loaded. This schema should represent all tuples of the returned data. If the schema is unknown, or it is not possible to return a schema that represents all returned data, then null should be returned. The schema should not be affected by pushProjection, i.e., getSchema should always return the original schema, even after pushProjection has been called.
Throws:
IOException - if an exception occurs while determining the schema
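
Example (illustrative sketch, not part of the Pig API): the return contract above can be mimicked with plain Java types. SketchLoader, its List<String> schema representation, and columnsRead are hypothetical stand-ins; a real loader would implement LoadMetadata and return a ResourceSchema. The sketch only shows that getSchema keeps reporting the original schema after a projection has been pushed, and reports null when the schema is unknown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a loader; this is NOT a Pig class.
// A real loader would implement org.apache.pig.LoadMetadata instead.
class SketchLoader {
    // Full schema of the stored data as "name:type" strings; null means unknown.
    private final List<String> fullSchema;
    // Indexes of the columns the reader was asked to keep; null means all columns.
    private List<Integer> projected;

    SketchLoader(List<String> fullSchema) {
        this.fullSchema = fullSchema;
    }

    // Stand-in for projection pushdown: remember which columns are required.
    void pushProjection(List<Integer> requiredColumns) {
        this.projected = new ArrayList<>(requiredColumns);
    }

    // Mirrors the getSchema contract: always the original schema, or null if unknown.
    List<String> getSchema(String location) {
        return fullSchema == null ? null : new ArrayList<>(fullSchema);
    }

    // Columns the reader would actually produce after projection (shown for contrast).
    List<String> columnsRead() {
        if (fullSchema == null) {
            return null;
        }
        if (projected == null) {
            return new ArrayList<>(fullSchema);
        }
        List<String> out = new ArrayList<>();
        for (int index : projected) {
            out.add(fullSchema.get(index));
        }
        return out;
    }
}
```

In this sketch, pushing a projection of columns 0 and 2 changes what columnsRead returns but leaves the result of getSchema unchanged.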

getStatistics

ResourceStatistics getStatistics(String location,
                                 org.apache.hadoop.mapreduce.Job job)
                                 throws IOException
Get statistics about the data to be loaded. If no statistics are available, then null should be returned. If the implementing class also extends LoadFunc, then LoadFunc.setLocation(String, org.apache.hadoop.mapreduce.Job) is guaranteed to be called before this method.

Parameters:
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
Returns:
statistics about the data to be loaded. If no statistics are available, then null should be returned.
Throws:
IOException - if an exception occurs while retrieving statistics
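
Example (illustrative sketch, not part of the Pig API): the ordering guarantee and the null return can be modelled with plain Java types. StatsSketchLoader, recordSize, and the use of a Long byte count in place of ResourceStatistics are assumptions made for this sketch. The IllegalStateException is also an illustrative choice; it only makes visible that the implementation relies on setLocation having been called first.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a loader; this is NOT a Pig class.
// A Long byte count replaces Pig's ResourceStatistics to keep the sketch self-contained.
class StatsSketchLoader {
    private final Map<String, Long> knownSizesInBytes = new HashMap<>();
    private String currentLocation;

    // Hypothetical helper: pretend statistics were stored alongside the data.
    void recordSize(String location, long bytes) {
        knownSizesInBytes.put(location, bytes);
    }

    // Stand-in for LoadFunc.setLocation, which is called before getStatistics.
    void setLocation(String location) {
        this.currentLocation = location;
    }

    // Mirrors the getStatistics contract: null when no statistics are available.
    Long getStatistics(String location) {
        if (currentLocation == null) {
            throw new IllegalStateException("setLocation was not called first");
        }
        return knownSizesInBytes.get(location); // Map.get returns null for unknown keys
    }
}
```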

getPartitionKeys

String[] getPartitionKeys(String location,
                          org.apache.hadoop.mapreduce.Job job)
                          throws IOException
Find what columns are partition keys for this input.

Parameters:
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
Returns:
array of field names of the partition keys. Implementations should return null to indicate that there are no partition keys.
Throws:
IOException - if an exception occurs while retrieving partition keys

setPartitionFilter

void setPartitionFilter(Expression partitionFilter)
                        throws IOException
Set the filter for partitioning. It is assumed that this filter will only contain references to fields given as partition keys in getPartitionKeys. If the implementation returns null from getPartitionKeys(String, Job), this method is therefore not called by the Pig runtime. This method is also not called by the Pig runtime if there are no partition filter conditions.

Parameters:
partitionFilter - expression that describes the filter for partitioning
Throws:
IOException - if the filter is not compatible with the storage mechanism or contains non-partition fields.
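
Example (illustrative sketch, not Pig's actual planner code): the calling rules described for getPartitionKeys and setPartitionFilter can be modelled as follows. PartitionAware, RecordingLoader, and PlannerSketch are hypothetical names, and the filter is a plain String rather than Pig's Expression. The sketch assumes a caller that asks for the partition keys first and hands over a filter only when keys exist and a filter condition is present.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the partition-related half of LoadMetadata.
// The filter is a plain String here instead of Pig's Expression type.
interface PartitionAware {
    String[] getPartitionKeys(String location);
    void setPartitionFilter(String partitionFilter);
}

// Loader stand-in that records every filter it receives.
class RecordingLoader implements PartitionAware {
    private final String[] keys;
    final List<String> filtersReceived = new ArrayList<>();

    RecordingLoader(String[] keys) {
        this.keys = keys;
    }

    @Override
    public String[] getPartitionKeys(String location) {
        return keys; // null signals "no partition keys"
    }

    @Override
    public void setPartitionFilter(String partitionFilter) {
        filtersReceived.add(partitionFilter);
    }
}

// Hypothetical caller that follows the documented rules.
class PlannerSketch {
    // Returns true only if the filter was handed to the loader.
    static boolean offerFilter(PartitionAware loader, String location, String filter) {
        String[] keys = loader.getPartitionKeys(location);
        if (keys == null) {
            return false; // no partition keys: setPartitionFilter is never called
        }
        if (filter == null) {
            return false; // no partition filter conditions: nothing to set
        }
        loader.setPartitionFilter(filter);
        return true;
    }
}
```

A loader constructed with keys such as year and month receives the filter, while one that returns null from getPartitionKeys never has setPartitionFilter invoked.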


Copyright © 2007-2012 The Apache Software Foundation