org.apache.pig.piggybank.storage
Class JsonMetadata

java.lang.Object
  extended by org.apache.pig.piggybank.storage.JsonMetadata
All Implemented Interfaces:
LoadMetadata, StoreMetadata

public class JsonMetadata
extends Object
implements LoadMetadata, StoreMetadata

Reads and writes metadata as JSON in metafiles stored next to the data.
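As an illustrative, stand-alone sketch (not Pig's actual implementation), the layout this class works with is a JSON "sidecar" metafile written beside the data it describes; the `.pig_schema` and `.pig_stats` file names are the real Pig conventions mentioned below, while the class and helper names here are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Demo of the sidecar-metafile layout: JSON metadata files (.pig_schema,
// .pig_stats) live next to the data files they describe.
public class SidecarLayoutDemo {
    // Directory-level metafile locations (helper names are illustrative).
    static Path schemaFileFor(Path dataDir) { return dataDir.resolve(".pig_schema"); }
    static Path statsFileFor(Path dataDir)  { return dataDir.resolve(".pig_stats"); }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("pig-demo");
        // A data file and, beside it, a minimal JSON schema blob. The exact
        // JSON shape Pig writes is not reproduced here; this only shows the
        // "metadata next to the data" convention.
        Files.writeString(dir.resolve("part-00000"), "alice\t1\nbob\t2\n");
        Files.writeString(schemaFileFor(dir),
            "{\"fields\":[{\"name\":\"name\",\"type\":55},{\"name\":\"count\",\"type\":10}]}");
        System.out.println(Files.readString(schemaFileFor(dir)).contains("\"fields\""));
    }
}
```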


Constructor Summary
JsonMetadata()
           
 
Method Summary
protected  Set<ElementDescriptor> findMetaFile(String path, String prefix, org.apache.hadoop.conf.Configuration conf)
          Finds the set of metadata files associated with the given path.
 String[] getPartitionKeys(String location, org.apache.hadoop.mapreduce.Job job)
          Find what columns are partition keys for this input.
 ResourceSchema getSchema(String location, org.apache.hadoop.mapreduce.Job job)
          For JsonMetadata the schema is considered optional; this method suppresses (and logs) any errors encountered.
 ResourceStatistics getStatistics(String location, org.apache.hadoop.mapreduce.Job job)
          For JsonMetadata stats are considered optional; this method suppresses (and logs) any errors encountered.
 void setFieldDel(byte fieldDel)
           
 void setPartitionFilter(Expression partitionFilter)
          Set the filter for partitioning.
 void setRecordDel(byte recordDel)
           
 void storeSchema(ResourceSchema schema, String location, org.apache.hadoop.mapreduce.Job job)
          Store the schema of the data being written.
 void storeStatistics(ResourceStatistics stats, String location, org.apache.hadoop.mapreduce.Job job)
          Store statistics about the data being written.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

JsonMetadata

public JsonMetadata()
Method Detail

findMetaFile

protected Set<ElementDescriptor> findMetaFile(String path,
                                              String prefix,
                                              org.apache.hadoop.conf.Configuration conf)
                                       throws IOException
Given a path, which may represent a glob pattern, a directory, or a file, this method finds the set of relevant metadata files on the storage system. The algorithm for finding the metadata file is as follows:

For each file represented by the path (either directly, or via a glob):
1. If parentPath/prefix.fileName exists, use that as the metadata file.
2. Otherwise, if parentPath/prefix exists, use that as the metadata file.

Resolving conflicts, merging the metadata, etc. is not handled by this method and should be taken care of by downstream code. This could move into a util package if metadata files are considered a general enough pattern.

Parameters:
path - Path, as passed in to a LoadFunc (may be a Hadoop glob)
prefix - Metadata file designation, such as .pig_schema or .pig_stats
conf - configuration object
Returns:
Set of element descriptors for all metadata files associated with the files on the path.
Throws:
IOException
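The lookup order above can be sketched as follows. This is a stand-alone illustration using `java.nio.file`, not Pig's real code (which operates on Hadoop `ElementDescriptor`s); the existence check is injected as a predicate so the logic is easy to follow and test:

```java
import java.nio.file.Path;
import java.util.function.Predicate;

// Sketch of the per-file metafile resolution described above:
// prefer parent/prefix.fileName, then fall back to parent/prefix.
public class MetaFileLookup {
    // Returns the metadata file for one data file, or null if neither
    // candidate exists. `exists` stands in for a filesystem check.
    static Path resolveMetaFile(Path dataFile, String prefix, Predicate<Path> exists) {
        Path parent = dataFile.getParent();
        // Candidate 1: a per-file metafile, e.g. .pig_schema.part-00000
        Path perFile = parent.resolve(prefix + "." + dataFile.getFileName());
        if (exists.test(perFile)) return perFile;
        // Candidate 2: a shared, directory-level metafile, e.g. .pig_schema
        Path shared = parent.resolve(prefix);
        if (exists.test(shared)) return shared;
        return null;
    }

    public static void main(String[] args) {
        // Pretend only the directory-level metafile exists.
        Predicate<Path> exists = p -> p.getFileName().toString().equals(".pig_schema");
        System.out.println(resolveMetaFile(Path.of("/data/part-00000"), ".pig_schema", exists));
    }
}
```

As the description notes, conflict resolution and merging across the returned set remain the caller's responsibility.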

getPartitionKeys

public String[] getPartitionKeys(String location,
                                 org.apache.hadoop.mapreduce.Job job)
Description copied from interface: LoadMetadata
Find what columns are partition keys for this input.

Specified by:
getPartitionKeys in interface LoadMetadata
Parameters:
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
Returns:
array of field names of the partition keys. Implementations should return null to indicate that there are no partition keys

setPartitionFilter

public void setPartitionFilter(Expression partitionFilter)
                        throws IOException
Description copied from interface: LoadMetadata
Set the filter for partitioning. It is assumed that this filter will only contain references to fields given as partition keys in getPartitionKeys. So if the implementation returns null in LoadMetadata.getPartitionKeys(String, Job), then this method is not called by Pig runtime. This method is also not called by the Pig runtime if there are no partition filter conditions.

Specified by:
setPartitionFilter in interface LoadMetadata
Parameters:
partitionFilter - that describes filter for partitioning
Throws:
IOException - if the filter is not compatible with the storage mechanism or contains non-partition fields.

getSchema

public ResourceSchema getSchema(String location,
                                org.apache.hadoop.mapreduce.Job job)
                         throws IOException
For JsonMetadata the schema is considered optional; this method suppresses (and logs) any errors encountered.

Specified by:
getSchema in interface LoadMetadata
Parameters:
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
Returns:
schema for the data to be loaded. This schema should represent all tuples of the returned data. If the schema is unknown or it is not possible to return a schema that represents all returned data, then null should be returned. The schema should not be affected by pushProjection, ie. getSchema should always return the original schema even after pushProjection
Throws:
IOException - if an exception occurs while determining the schema
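The "optional metadata" behavior described above can be sketched as the following pattern (illustrative only, not Pig's actual code): attempt the read, and on any failure log and return null instead of propagating, so a missing or corrupt metafile never fails the load:

```java
import java.util.function.Supplier;

// Sketch of the suppress-and-log pattern for optional metadata.
public class OptionalMetadata {
    // Run a metadata reader; treat any failure as "no metadata available".
    static <T> T readOptional(Supplier<T> reader) {
        try {
            return reader.get();
        } catch (RuntimeException e) {
            System.err.println("Could not read metadata, continuing without it: " + e);
            return null; // callers treat null as "unknown"
        }
    }

    public static void main(String[] args) {
        String schema = readOptional(() -> { throw new RuntimeException("corrupt metafile"); });
        System.out.println(schema); // null
    }
}
```

Returning null matches the contract above: null means the schema is unknown, and Pig proceeds without it.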

getStatistics

public ResourceStatistics getStatistics(String location,
                                        org.apache.hadoop.mapreduce.Job job)
                                 throws IOException
For JsonMetadata stats are considered optional; this method suppresses (and logs) any errors encountered.

Specified by:
getStatistics in interface LoadMetadata
Parameters:
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
Returns:
statistics about the data to be loaded. If no statistics are available, then null should be returned.
Throws:
IOException - if an exception occurs while retrieving statistics
See Also:
org.apache.pig.LoadMetadata#getStatistics(String, Job)

storeStatistics

public void storeStatistics(ResourceStatistics stats,
                            String location,
                            org.apache.hadoop.mapreduce.Job job)
                     throws IOException
Description copied from interface: StoreMetadata
Store statistics about the data being written.

Specified by:
storeStatistics in interface StoreMetadata
Parameters:
stats - statistics to be recorded
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
Throws:
IOException

storeSchema

public void storeSchema(ResourceSchema schema,
                        String location,
                        org.apache.hadoop.mapreduce.Job job)
                 throws IOException
Description copied from interface: StoreMetadata
Store the schema of the data being written.

Specified by:
storeSchema in interface StoreMetadata
Parameters:
schema - Schema to be recorded
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
Throws:
IOException

setFieldDel

public void setFieldDel(byte fieldDel)

setRecordDel

public void setRecordDel(byte recordDel)


Copyright © The Apache Software Foundation