org.apache.pig.piggybank.storage
Class IndexedStorage

java.lang.Object
  extended by org.apache.pig.LoadFunc
      extended by org.apache.pig.FileInputLoadFunc
          extended by org.apache.pig.builtin.PigStorage
              extended by org.apache.pig.piggybank.storage.IndexedStorage
All Implemented Interfaces:
IndexableLoadFunc, LoadMetadata, LoadPushDown, OrderedLoadFunc, OverwritableStoreFunc, StoreFuncInterface, StoreMetadata

public class IndexedStorage
extends PigStorage
implements IndexableLoadFunc

IndexedStorage is a form of PigStorage that supports per-record seeks. IndexedStorage creates a separate (hidden) index file for every data file that is written. The format of the index file is:

 | Header     |
 | Index Body |
 | Footer     |
 
The Header contains the list of record indices (field numbers) that represent index keys. The Index Body contains a Tuple for each record in the data. The fields of the Tuple are:

The Footer contains sequentially:

IndexedStorage implements IndexableLoadFunc and can be used as the 'right table' in a Pig 'merge' or 'merge-sparse' join. IndexedStorage does not require the data to be globally partitioned and sorted by index keys; each partition (with its own separate index) must be locally sorted. Note also that IndexedStorage is a loader intended to demonstrate the "merge-sparse" join.
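The difference between locally and globally sorted data can be illustrated with a minimal sketch. It is not taken from the IndexedStorage source: the class LocalSortSketch and its methods are hypothetical, and plain integers stand in for index key Tuples.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Pig or piggybank.
final class LocalSortSketch {
    // Returns true if the keys never decrease from one record to the next.
    static boolean isSorted(List<Integer> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (keys.get(i - 1) > keys.get(i)) {
                return false;
            }
        }
        return true;
    }

    // Returns true if every partition is sorted on its own.
    static boolean allLocallySorted(List<List<Integer>> partitions) {
        for (List<Integer> partition : partitions) {
            if (!isSorted(partition)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Two partitions (assumed: one data file and one index each).
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 4, 9),
                Arrays.asList(2, 3, 7));
        // Each partition is sorted by itself.
        System.out.println(allLocallySorted(partitions)); // prints true
        // Read back to back, the same records are not globally sorted.
        System.out.println(isSorted(Arrays.asList(1, 4, 9, 2, 3, 7))); // prints false
    }
}
```

In this sketch the key ranges of the two partitions overlap, which is the situation the description above permits.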


Nested Class Summary
static class IndexedStorage.IndexedStorageInputFormat
          Internal InputFormat class
static class IndexedStorage.IndexedStorageOutputFormat
          Internal OutputFormat class
static class IndexedStorage.IndexManager
          IndexManager manages the index file (both writing and reading). It keeps track of the last index read during reading.
 
Nested classes/interfaces inherited from interface org.apache.pig.LoadPushDown
LoadPushDown.OperatorSet, LoadPushDown.RequiredField, LoadPushDown.RequiredFieldList, LoadPushDown.RequiredFieldResponse
 
Field Summary
protected  int currentReaderIndexStart
          Index into the list of readers to the current reader.
protected  byte fieldDelimiter
          Delimiter to use between fields
protected  int[] offsetsToIndexKeys
          Offsets to index keys in tuple
protected  Comparator<IndexedStorage.IndexedStorageInputFormat.IndexedStorageRecordReader> readerComparator
          Comparator used to compare key tuples.
protected  IndexedStorage.IndexedStorageInputFormat.IndexedStorageRecordReader[] readers
          List of record readers.
 
Fields inherited from class org.apache.pig.builtin.PigStorage
caster, in, mLog, mRequiredColumns, schema, signature, writer
 
Constructor Summary
IndexedStorage(String delimiter, String offsetsToIndexKeys)
          Constructs a Pig Storer that uses the specified regex as a field delimiter.
 
Method Summary
 void close()
          A method called by the Pig runtime to give an opportunity for implementations to perform cleanup actions like closing the underlying input stream.
 org.apache.hadoop.mapreduce.InputFormat getInputFormat()
          This will be called during planning on the front end.
 Tuple getNext()
          Retrieves the next tuple to be processed.
 org.apache.hadoop.mapreduce.OutputFormat getOutputFormat()
          Return the OutputFormat associated with StoreFuncInterface.
 void initialize(org.apache.hadoop.conf.Configuration conf)
          IndexableLoadFunc interface implementation
 void seekNear(Tuple keys)
          This method is called by the Pig runtime to indicate to the LoadFunc to position its underlying input stream near the keys supplied as the argument.
 
Methods inherited from class org.apache.pig.builtin.PigStorage
checkSchema, cleanupOnFailure, cleanupOnSuccess, cleanupOutput, equals, equals, getFeatures, getPartitionKeys, getSchema, getStatistics, hashCode, prepareToRead, prepareToWrite, pushProjection, putNext, readField, relToAbsPathForStoreLocation, setLocation, setPartitionFilter, setStoreFuncUDFContextSignature, setStoreLocation, setUDFContextSignature, shouldOverwrite, storeSchema, storeStatistics
 
Methods inherited from class org.apache.pig.FileInputLoadFunc
getSplitComparable
 
Methods inherited from class org.apache.pig.LoadFunc
getAbsolutePath, getLoadCaster, getPathStrings, join, relativeToAbsolutePath, warn
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

readers

protected IndexedStorage.IndexedStorageInputFormat.IndexedStorageRecordReader[] readers
List of record readers.


currentReaderIndexStart

protected int currentReaderIndexStart
Index into the list of readers to the current reader. Readers before this index have been fully scanned for keys.


fieldDelimiter

protected byte fieldDelimiter
Delimiter to use between fields


offsetsToIndexKeys

protected final int[] offsetsToIndexKeys
Offsets to index keys in tuple


readerComparator

protected Comparator<IndexedStorage.IndexedStorageInputFormat.IndexedStorageRecordReader> readerComparator
Comparator used to compare key tuples.

Constructor Detail

IndexedStorage

public IndexedStorage(String delimiter,
                      String offsetsToIndexKeys)
Constructs a Pig Storer that uses the specified regex as a field delimiter.

Parameters:
delimiter - field delimiter to use
offsetsToIndexKeys - comma-separated list of offsets into the Tuple for the index keys
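
As an illustration of the offsetsToIndexKeys argument, the following minimal sketch turns a comma-separated string into an int[] of the kind held in the offsetsToIndexKeys field. The class OffsetParserSketch is hypothetical, and the trimming of whitespace is an assumption; the actual constructor may parse its argument differently.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not the IndexedStorage constructor.
final class OffsetParserSketch {
    // Splits a comma-separated list such as "0,2" into integer offsets.
    // Trimming whitespace around each entry is an assumption of this sketch.
    static int[] parseOffsets(String offsets) {
        String[] parts = offsets.split(",");
        int[] result = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            result[i] = Integer.parseInt(parts[i].trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseOffsets("0, 2"))); // prints [0, 2]
    }
}
```

With this sketch, "0,2" would mark the first and third fields of each Tuple as index keys.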
Method Detail

getOutputFormat

public org.apache.hadoop.mapreduce.OutputFormat getOutputFormat()
Description copied from interface: StoreFuncInterface
Return the OutputFormat associated with StoreFuncInterface. This will be called on the front end during planning and on the backend during execution.

Specified by:
getOutputFormat in interface StoreFuncInterface
Overrides:
getOutputFormat in class PigStorage
Returns:
the OutputFormat associated with StoreFuncInterface

getInputFormat

public org.apache.hadoop.mapreduce.InputFormat getInputFormat()
Description copied from class: LoadFunc
This will be called during planning on the front end. This is the instance of InputFormat (rather than the class name) because the load function may need to instantiate the InputFormat in order to control how it is constructed.

Overrides:
getInputFormat in class PigStorage
Returns:
the InputFormat associated with this loader.

getNext

public Tuple getNext()
              throws IOException
Description copied from class: LoadFunc
Retrieves the next tuple to be processed. Implementations should NOT reuse tuple objects (or inner member objects) they return across calls and should return a different tuple object in each call.

Overrides:
getNext in class PigStorage
Returns:
the next tuple to be processed or null if there are no more tuples to be processed.
Throws:
IOException - if there is an exception while retrieving the next tuple
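
The rule against reusing tuple objects can be sketched as follows. This is a minimal illustration and not the IndexedStorage implementation: the class FreshTupleSketch is hypothetical, a List<String> stands in for a Pig Tuple, and tab-delimited input lines are assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical reader for illustration only; a List<String> stands in for a Tuple.
final class FreshTupleSketch {
    private final Iterator<String> lines;

    FreshTupleSketch(List<String> lines) {
        this.lines = lines.iterator();
    }

    // Returns a newly allocated list on every call, or null when the input is exhausted.
    List<String> getNext() {
        if (!lines.hasNext()) {
            return null;
        }
        // A fresh ArrayList per record, so earlier results are never overwritten.
        return new ArrayList<>(Arrays.asList(lines.next().split("\t")));
    }

    public static void main(String[] args) {
        FreshTupleSketch reader = new FreshTupleSketch(Arrays.asList("a\tb", "c\td"));
        List<String> first = reader.getNext();
        List<String> second = reader.getNext();
        System.out.println(first != second); // prints true
        System.out.println(first);           // prints [a, b]
    }
}
```

Because each call allocates its own object, a caller that keeps a reference to an earlier result still sees the original field values.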

initialize

public void initialize(org.apache.hadoop.conf.Configuration conf)
                throws IOException
IndexableLoadFunc interface implementation

Specified by:
initialize in interface IndexableLoadFunc
Parameters:
conf - The job configuration object
Throws:
IOException

seekNear

public void seekNear(Tuple keys)
              throws IOException
Description copied from interface: IndexableLoadFunc
This method is called by the Pig runtime to tell the LoadFunc to position its underlying input stream near the keys supplied as the argument. Specifically:
1) If the key(s) are present in the input stream, the loadfunc implementation should position its read position at a record whose key(s) are the biggest key(s) less than the key(s) supplied in the argument, OR at the record with the first occurrence of the key(s) supplied.
2) If the key(s) are absent from the input stream, the implementation should position its read position at a record whose key(s) are the biggest key(s) less than the key(s) supplied, OR at the first record whose key(s) are the smallest key(s) greater than the key(s) supplied.
The description above holds for data in descending order in a similar manner, with "biggest" and "less than" replaced by "smallest" and "greater than", and vice versa.
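
One positioning strategy that satisfies the contract above for ascending data is a lower-bound binary search: it lands on the first occurrence of the key when the key is present, and on the smallest greater key when it is absent. The sketch below is only an illustration of that strategy. The class SeekNearSketch is hypothetical, integers stand in for key Tuples, and nothing here is taken from the IndexedStorage implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; integers stand in for key Tuples.
final class SeekNearSketch {
    // Returns the position of the first key >= the requested key in an
    // ascending list, or sortedKeys.size() if every key is smaller.
    static int seekNear(List<Integer> sortedKeys, int key) {
        int lo = 0;
        int hi = sortedKeys.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedKeys.get(mid) < key) {
                lo = mid + 1; // the answer lies strictly to the right of mid
            } else {
                hi = mid;     // mid may be the answer; keep it in range
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(1, 3, 3, 7);
        System.out.println(seekNear(keys, 3)); // prints 1 (first occurrence of 3)
        System.out.println(seekNear(keys, 5)); // prints 3 (smallest key greater than 5)
        System.out.println(seekNear(keys, 9)); // prints 4 (past the last record)
    }
}
```

The contract also allows stopping at the biggest key less than the one supplied; this sketch simply picks the other permitted position.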

Specified by:
seekNear in interface IndexableLoadFunc
Parameters:
keys - Tuple with join keys (which are a prefix of the sort keys of the input data). For example, if the data is sorted on the columns in positions 2, 4, 5, any of the following Tuples is valid as an argument value:
(fieldAt(2))
(fieldAt(2), fieldAt(4))
(fieldAt(2), fieldAt(4), fieldAt(5))
The following are some invalid cases:
(fieldAt(4))
(fieldAt(2), fieldAt(5))
(fieldAt(4), fieldAt(5))
Throws:
IOException - When the loadFunc is unable to position to the required point in its input stream
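
The prefix rule for the keys parameter can be checked mechanically. The following minimal sketch compares the column positions of a join key against the sort columns. The class KeyPrefixSketch is hypothetical and not part of Pig, and treating an empty key as invalid is an assumption of this sketch.

```java
// Hypothetical helper for illustration only; not part of Pig or piggybank.
final class KeyPrefixSketch {
    // Returns true if joinColumns is a non-empty prefix of sortColumns.
    static boolean isValidJoinKey(int[] sortColumns, int[] joinColumns) {
        // Rejecting an empty key is an assumption made for this sketch.
        if (joinColumns.length == 0 || joinColumns.length > sortColumns.length) {
            return false;
        }
        for (int i = 0; i < joinColumns.length; i++) {
            if (joinColumns[i] != sortColumns[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] sortColumns = {2, 4, 5};
        System.out.println(isValidJoinKey(sortColumns, new int[] {2, 4})); // prints true
        System.out.println(isValidJoinKey(sortColumns, new int[] {2, 5})); // prints false
    }
}
```

A key such as (fieldAt(2), fieldAt(5)) fails because it skips position 4, so records matching it are not contiguous in the sorted data.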

close

public void close()
           throws IOException
Description copied from interface: IndexableLoadFunc
A method called by the Pig runtime to give an opportunity for implementations to perform cleanup actions like closing the underlying input stream. This is necessary since, while performing a join, the Pig runtime may determine that no further join is possible with the remaining records and may indicate to the IndexableLoader to clean up by calling this method.

Specified by:
close in interface IndexableLoadFunc
Throws:
IOException - if the loadfunc is unable to perform its close actions.


Copyright © 2007-2012 The Apache Software Foundation