IndexedStorage (Pig 0.14.0 API)

java.lang.Object
- org.apache.pig.LoadFunc
  - org.apache.pig.FileInputLoadFunc
    - org.apache.pig.builtin.PigStorage
      - org.apache.pig.piggybank.storage.IndexedStorage

All Implemented Interfaces:

IndexableLoadFunc, LoadMetadata, LoadPushDown, OrderedLoadFunc, OverwritableStoreFunc, StoreFuncInterface, StoreMetadata
```
public class IndexedStorage
extends PigStorage
implements IndexableLoadFunc
```
IndexedStorage is a form of PigStorage that supports per-record seeks. IndexedStorage creates a separate (hidden) index file for every data file that is written. The format of the index file is:
```
 | Header     |
 | Index Body |
 | Footer     |
 
```
The Header contains the list of record indices (field numbers) that represent index keys. The Index Body contains a Tuple for each record in the data. The fields of the Tuple are:
- The index key(s) Tuple
- The number of records that share this index key.
- Offset into the data file to read the first matching record.
The Footer contains sequentially:
- The smallest key(s) Tuple in the index.
- The largest key(s) Tuple in the index.
- The offset in bytes to the start of the footer
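
The layout above can be modeled in memory roughly as follows. This is a minimal sketch under stated assumptions, not the piggybank implementation: the names `IndexFileSketch` and `IndexEntry` are hypothetical, each key is simplified to a single `long`, and the on-disk serialization is left out. It shows how a reader could use the footer's smallest and largest keys to reject a lookup early, then use an entry's offset and record count to locate the matching records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of the index layout; not the piggybank classes.
final class IndexFileSketch {

    // One Index Body entry (the key is simplified to a single long).
    static final class IndexEntry {
        final long key;         // index key
        final long recordCount; // number of records that share this key
        final long dataOffset;  // byte offset of the first matching record

        IndexEntry(long key, long recordCount, long dataOffset) {
            this.key = key;
            this.recordCount = recordCount;
            this.dataOffset = dataOffset;
        }
    }

    final int[] keyFieldNumbers;                      // Header: fields that form the key
    final List<IndexEntry> body = new ArrayList<>();  // Index Body, ascending by key

    IndexFileSketch(int... keyFieldNumbers) {
        this.keyFieldNumbers = keyFieldNumbers;
    }

    // Entries must be appended in ascending key order (the data is locally sorted).
    void add(long key, long recordCount, long dataOffset) {
        body.add(new IndexEntry(key, recordCount, dataOffset));
    }

    // Footer values, derived here from the sorted body.
    long smallestKey() { return body.get(0).key; }
    long largestKey()  { return body.get(body.size() - 1).key; }

    // Returns the entry for the key, or null if the key is not indexed.
    IndexEntry find(long key) {
        if (body.isEmpty() || key < smallestKey() || key > largestKey()) {
            return null; // outside the footer's key range: no scan needed
        }
        int lo = 0;
        int hi = body.size() - 1;
        while (lo <= hi) { // binary search over the sorted body
            int mid = (lo + hi) >>> 1;
            long k = body.get(mid).key;
            if (k == key) return body.get(mid);
            if (k < key) lo = mid + 1; else hi = mid - 1;
        }
        return null;
    }

    public static void main(String[] args) {
        IndexFileSketch index = new IndexFileSketch(0);
        index.add(10L, 2L, 0L);
        index.add(20L, 1L, 64L);
        IndexEntry hit = index.find(20L);
        System.out.println("offset=" + hit.dataOffset + " count=" + hit.recordCount);
        System.out.println("key 15 indexed: " + (index.find(15L) != null));
    }
}
```
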
IndexedStorage implements IndexableLoadFunc and can be used as the 'right table' in a Pig 'merge' or 'merge-sparse' join. IndexedStorage does not require the data to be globally partitioned and sorted by index keys; each partition (with its separate index) only needs to be locally sorted. Note also that IndexedStorage is a loader intended to demonstrate the "merge-sparse" join.
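
To illustrate the difference between local and global ordering, here is a minimal sketch. It is an illustration only: `PartitionSketch` and its methods are hypothetical names, keys are simplified to `long` values, and choosing partitions by their smallest and largest key is an assumption about how per-partition key ranges could be used, not a description of the piggybank code. Because partitions are only locally sorted, their key ranges may overlap, so a lookup may have to consult more than one partition.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of locally sorted partitions; not the piggybank code.
final class PartitionSketch {

    // True if the keys of a single partition never decrease.
    static boolean isLocallySorted(long[] keys) {
        for (int i = 1; i < keys.length; i++) {
            if (keys[i - 1] > keys[i]) return false;
        }
        return true;
    }

    // Indices of the partitions whose [smallest, largest] key range could hold the key.
    // Assumes every partition is locally sorted, so the range is [first, last].
    static List<Integer> candidatePartitions(long[][] partitions, long key) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < partitions.length; i++) {
            long[] p = partitions[i];
            if (p.length > 0 && key >= p[0] && key <= p[p.length - 1]) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Ranges of partitions 0 and 1 overlap: the data is not globally sorted.
        long[][] partitions = { {1, 4, 9}, {2, 3, 20}, {30, 40} };
        System.out.println(candidatePartitions(partitions, 3L));
        System.out.println(candidatePartitions(partitions, 35L));
    }
}
```
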

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`IndexedStorage.IndexedStorageInputFormat` Internal InputFormat class
`static class`	`IndexedStorage.IndexedStorageOutputFormat` Internal OutputFormat class
`static class`	`IndexedStorage.IndexManager` `IndexManager` manages the index file (both writing and reading). It keeps track of the last index read during reading.

Nested classes/interfaces inherited from interface org.apache.pig.LoadPushDown
LoadPushDown.OperatorSet, LoadPushDown.RequiredField, LoadPushDown.RequiredFieldList, LoadPushDown.RequiredFieldResponse

Field Summary

Fields
Modifier and Type	Field and Description
`protected int`	`currentReaderIndexStart` Index into the list of readers to the current reader.
`protected byte`	`fieldDelimiter` Delimiter to use between fields
`protected int[]`	`offsetsToIndexKeys` Offsets to index keys in tuple
`protected java.util.Comparator<IndexedStorage.IndexedStorageInputFormat.IndexedStorageRecordReader>`	`readerComparator` Comparator used to compare key tuples.
`protected IndexedStorage.IndexedStorageInputFormat.IndexedStorageRecordReader[]`	`readers` List of record readers.

Fields inherited from class org.apache.pig.builtin.PigStorage
caster, in, mLog, mRequiredColumns, schema, signature, writer

Constructor Summary

Constructors
Constructor and Description
`IndexedStorage(java.lang.String delimiter, java.lang.String offsetsToIndexKeys)` Constructs a Pig Storer that uses the specified regex as a field delimiter.

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`close()` A method called by the Pig runtime to give an opportunity for implementations to perform cleanup actions like closing the underlying input stream.
`org.apache.hadoop.mapreduce.InputFormat`	`getInputFormat()` This will be called during planning on the front end.
`Tuple`	`getNext()` Retrieves the next tuple to be processed.
`org.apache.hadoop.mapreduce.OutputFormat`	`getOutputFormat()` Return the OutputFormat associated with StoreFuncInterface.
`void`	`initialize(org.apache.hadoop.conf.Configuration conf)` IndexableLoadFunc interface implementation
`void`	`seekNear(Tuple keys)` This method is called by the Pig runtime to indicate to the LoadFunc to position its underlying input stream near the keys supplied as the argument.

Methods inherited from class org.apache.pig.builtin.PigStorage
checkSchema, cleanupOnFailure, cleanupOnSuccess, cleanupOutput, equals, equals, getFeatures, getPartitionKeys, getSchema, getStatistics, hashCode, prepareToRead, prepareToWrite, pushProjection, putNext, readField, relToAbsPathForStoreLocation, setLocation, setPartitionFilter, setStoreFuncUDFContextSignature, setStoreLocation, setUDFContextSignature, shouldOverwrite, storeSchema, storeStatistics

Methods inherited from class org.apache.pig.FileInputLoadFunc
getSplitComparable

Methods inherited from class org.apache.pig.LoadFunc
getAbsolutePath, getCacheFiles, getLoadCaster, getPathStrings, getShipFiles, join, relativeToAbsolutePath, warn

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - readers
```
protected IndexedStorage.IndexedStorageInputFormat.IndexedStorageRecordReader[] readers
```
    List of record readers.
  - currentReaderIndexStart
```
protected int currentReaderIndexStart
```
    Index into the list of readers to the current reader. Readers before this index have been fully scanned for keys.
  - fieldDelimiter
```
protected byte fieldDelimiter
```
    Delimiter to use between fields
  - offsetsToIndexKeys
```
protected final int[] offsetsToIndexKeys
```
    Offsets to index keys in tuple
  - readerComparator
```
protected java.util.Comparator<IndexedStorage.IndexedStorageInputFormat.IndexedStorageRecordReader> readerComparator
```
    Comparator used to compare key tuples.
- Constructor Detail
  - IndexedStorage
```
public IndexedStorage(java.lang.String delimiter,
              java.lang.String offsetsToIndexKeys)
```
    Constructs a Pig Storer that uses the specified regex as a field delimiter.
    
    Parameters:
    delimiter - field delimiter to use
    offsetsToIndexKeys - comma-separated list of offsets into the Tuple for the index keys
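
As an illustration of the two constructor arguments, the sketch below turns a comma-separated offsets string into an `int[]` and a one-character delimiter into a `byte`, matching the types of the `offsetsToIndexKeys` and `fieldDelimiter` fields. The class `ConstructorArgsSketch` and both helper methods are hypothetical, and the restriction to a single ASCII delimiter character is an assumption of this sketch; the actual parsing in IndexedStorage may differ.

```java
// Hypothetical helpers showing one way to interpret the constructor arguments.
final class ConstructorArgsSketch {

    // Parses a comma-separated list such as "0,2" into {0, 2}.
    static int[] parseOffsets(String offsetsToIndexKeys) {
        String[] parts = offsetsToIndexKeys.split(",");
        int[] offsets = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            offsets[i] = Integer.parseInt(parts[i].trim());
        }
        return offsets;
    }

    // Assumption of this sketch: the delimiter is exactly one ASCII character.
    static byte parseDelimiter(String delimiter) {
        if (delimiter.length() != 1 || delimiter.charAt(0) > 127) {
            throw new IllegalArgumentException("expected a single ASCII character");
        }
        return (byte) delimiter.charAt(0);
    }

    public static void main(String[] args) {
        int[] offsets = parseOffsets("0,2");
        byte delimiter = parseDelimiter("\t");
        System.out.println(offsets.length + " key field(s), delimiter byte " + delimiter);
    }
}
```
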
- Method Detail
  - getOutputFormat
```
public org.apache.hadoop.mapreduce.OutputFormat getOutputFormat()
```
    Description copied from interface: StoreFuncInterface
    
    Return the OutputFormat associated with StoreFuncInterface. This will be called on the front end during planning and on the backend during execution.
    
    Specified by:
    
    getOutputFormat in interface StoreFuncInterface
    
    Overrides:
    
    getOutputFormat in class PigStorage
    
    Returns:
    the OutputFormat associated with StoreFuncInterface
  - getInputFormat
```
public org.apache.hadoop.mapreduce.InputFormat getInputFormat()
```
    Description copied from class: LoadFunc
    
    This will be called during planning on the front end. This is the instance of InputFormat (rather than the class name) because the load function may need to instantiate the InputFormat in order to control how it is constructed.
    
    Overrides:
    
    getInputFormat in class PigStorage
    
    Returns:
    the InputFormat associated with this loader.
  - getNext
```
public Tuple getNext()
              throws java.io.IOException
```
    Description copied from class: LoadFunc
    
    Retrieves the next tuple to be processed. Implementations should NOT reuse tuple objects (or inner member objects) they return across calls and should return a different tuple object in each call.
    
    Overrides:
    
    getNext in class PigStorage
    
    Returns:
    the next tuple to be processed or null if there are no more tuples to be processed.
    
    Throws:
    
    java.io.IOException - if there is an exception while retrieving the next tuple
  - initialize
```
public void initialize(org.apache.hadoop.conf.Configuration conf)
                throws java.io.IOException
```
    IndexableLoadFunc interface implementation
    
    Specified by:
    
    initialize in interface IndexableLoadFunc
    
    Parameters:
    conf - The job configuration object
    
    Throws:
    
    java.io.IOException
  - seekNear
```
public void seekNear(Tuple keys)
              throws java.io.IOException
```
    Description copied from interface: IndexableLoadFunc
    
    This method is called by the Pig runtime to tell the LoadFunc to position its underlying input stream near the keys supplied as the argument. Specifically:
    - If the key(s) are present in the input stream, the loadfunc implementation should position its read position at a record where the key(s) is/are the biggest key(s) less than the key(s) supplied in the argument, OR at the record with the first occurrence of the key(s) supplied.
    - If the key(s) are absent in the input stream, the implementation should position its read position at a record where the key(s) is/are the biggest key(s) less than the key(s) supplied, OR at the first record where the key(s) is/are the smallest key(s) greater than the key(s) supplied.
    
    The description above holds for descending order data in a similar manner, with "biggest" and "less than" replaced with "smallest" and "greater than" and vice versa.
    
    Specified by:
    
    seekNear in interface IndexableLoadFunc
    
    Parameters:
    keys - Tuple with join keys (which are a prefix of the sort keys of the input data). For example, if the data is sorted on the columns in positions 2, 4, 5, any of the following Tuples is valid as an argument value:
    - (fieldAt(2))
    - (fieldAt(2), fieldAt(4))
    - (fieldAt(2), fieldAt(4), fieldAt(5))
    
    The following are some invalid cases:
    - (fieldAt(4))
    - (fieldAt(2), fieldAt(5))
    - (fieldAt(4), fieldAt(5))
    
    Throws:
    
    java.io.IOException - When the loadFunc is unable to position to the required point in its input stream
  - close
```
public void close()
           throws java.io.IOException
```
    Description copied from interface: IndexableLoadFunc
    
    A method called by the Pig runtime to give implementations an opportunity to perform cleanup actions such as closing the underlying input stream. This is necessary because, while performing a join, the Pig runtime may determine that no further join is possible with the remaining records and may tell the IndexableLoader to clean up by calling this method.
    
    Specified by:
    
    close in interface IndexableLoadFunc
    
    Throws:
    
    java.io.IOException - if the loadfunc is unable to perform its close actions.
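
The `seekNear` contract described above, for ascending data, can be sketched as a lower-bound search. This is a minimal sketch under assumptions, not the IndexedStorage implementation: `SeekNearSketch` is a hypothetical class, key tuples are simplified to `long[]` rows, and the rows are held in memory rather than read through an index file. It picks one of the positions the contract allows: the first occurrence of the key(s) if present, otherwise the first record with the smallest key(s) greater than the key(s) supplied. The probe may be any prefix of the sort keys.

```java
// Hypothetical illustration of the seekNear contract on in-memory, ascending keys.
final class SeekNearSketch {

    // Compares only the first probe.length fields of the row against the probe.
    static int comparePrefix(long[] row, long[] probe) {
        for (int i = 0; i < probe.length; i++) {
            int c = Long.compare(row[i], probe[i]);
            if (c != 0) return c;
        }
        return 0;
    }

    // Returns the index of the first row whose key prefix is >= probe,
    // or sortedKeys.length if every row is smaller than the probe.
    static int seekNear(long[][] sortedKeys, long[] probe) {
        int lo = 0;
        int hi = sortedKeys.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (comparePrefix(sortedKeys[mid], probe) < 0) {
                lo = mid + 1; // row is smaller than the probe: look to the right
            } else {
                hi = mid;     // row is >= probe: it may be the first such row
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        long[][] keys = { {1, 1}, {2, 5}, {2, 7}, {4, 0} };
        System.out.println(seekNear(keys, new long[] {2}));    // prefix probe, key present
        System.out.println(seekNear(keys, new long[] {3}));    // key absent
    }
}
```
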
