public class IndexedStorage extends PigStorage implements IndexableLoadFunc
IndexedStorage
is a form of PigStorage
that supports a
per record seek. IndexedStorage
creates a separate (hidden) index file for
every data file that is written. The format of the index file is:
| Header |
| Index Body |
| Footer |
The Header contains the list of record indices (field numbers) that represent index keys. The Index Body contains a
Tuple
for each record in the data.
The fields of the Tuple
are:
Tuple
Tuple
in the index. Tuple
in the index.
IndexedStorage implements IndexableLoadFunc and can be used as the 'right table' in a Pig 'merge' or 'merge-sparse' join.
IndexedStorage does not require the data to be globally partitioned and sorted by index keys. Each partition (with its separate index) must be locally sorted.
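The per-record seek described above can be pictured with a small in-memory sketch. This is not the IndexedStorage index file format or its implementation: the class name `SeekIndexSketch`, the use of `String` keys, and the key-to-byte-offset mapping are all assumptions made for illustration. It only shows how a locally sorted index lets a reader jump to the first record at or after a requested key.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a sorted in-memory index from key to byte offset.
// Hypothetical names; not the IndexedStorage implementation.
class SeekIndexSketch {
    // Sorted map: index key -> offset of the first record carrying that key.
    private final TreeMap<String, Long> offsets = new TreeMap<>();

    // Remember only the offset of the first record seen for each key.
    void add(String key, long offset) {
        offsets.putIfAbsent(key, offset);
    }

    // Offset of the first record whose key is >= the requested key,
    // or -1 when every indexed key is smaller than the request.
    long seekNear(String key) {
        Map.Entry<String, Long> entry = offsets.ceilingEntry(key);
        return entry == null ? -1L : entry.getValue();
    }

    public static void main(String[] args) {
        SeekIndexSketch index = new SeekIndexSketch();
        index.add("apple", 0L);
        index.add("cherry", 42L);
        index.add("cherry", 57L); // same key again: the first offset is kept
        System.out.println(index.seekNear("banana")); // prints 42
        System.out.println(index.seekNear("zebra"));  // prints -1
    }
}
```

Because the lookup only needs the keys within one index to be ordered, a sketch like this works per partition, which matches the requirement that each partition be locally sorted.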
Also note that IndexedStorage is a loader intended to demonstrate the "merge-sparse" join.
Modifier and Type | Class and Description |
---|---|
static class |
IndexedStorage.IndexedStorageInputFormat
Internal InputFormat class
|
static class |
IndexedStorage.IndexedStorageOutputFormat
Internal OutputFormat class
|
static class |
IndexedStorage.IndexManager
IndexManager manages the index file (both writing and reading).
It keeps track of the last index read during reading. |
LoadPushDown.OperatorSet, LoadPushDown.RequiredField, LoadPushDown.RequiredFieldList, LoadPushDown.RequiredFieldResponse
Modifier and Type | Field and Description |
---|---|
protected int |
currentReaderIndexStart
Index into the list of readers to the current reader.
|
protected byte |
fieldDelimiter
Delimiter to use between fields
|
protected int[] |
offsetsToIndexKeys
Offsets to index keys in tuple
|
protected java.util.Comparator<IndexedStorage.IndexedStorageInputFormat.IndexedStorageRecordReader> |
readerComparator
Comparator used to compare key tuples.
|
protected IndexedStorage.IndexedStorageInputFormat.IndexedStorageRecordReader[] |
readers
List of record readers.
|
caster, in, mLog, mRequiredColumns, schema, signature, writer
Constructor and Description |
---|
IndexedStorage(java.lang.String delimiter,
java.lang.String offsetsToIndexKeys)
Constructs a Pig Storer that uses the specified regex as a field delimiter.
|
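The constructor receives the index key offsets as a single string, which the parameter description defines as a comma-separated list of offsets into the Tuple. The following is a minimal sketch of how such a string might be converted to an `int[]` like the `offsetsToIndexKeys` field. The class `OffsetParsingSketch` and the method `parseOffsets` are hypothetical names, and tolerating whitespace around each number is an assumption, not documented behaviour.

```java
import java.util.Arrays;

// Illustrative only: hypothetical helper, not part of IndexedStorage.
class OffsetParsingSketch {
    // Turns a comma-separated list such as "0, 2,5" into {0, 2, 5}.
    // Throws NumberFormatException if an entry is not an integer.
    static int[] parseOffsets(String offsetsToIndexKeys) {
        String[] parts = offsetsToIndexKeys.split(",");
        int[] offsets = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            offsets[i] = Integer.parseInt(parts[i].trim());
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseOffsets("0, 2,5"))); // prints [0, 2, 5]
    }
}
```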
Modifier and Type | Method and Description |
---|---|
void |
close()
A method called by the Pig runtime to give an opportunity
for implementations to perform cleanup actions like closing
the underlying input stream.
|
org.apache.hadoop.mapreduce.InputFormat |
getInputFormat()
This will be called during planning on the front end.
|
Tuple |
getNext()
Retrieves the next tuple to be processed.
|
org.apache.hadoop.mapreduce.OutputFormat |
getOutputFormat()
Return the OutputFormat associated with StoreFuncInterface.
|
void |
initialize(org.apache.hadoop.conf.Configuration conf)
IndexableLoadFunc interface implementation
|
void |
seekNear(Tuple keys)
This method is called by the Pig runtime to indicate
to the LoadFunc to position its underlying input stream
near the keys supplied as the argument.
|
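According to the seekNear method detail, the keys passed in must be a prefix of the sort keys of the input data. The sketch below expresses that rule over column positions. `KeyPrefixSketch` and `isValidPrefix` are hypothetical names used only for illustration, and rejecting an empty key list is an assumption, since the documentation does not address that case.

```java
// Illustrative only: hypothetical check, not part of the Pig API.
class KeyPrefixSketch {
    // True when keyPositions is a non-empty leading prefix of sortPositions.
    static boolean isValidPrefix(int[] sortPositions, int[] keyPositions) {
        if (keyPositions.length == 0 || keyPositions.length > sortPositions.length) {
            return false;
        }
        for (int i = 0; i < keyPositions.length; i++) {
            if (keyPositions[i] != sortPositions[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] sortPositions = {2, 4, 5}; // data sorted on columns 2, 4, 5
        System.out.println(isValidPrefix(sortPositions, new int[] {2, 4})); // prints true
        System.out.println(isValidPrefix(sortPositions, new int[] {2, 5})); // prints false
    }
}
```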
checkSchema, cleanupOnFailure, cleanupOnSuccess, cleanupOutput, equals, equals, getFeatures, getPartitionKeys, getSchema, getStatistics, hashCode, prepareToRead, prepareToWrite, pushProjection, putNext, readField, relToAbsPathForStoreLocation, setLocation, setPartitionFilter, setStoreFuncUDFContextSignature, setStoreLocation, setUDFContextSignature, shouldOverwrite, storeSchema, storeStatistics
getSplitComparable
getAbsolutePath, getCacheFiles, getLoadCaster, getPathStrings, getShipFiles, join, relativeToAbsolutePath, warn
protected IndexedStorage.IndexedStorageInputFormat.IndexedStorageRecordReader[] readers
protected int currentReaderIndexStart
protected byte fieldDelimiter
protected final int[] offsetsToIndexKeys
protected java.util.Comparator<IndexedStorage.IndexedStorageInputFormat.IndexedStorageRecordReader> readerComparator
public IndexedStorage(java.lang.String delimiter, java.lang.String offsetsToIndexKeys)
delimiter
- field delimiter to use
offsetsToIndexKeys
- list of offsets into the Tuple for index keys (comma separated)
public org.apache.hadoop.mapreduce.OutputFormat getOutputFormat()
StoreFuncInterface
getOutputFormat
in interface StoreFuncInterface
getOutputFormat
in class PigStorage
OutputFormat
associated with StoreFuncInterface
public org.apache.hadoop.mapreduce.InputFormat getInputFormat()
LoadFunc
getInputFormat
in class PigStorage
public Tuple getNext() throws java.io.IOException
LoadFunc
getNext
in class PigStorage
java.io.IOException
- if there is an exception while retrieving the next
tuple
public void initialize(org.apache.hadoop.conf.Configuration conf) throws java.io.IOException
initialize
in interface IndexableLoadFunc
conf
- The job configuration object
java.io.IOException
public void seekNear(Tuple keys) throws java.io.IOException
IndexableLoadFunc
seekNear
in interface IndexableLoadFunc
keys
- Tuple with join keys (which are a prefix of the sort
keys of the input data). For example if the data is sorted on
columns in position 2,4,5 any of the following Tuples are
valid as an argument value:
(fieldAt(2))
(fieldAt(2), fieldAt(4))
(fieldAt(2), fieldAt(4), fieldAt(5))
The following are some invalid cases:
(fieldAt(4))
(fieldAt(2), fieldAt(5))
(fieldAt(4), fieldAt(5))
java.io.IOException
- When the loadFunc is unable to position
to the required point in its input stream
public void close() throws java.io.IOException
IndexableLoadFunc
close
in interface IndexableLoadFunc
java.io.IOException
- if the loadfunc is unable to perform
its close actions.
Copyright © 2007-2012 The Apache Software Foundation