org.apache.pig.piggybank.storage.hiverc
Class HiveRCInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<K,V>
      extended by org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,BytesRefArrayWritable>
          extended by org.apache.pig.piggybank.storage.hiverc.HiveRCInputFormat

public class HiveRCInputFormat
extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,BytesRefArrayWritable>

HiveRCInputFormat is used by HiveColumnarLoader as its InputFormat.

Reasons for implementing a new InputFormat subclass:

- the minimum input split size must never be smaller than RCFile.SYNC_INTERVAL (see getFormatMinSplitSize()), and
- optional date-range filtering can be applied to the input paths (see listStatus(JobContext)).
Constructor Summary
HiveRCInputFormat()
          No date partitioning is applied.
HiveRCInputFormat(String dateRange)
          Date partitioning will be applied to the input path.
The path must be partitioned as input-path/daydate=yyyy-MM-dd.
 
Method Summary
 org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,BytesRefArrayWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext ctx)
          Initialises an instance of HiveRCRecordReader.
protected  long getFormatMinSplitSize()
          The input split size should never be smaller than RCFile.SYNC_INTERVAL.
protected  List<org.apache.hadoop.fs.FileStatus> listStatus(org.apache.hadoop.mapreduce.JobContext ctx)
          This method is called by the FileInputFormat to find the input paths for which splits should be calculated.
If applyDateRanges is true, the HiveRCDateSplitter is used to filter the input files by date range; otherwise the default FileInputFormat listStatus method is used.
 
Methods inherited from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
addInputPath, addInputPaths, computeSplitSize, getBlockIndex, getInputPathFilter, getInputPaths, getMaxSplitSize, getMinSplitSize, getSplits, isSplitable, setInputPathFilter, setInputPaths, setInputPaths, setMaxInputSplitSize, setMinInputSplitSize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HiveRCInputFormat

public HiveRCInputFormat()
No date partitioning is applied.


HiveRCInputFormat

public HiveRCInputFormat(String dateRange)
Date partitioning will be applied to the input path.
The path must be partitioned as input-path/daydate=yyyy-MM-dd.

Parameters:
dateRange - Must have the format yyyy-MM-dd:yyyy-MM-dd, where the left-most date is the start of the range.
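
The expected shape of the dateRange string and the daydate=yyyy-MM-dd partition convention can be illustrated with a small self-contained sketch. The class and method names below are hypothetical helpers for illustration only, not part of HiveRCInputFormat's API; the actual filtering is performed internally via HiveRCDateSplitter.

```java
import java.time.LocalDate;

// Illustrative sketch only: how a "yyyy-MM-dd:yyyy-MM-dd" range string
// could be parsed, and how a daydate=yyyy-MM-dd partition directory
// could be tested against it. Hypothetical helpers, not the real API.
class DateRangeSketch {

    // Split the range on ':'; the left-most date is the start of the range.
    static LocalDate[] parseRange(String dateRange) {
        String[] parts = dateRange.split(":");
        return new LocalDate[] { LocalDate.parse(parts[0]), LocalDate.parse(parts[1]) };
    }

    // True if the daydate=yyyy-MM-dd partition falls inside the range (inclusive).
    static boolean inRange(String partitionDir, String dateRange) {
        LocalDate day = LocalDate.parse(partitionDir.substring("daydate=".length()));
        LocalDate[] range = parseRange(dateRange);
        return !day.isBefore(range[0]) && !day.isAfter(range[1]);
    }
}
```

For example, with the range "2010-02-01:2010-02-28", the partition directory daydate=2010-02-15 would be accepted and daydate=2010-03-01 would be skipped.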
Method Detail

createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.LongWritable,BytesRefArrayWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                                                                                                                            org.apache.hadoop.mapreduce.TaskAttemptContext ctx)
                                                                                                                     throws IOException,
                                                                                                                            InterruptedException
Initialises an instance of HiveRCRecordReader.

Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.LongWritable,BytesRefArrayWritable>
Throws:
IOException
InterruptedException

listStatus

protected List<org.apache.hadoop.fs.FileStatus> listStatus(org.apache.hadoop.mapreduce.JobContext ctx)
                                                    throws IOException
This method is called by the FileInputFormat to find the input paths for which splits should be calculated.
If applyDateRanges is true, the HiveRCDateSplitter is used to filter the input files by date range; otherwise the default FileInputFormat listStatus method is used.

Overrides:
listStatus in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,BytesRefArrayWritable>
Throws:
IOException

getFormatMinSplitSize

protected long getFormatMinSplitSize()
The input split size should never be smaller than RCFile.SYNC_INTERVAL.

Overrides:
getFormatMinSplitSize in class org.apache.hadoop.mapreduce.lib.input.FileInputFormat<org.apache.hadoop.io.LongWritable,BytesRefArrayWritable>
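
The effect of this override can be sketched in isolation. RCFile writes a sync marker roughly every SYNC_INTERVAL bytes, so a split smaller than that interval might contain no sync point for the reader to seek to. The constant below assumes the value used by Hive's RCFile implementation (100 * (4 + 16) bytes); the class and method names are hypothetical, for illustration only.

```java
// Illustrative sketch: FileInputFormat uses the format's minimum split
// size as a floor when computing splits, so returning SYNC_INTERVAL from
// getFormatMinSplitSize() guarantees no split is smaller than the
// RCFile sync interval. Constant value is an assumption mirroring
// RCFile.SYNC_INTERVAL; names here are not part of the real API.
class SplitSizeSketch {
    static final long SYNC_INTERVAL = 100 * (4 + 16); // assumed RCFile.SYNC_INTERVAL = 2000 bytes

    // The effective minimum split size is the larger of the format's
    // floor and the user-configured minimum.
    static long effectiveMinSplit(long configuredMin) {
        return Math.max(SYNC_INTERVAL, configuredMin);
    }
}
```

A user-configured minimum of 1 byte would thus be raised to the sync interval, while a larger configured minimum passes through unchanged.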


Copyright © The Apache Software Foundation