org.apache.hadoop.zebra.mapreduce
Class TableInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.BytesWritable,Tuple>
      extended by org.apache.hadoop.zebra.mapreduce.TableInputFormat

public class TableInputFormat
extends org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.BytesWritable,Tuple>

InputFormat class for reading one or more BasicTables. Usage Example:

In the main program, add the following code.

 job.setInputFormatClass(TableInputFormat.class);
 TableInputFormat.setInputPaths(job, new Path("path/to/table1"), new Path("path/to/table2"));
 TableInputFormat.setProjection(job, "Name, Salary, BonusPct");
 
The above code does the following things:

 - Sets TableInputFormat as the job's input format.
 - Sets the paths to the BasicTables to be consumed by the user's Mapper code; splits are produced over the union of these tables.
 - Sets the projection, i.e., the comma-separated list of columns that the Mapper will receive.

The user Mapper code should look like the following:
 static class MyMapClass extends Mapper<BytesWritable, Tuple, K, V> {
   // indices of the projected fields in the input Tuple.
   int idxName, idxSalary, idxBonusPct;

   @Override
   protected void setup(Context context) throws IOException {
     try {
       // getProjection returns the projection string; parse it into a
       // Schema to determine the field indices.
       Schema projection = new Schema(TableInputFormat.getProjection(context));
       idxName = projection.getColumnIndex("Name");
       idxSalary = projection.getColumnIndex("Salary");
       idxBonusPct = projection.getColumnIndex("BonusPct");
     } catch (ParseException e) {
       throw new IOException("Unable to parse the projection", e);
     }
   }

   @Override
   protected void map(BytesWritable key, Tuple value, Context context)
       throws IOException, InterruptedException {
     try {
       String name = (String) value.get(idxName);
       int salary = (Integer) value.get(idxSalary);
       double bonusPct = (Double) value.get(idxBonusPct);
       // do something with the input data
     } catch (ExecException e) {
       throw new IOException("Failed to read a field from the input Tuple", e);
     }
   }
 }
 
A little more explanation of the Pig Tuple objects: a Tuple is an ordered list of Pig datum objects. The permitted Pig datum types can be categorized as Scalar types and Composite types.

Supported Scalar types include seven native Java types: Boolean, Byte, Integer, Long, Float, Double, and String, as well as one Pig class called DataByteArray that represents a typeless byte array.

Supported Composite types include Map (a map from String keys to Pig datum values), Tuple (a nested tuple), and DataBag (a collection of Tuples).
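
For illustration, the following sketch builds and reads a three-field Tuple with Pig's TupleFactory, mirroring the "Name, Salary, BonusPct" projection used above:

 import org.apache.pig.backend.executionengine.ExecException;
 import org.apache.pig.data.Tuple;
 import org.apache.pig.data.TupleFactory;

 public class TupleExample {
   public static void main(String[] args) throws ExecException {
     // Build a three-field tuple: one String, one Integer, one Double.
     Tuple t = TupleFactory.getInstance().newTuple(3);
     t.set(0, "Alice");
     t.set(1, 90000);
     t.set(2, 0.15);

     // Fields are retrieved by position and cast to the expected scalar type.
     String name = (String) t.get(0);
     int salary = (Integer) t.get(1);
     double bonusPct = (Double) t.get(2);
     System.out.println(name + " gets a bonus of " + salary * bonusPct);
   }
 }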


Nested Class Summary
static class TableInputFormat.SplitMode
           
 
Constructor Summary
TableInputFormat()
           
 
Method Summary
 org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.BytesWritable,Tuple> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext taContext)
           
static TableRecordReader createTableRecordReader(org.apache.hadoop.mapreduce.JobContext jobContext, String projection)
          Get a TableRecordReader on a single split
static String getProjection(org.apache.hadoop.mapreduce.JobContext jobContext)
          Get the projection from the JobContext
static Schema getSchema(org.apache.hadoop.mapreduce.JobContext jobContext)
          Get the schema of the input table expression
static org.apache.hadoop.io.WritableComparable<?> getSortedTableSplitComparable(org.apache.hadoop.mapreduce.InputSplit inputSplit)
          Get a comparable object from the given InputSplit object.
static SortInfo getSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext)
          Get the SortInfo object for a Zebra table
 List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext jobContext)
           
static void requireSortedTable(org.apache.hadoop.mapreduce.JobContext jobContext, ZebraSortInfo sortInfo)
          Deprecated.  
static void setInputPaths(org.apache.hadoop.mapreduce.JobContext jobContext, org.apache.hadoop.fs.Path... paths)
          Set the paths to the input table.
static void setMinSplitSize(org.apache.hadoop.mapreduce.JobContext jobContext, long minSize)
          Set the minimum split size.
static void setProjection(org.apache.hadoop.mapreduce.JobContext jobContext, String projection)
          Deprecated. Use setProjection(JobContext, ZebraProjection) instead.
static void setProjection(org.apache.hadoop.mapreduce.JobContext jobContext, ZebraProjection projection)
          Set the input projection in the JobContext object.
static void setSplitMode(org.apache.hadoop.mapreduce.JobContext jobContext, TableInputFormat.SplitMode sm, ZebraSortInfo sortInfo)
           
 void validateInput(org.apache.hadoop.mapreduce.JobContext jobContext)
          Deprecated. 
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TableInputFormat

public TableInputFormat()
Method Detail

setInputPaths

public static void setInputPaths(org.apache.hadoop.mapreduce.JobContext jobContext,
                                 org.apache.hadoop.fs.Path... paths)
Set the paths to the input table.

Parameters:
jobContext - JobContext object.
paths - one or more paths to BasicTables. The InputFormat class will produce splits on the "union" of these BasicTables.

getSchema

public static Schema getSchema(org.apache.hadoop.mapreduce.JobContext jobContext)
                        throws IOException
Get the schema of the input table expression.

Parameters:
jobContext - JobContext object.
Returns:
The Schema of the input table(s).
Throws:
IOException
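
For example, a driver can inspect the table schema before choosing a projection; a minimal sketch, where job stands for the driver's JobContext:

 // Print the full schema of the input table expression.
 Schema schema = TableInputFormat.getSchema(job);
 System.out.println("Input schema: " + schema.toString());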

setProjection

public static void setProjection(org.apache.hadoop.mapreduce.JobContext jobContext,
                                 String projection)
                          throws org.apache.hadoop.zebra.parser.ParseException
Deprecated. Use setProjection(JobContext, ZebraProjection) instead.

Set the input projection in the JobContext object.

Parameters:
jobContext - JobContext object.
projection - A comma-separated list of column names. To select all columns, pass null as the projection. The syntax of the projection conforms to the Schema string.
Throws:
org.apache.hadoop.zebra.parser.ParseException

setProjection

public static void setProjection(org.apache.hadoop.mapreduce.JobContext jobContext,
                                 ZebraProjection projection)
                          throws org.apache.hadoop.zebra.parser.ParseException
Set the input projection in the JobContext object.

Parameters:
jobContext - JobContext object.
projection - A comma-separated list of column names. To select all columns, pass null as the projection. The syntax of the projection conforms to the Schema string.
Throws:
org.apache.hadoop.zebra.parser.ParseException
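
A short sketch of this overload; the createZebraProjection(String) factory method is an assumption, modeled on the other Zebra* helper classes in this package:

 // Assumed factory: ZebraProjection.createZebraProjection(String).
 ZebraProjection projection =
     ZebraProjection.createZebraProjection("Name, Salary, BonusPct");
 TableInputFormat.setProjection(job, projection);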

getProjection

public static String getProjection(org.apache.hadoop.mapreduce.JobContext jobContext)
                            throws IOException,
                                   org.apache.hadoop.zebra.parser.ParseException
Get the projection from the JobContext

Parameters:
jobContext - The JobContext object
Returns:
The projection schema. If the projection has not been defined, or is not known at this time, null is returned. Note that by the time this method is called in Mapper code, the projection must already be known.
Throws:
IOException
org.apache.hadoop.zebra.parser.ParseException

getSortInfo

public static SortInfo getSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext)
                            throws IOException
Get the SortInfo object for a Zebra table

Parameters:
jobContext - JobContext object
Returns:
the Zebra table's SortInfo; null if the table is unsorted.
Throws:
IOException
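
Because the return value is null for unsorted input, a driver can use it as a quick sortedness check; a minimal sketch:

 // A null return means the underlying table is unsorted.
 SortInfo sortInfo = TableInputFormat.getSortInfo(job);
 boolean sorted = (sortInfo != null);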

requireSortedTable

public static void requireSortedTable(org.apache.hadoop.mapreduce.JobContext jobContext,
                                      ZebraSortInfo sortInfo)
                               throws IOException
Deprecated. 

Requires that the input be a sorted table or a sorted table union.

Parameters:
jobContext - JobContext object.
sortInfo - ZebraSortInfo object containing sorting information.
Throws:
IOException

setSplitMode

public static void setSplitMode(org.apache.hadoop.mapreduce.JobContext jobContext,
                                TableInputFormat.SplitMode sm,
                                ZebraSortInfo sortInfo)
                         throws IOException
Parameters:
jobContext - JobContext object.
sm - Split mode: unsorted, globally sorted, or locally sorted. Default is unsorted.
sortInfo - ZebraSortInfo object containing sorting information. Ignored if the split mode is null or unsorted.
Throws:
IOException
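
A sketch of configuring a globally sorted scan; the SplitMode constant name and the ZebraSortInfo factory signature are assumptions, not confirmed by this page:

 // Assumed: a SplitMode.GLOBALLY_SORTED constant and a
 // ZebraSortInfo.createZebraSortInfo(sortColumns, comparatorClass) factory.
 ZebraSortInfo sortInfo = ZebraSortInfo.createZebraSortInfo("Name", null);
 TableInputFormat.setSplitMode(job, TableInputFormat.SplitMode.GLOBALLY_SORTED, sortInfo);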

createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.BytesWritable,Tuple> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                                                                                                             org.apache.hadoop.mapreduce.TaskAttemptContext taContext)
                                                                                                      throws IOException,
                                                                                                             InterruptedException
Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.BytesWritable,Tuple>
Throws:
IOException
InterruptedException
See Also:
InputFormat.createRecordReader(InputSplit, TaskAttemptContext)

createTableRecordReader

public static TableRecordReader createTableRecordReader(org.apache.hadoop.mapreduce.JobContext jobContext,
                                                        String projection)
                                                 throws IOException,
                                                        org.apache.hadoop.zebra.parser.ParseException,
                                                        InterruptedException
Get a TableRecordReader on a single split

Parameters:
jobContext - JobContext object.
projection - A comma-separated list of column names in the projection. null means all columns.
Throws:
IOException
org.apache.hadoop.zebra.parser.ParseException
InterruptedException
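
Since the returned TableRecordReader is a mapreduce RecordReader over (BytesWritable, Tuple) pairs, a single-split scan could look like the following sketch; the column names are placeholders:

 // Scan a single-split table outside of a map task (sketch).
 TableRecordReader reader =
     TableInputFormat.createTableRecordReader(jobContext, "Name, Salary");
 while (reader.nextKeyValue()) {   // standard RecordReader iteration
   BytesWritable key = reader.getCurrentKey();
   Tuple row = reader.getCurrentValue();
   // process the row ...
 }
 reader.close();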

setMinSplitSize

public static void setMinSplitSize(org.apache.hadoop.mapreduce.JobContext jobContext,
                                   long minSize)
Set the minimum split size.

Parameters:
jobContext - JobContext object.
minSize - Minimum split size, in bytes.
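
For example, to keep splits from being smaller than 128 MB:

 TableInputFormat.setMinSplitSize(job, 128L * 1024 * 1024);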

getSplits

public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext jobContext)
                                                       throws IOException
Specified by:
getSplits in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.BytesWritable,Tuple>
Throws:
IOException
See Also:
InputFormat.getSplits(JobContext)

validateInput

@Deprecated
public void validateInput(org.apache.hadoop.mapreduce.JobContext jobContext)
                   throws IOException
Deprecated. 

Throws:
IOException

getSortedTableSplitComparable

public static org.apache.hadoop.io.WritableComparable<?> getSortedTableSplitComparable(org.apache.hadoop.mapreduce.InputSplit inputSplit)
Get a comparable object from the given InputSplit object.

Parameters:
inputSplit - An InputSplit instance. It should be of type SortedTableSplit.
Returns:
a comparable object of type WritableComparable
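
For illustration, a hypothetical helper that orders two sorted-table splits by their comparable keys (raw types are used for brevity):

 @SuppressWarnings({"rawtypes", "unchecked"})
 static int compareSplits(InputSplit a, InputSplit b) {
   WritableComparable ka = TableInputFormat.getSortedTableSplitComparable(a);
   WritableComparable kb = TableInputFormat.getSortedTableSplitComparable(b);
   return ka.compareTo(kb);   // relative order of the two splits
 }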


Copyright © The Apache Software Foundation