org.apache.pig.piggybank.storage.avro
Class AvroStorageUtils

java.lang.Object
  extended by org.apache.pig.piggybank.storage.avro.AvroStorageUtils

public class AvroStorageUtils
extends Object

This is utility class for this package


Field Summary
static org.apache.avro.Schema BooleanSchema
           
static org.apache.avro.Schema BytesSchema
           
static org.apache.avro.Schema DoubleSchema
           
static org.apache.avro.Schema FloatSchema
           
static org.apache.avro.Schema IntSchema
           
static org.apache.avro.Schema LongSchema
           
static org.apache.avro.Schema NullSchema
           
static org.apache.hadoop.fs.PathFilter PATH_FILTER
          ignore hdfs files with prefix "_" and "."
static org.apache.avro.Schema StringSchema
           
 
Constructor Summary
AvroStorageUtils()
           
 
Method Summary
static boolean containsGenericUnion(org.apache.avro.Schema s)
          determine whether the input schema contains generic unions
protected static boolean containsGenericUnion(org.apache.avro.Schema s, Set<org.apache.avro.Schema> visitedRecords)
          Called by containsGenericUnion(Schema) and it recursively checks whether the input schema contains generic unions.
static boolean containsRecursiveRecord(org.apache.avro.Schema s)
          determine whether the input schema contains recursive records
protected static boolean containsRecursiveRecord(org.apache.avro.Schema s, Set<String> definedRecordNames)
          Called by containsRecursiveRecord(Schema) and it recursively checks whether the input schema contains recursive records.
static org.apache.avro.Schema.Field createUDField(int index, org.apache.avro.Schema s)
          create an avro field using the given schema
static org.apache.avro.Schema createUDPartialRecordSchema()
          create an avro field with null schema (it is a space holder)
static org.apache.avro.Schema getAcceptedType(org.apache.avro.Schema in)
          extract schema from a nullable union
static Set<org.apache.hadoop.fs.Path> getAllFilesRecursively(Set<org.apache.hadoop.fs.Path> basePaths, org.apache.hadoop.conf.Configuration conf)
          Returns all non-hidden files recursively inside the base paths given
static org.apache.hadoop.fs.Path getLast(org.apache.hadoop.fs.Path path, org.apache.hadoop.fs.FileSystem fs)
          get last file of a hdfs path if it is a directory; or return the file itself if path is a file
static Set<org.apache.hadoop.fs.Path> getPaths(String pathString, org.apache.hadoop.conf.Configuration conf, boolean failIfNotFound)
          Gets the list of paths from the pathString specified which may contain comma-separated paths and glob style path
static org.apache.avro.Schema getSchema(org.apache.hadoop.fs.Path path, org.apache.hadoop.fs.FileSystem fs)
          This method is called by #getAvroSchema.
static Map<org.apache.hadoop.fs.Path,Map<Integer,Integer>> getSchemaToMergedSchemaMap(org.apache.avro.Schema mergedSchema, Map<org.apache.hadoop.fs.Path,org.apache.avro.Schema> mergedFiles)
          When merging multiple avro record schemas, we build a map (schemaToMergedSchemaMap) to associate each input record with a remapping of its fields relative to the merged schema.
static org.apache.avro.Schema.Field getUDField(org.apache.avro.Schema s, int index)
          get field schema given index number
static boolean isAcceptableUnion(org.apache.avro.Schema in)
          determine whether a union is a nullable union; note that this function doesn't check containing types of the input union recursively.
static boolean isTupleWrapper(ResourceSchema.ResourceFieldSchema pigSchema)
          check whether it is just a wrapped tuple
static boolean isUDPartialRecordSchema(org.apache.avro.Schema s)
          check whether a schema is a space holder (using field name)
static org.apache.avro.Schema mergeSchema(org.apache.avro.Schema x, org.apache.avro.Schema y)
          This method merges two avro schemas into one.
static boolean noDir(org.apache.hadoop.fs.FileStatus[] ss)
          check whether there is NO directory in the input file (status) list
static ResourceSchema.ResourceFieldSchema wrapAsTuple(ResourceSchema.ResourceFieldSchema subFieldSchema)
          wrap a pig schema as tuple
static org.apache.avro.Schema wrapAsUnion(org.apache.avro.Schema schema, boolean nullable)
          Wrap an avro schema as a nullable union if needed.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

BooleanSchema

public static org.apache.avro.Schema BooleanSchema

LongSchema

public static org.apache.avro.Schema LongSchema

FloatSchema

public static org.apache.avro.Schema FloatSchema

DoubleSchema

public static org.apache.avro.Schema DoubleSchema

IntSchema

public static org.apache.avro.Schema IntSchema

StringSchema

public static org.apache.avro.Schema StringSchema

BytesSchema

public static org.apache.avro.Schema BytesSchema

NullSchema

public static org.apache.avro.Schema NullSchema

PATH_FILTER

public static org.apache.hadoop.fs.PathFilter PATH_FILTER
ignore hdfs files with prefix "_" and "."

Constructor Detail

AvroStorageUtils

public AvroStorageUtils()
Method Detail

createUDField

public static org.apache.avro.Schema.Field createUDField(int index,
                                                         org.apache.avro.Schema s)
create an avro field using the given schema


createUDPartialRecordSchema

public static org.apache.avro.Schema createUDPartialRecordSchema()
create an avro field with null schema (it is a space holder)


isUDPartialRecordSchema

public static boolean isUDPartialRecordSchema(org.apache.avro.Schema s)
check whether a schema is a space holder (using field name)


getUDField

public static org.apache.avro.Schema.Field getUDField(org.apache.avro.Schema s,
                                                      int index)
get field schema given index number


getPaths

public static Set<org.apache.hadoop.fs.Path> getPaths(String pathString,
                                                      org.apache.hadoop.conf.Configuration conf,
                                                      boolean failIfNotFound)
                                               throws IOException
Gets the list of paths from the pathString specified which may contain comma-separated paths and glob style path

Throws:
IOException

getAllFilesRecursively

public static Set<org.apache.hadoop.fs.Path> getAllFilesRecursively(Set<org.apache.hadoop.fs.Path> basePaths,
                                                                    org.apache.hadoop.conf.Configuration conf)
                                                             throws IOException
Returns all non-hidden files recursively inside the base paths given

Throws:
IOException

noDir

public static boolean noDir(org.apache.hadoop.fs.FileStatus[] ss)
check whether there is NO directory in the input file (status) list


getLast

public static org.apache.hadoop.fs.Path getLast(org.apache.hadoop.fs.Path path,
                                                org.apache.hadoop.fs.FileSystem fs)
                                         throws IOException
get last file of a hdfs path if it is a directory; or return the file itself if path is a file

Throws:
IOException

mergeSchema

public static org.apache.avro.Schema mergeSchema(org.apache.avro.Schema x,
                                                 org.apache.avro.Schema y)
                                          throws IOException
This method merges two avro schemas into one. Note that not every avro schema can be merged. For complex types to be merged, they must be the same type. For primitive types to be merged, they must meet certain conditions. For schemas that cannot be merged, an exception is thrown.

Parameters:
x - first avro schema to merge
y - second avro schema to merge
Returns:
merged avro schema
Throws:
IOException

getSchemaToMergedSchemaMap

public static Map<org.apache.hadoop.fs.Path,Map<Integer,Integer>> getSchemaToMergedSchemaMap(org.apache.avro.Schema mergedSchema,
                                                                                             Map<org.apache.hadoop.fs.Path,org.apache.avro.Schema> mergedFiles)
                                                                                      throws IOException
When merging multiple avro record schemas, we build a map (schemaToMergedSchemaMap) to associate each input record with a remapping of its fields relative to the merged schema. Take the following two schemas for example: // path1 { "type": "record", "name": "x", "fields": [ { "name": "xField", "type": "string" } ] } // path2 { "type": "record", "name": "y", "fields": [ { "name": "yField", "type": "string" } ] } The merged schema will be something like this: // merged { "type": "record", "name": "merged", "fields": [ { "name": "xField", "type": "string" }, { "name": "yField", "type": "string" } ] } The schemaToMergedSchemaMap will look like this: // schemaToMergedSchemaMap { path1 : { 0 : 0 }, path2 : { 0 : 1 } } The meaning of the map is: - The field at index '0' of 'path1' is moved to index '0' in merged schema. - The field at index '0' of 'path2' is moved to index '1' in merged schema. With this map, we can now remap the field position of the original schema to that of the merged schema. This is necessary because in the backend, we don't use the merged avro schema but embedded avro schemas of input files to load them. Therefore, we must relocate each field from old positions in the original schema to new positions in the merged schema.

Parameters:
mergedSchema - new schema generated from multiple input schemas
mergedFiles - input avro files that are merged
Returns:
schemaToMergedSchemaMap that maps old position of each field in the original schema to new position in the new schema
Throws:
IOException

wrapAsUnion

public static org.apache.avro.Schema wrapAsUnion(org.apache.avro.Schema schema,
                                                 boolean nullable)
Wrap an avro schema as a nullable union if needed. For instance, wrap schema "int" as ["null", "int"]


containsRecursiveRecord

public static boolean containsRecursiveRecord(org.apache.avro.Schema s)
determine whether the input schema contains recursive records


containsRecursiveRecord

protected static boolean containsRecursiveRecord(org.apache.avro.Schema s,
                                                 Set<String> definedRecordNames)
Called by containsRecursiveRecord(Schema) and it recursively checks whether the input schema contains recursive records.


containsGenericUnion

public static boolean containsGenericUnion(org.apache.avro.Schema s)
determine whether the input schema contains generic unions


containsGenericUnion

protected static boolean containsGenericUnion(org.apache.avro.Schema s,
                                              Set<org.apache.avro.Schema> visitedRecords)
Called by containsGenericUnion(Schema) and it recursively checks whether the input schema contains generic unions.


isAcceptableUnion

public static boolean isAcceptableUnion(org.apache.avro.Schema in)
determine whether a union is a nullable union; note that this function doesn't check containing types of the input union recursively.


wrapAsTuple

public static ResourceSchema.ResourceFieldSchema wrapAsTuple(ResourceSchema.ResourceFieldSchema subFieldSchema)
                                                      throws IOException
wrap a pig schema as tuple

Throws:
IOException

isTupleWrapper

public static boolean isTupleWrapper(ResourceSchema.ResourceFieldSchema pigSchema)
check whether it is just a wrapped tuple


getAcceptedType

public static org.apache.avro.Schema getAcceptedType(org.apache.avro.Schema in)
extract schema from a nullable union


getSchema

public static org.apache.avro.Schema getSchema(org.apache.hadoop.fs.Path path,
                                               org.apache.hadoop.fs.FileSystem fs)
                                        throws IOException
This method is called by #getAvroSchema. The default implementation returns the schema of an avro file; or the schema of the last file in a first-level directory (it does not contain sub-directories).

Parameters:
path - path of a file or first level directory
fs - file system
Returns:
avro schema
Throws:
IOException


Copyright © 2007-2012 The Apache Software Foundation