AvroStorageUtils (Pig 0.17.0 API)

java.lang.Object
- org.apache.pig.piggybank.storage.avro.AvroStorageUtils

public class AvroStorageUtils
extends Object

This is utility class for this package

Field Summary

Fields
Modifier and Type	Field and Description
`static org.apache.avro.Schema`	`BooleanSchema`
`static org.apache.avro.Schema`	`BytesSchema`
`static org.apache.avro.Schema`	`DoubleSchema`
`static org.apache.avro.Schema`	`FloatSchema`
`static org.apache.avro.Schema`	`IntSchema`
`static org.apache.avro.Schema`	`LongSchema`
`static org.apache.avro.Schema`	`NullSchema`
`static org.apache.hadoop.fs.PathFilter`	`PATH_FILTER` ignore hdfs files with prefix "_" and "."
`static org.apache.avro.Schema`	`StringSchema`

Constructor Summary

Constructors
Constructor and Description

AvroStorageUtils()

Constructors
Constructor and Description
`AvroStorageUtils()`

Method Summary

All Methods Static Methods Concrete Methods
Modifier and Type	Method and Description
`static boolean`	`containsGenericUnion(org.apache.avro.Schema s)` determine whether the input schema contains generic unions
`protected static boolean`	`containsGenericUnion(org.apache.avro.Schema s, Set<org.apache.avro.Schema> visitedRecords)` Called by `containsGenericUnion(Schema)` and it recursively checks whether the input schema contains generic unions.
`static boolean`	`containsRecursiveRecord(org.apache.avro.Schema s)` determine whether the input schema contains recursive records
`protected static boolean`	`containsRecursiveRecord(org.apache.avro.Schema s, Set<String> definedRecordNames)` Called by `containsRecursiveRecord(Schema)` and it recursively checks whether the input schema contains recursive records.
`static org.apache.avro.Schema.Field`	`createUDField(int index, org.apache.avro.Schema s)` create an avro field using the given schema
`static org.apache.avro.Schema`	`createUDPartialRecordSchema()` create an avro field with null schema (it is a space holder)
`static org.apache.avro.Schema`	`getAcceptedType(org.apache.avro.Schema in)` extract schema from a nullable union
`static Set<org.apache.hadoop.fs.Path>`	`getAllFilesRecursively(Set<org.apache.hadoop.fs.Path> basePaths, org.apache.hadoop.conf.Configuration conf)` Returns all non-hidden files recursively inside the base paths given
`static org.apache.hadoop.fs.Path`	`getLast(org.apache.hadoop.fs.Path path, org.apache.hadoop.fs.FileSystem fs)` get last file of a hdfs path if it is a directory; or return the file itself if path is a file
`static Set<org.apache.hadoop.fs.Path>`	`getPaths(String pathString, org.apache.hadoop.conf.Configuration conf, boolean failIfNotFound)` Gets the list of paths from the pathString specified which may contain comma-separated paths and glob style path
`static org.apache.avro.Schema`	`getSchema(org.apache.hadoop.fs.Path path, org.apache.hadoop.fs.FileSystem fs)` This method is called by `#getAvroSchema`.
`static Map<org.apache.hadoop.fs.Path,Map<Integer,Integer>>`	`getSchemaToMergedSchemaMap(org.apache.avro.Schema mergedSchema, Map<org.apache.hadoop.fs.Path,org.apache.avro.Schema> mergedFiles)` When merging multiple avro record schemas, we build a map (schemaToMergedSchemaMap) to associate each input record with a remapping of its fields relative to the merged schema.
`static org.apache.avro.Schema.Field`	`getUDField(org.apache.avro.Schema s, int index)` get field schema given index number
`static boolean`	`isAcceptableUnion(org.apache.avro.Schema in)` determine whether a union is a nullable union; note that this function doesn't check containing types of the input union recursively.
`static boolean`	`isTupleWrapper(ResourceSchema.ResourceFieldSchema pigSchema)` check whether it is just a wrapped tuple
`static boolean`	`isUDPartialRecordSchema(org.apache.avro.Schema s)` check whether a schema is a space holder (using field name)
`static org.apache.avro.Schema`	`mergeSchema(org.apache.avro.Schema x, org.apache.avro.Schema y)` This method merges two avro schemas into one.
`static boolean`	`noDir(org.apache.hadoop.fs.FileStatus[] ss)` check whether there is NO directory in the input file (status) list
`static ResourceSchema.ResourceFieldSchema`	`wrapAsTuple(ResourceSchema.ResourceFieldSchema subFieldSchema)` wrap a pig schema as tuple
`static org.apache.avro.Schema`	`wrapAsUnion(org.apache.avro.Schema schema, boolean nullable)` Wrap an avro schema as a nullable union if needed.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - BooleanSchema
```
public static org.apache.avro.Schema BooleanSchema
```
  - LongSchema
```
public static org.apache.avro.Schema LongSchema
```
  - FloatSchema
```
public static org.apache.avro.Schema FloatSchema
```
  - DoubleSchema
```
public static org.apache.avro.Schema DoubleSchema
```
  - IntSchema
```
public static org.apache.avro.Schema IntSchema
```
  - StringSchema
```
public static org.apache.avro.Schema StringSchema
```
  - BytesSchema
```
public static org.apache.avro.Schema BytesSchema
```
  - NullSchema
```
public static org.apache.avro.Schema NullSchema
```
  - PATH_FILTER
```
public static org.apache.hadoop.fs.PathFilter PATH_FILTER
```
    ignore hdfs files with prefix "_" and "."
- Constructor Detail
  - AvroStorageUtils
```
public AvroStorageUtils()
```
- Method Detail
  - createUDField
```
public static org.apache.avro.Schema.Field createUDField(int index,
                                                         org.apache.avro.Schema s)
```
    create an avro field using the given schema
  - createUDPartialRecordSchema
```
public static org.apache.avro.Schema createUDPartialRecordSchema()
```
    create an avro field with null schema (it is a space holder)
  - isUDPartialRecordSchema
```
public static boolean isUDPartialRecordSchema(org.apache.avro.Schema s)
```
    check whether a schema is a space holder (using field name)
  - getUDField
```
public static org.apache.avro.Schema.Field getUDField(org.apache.avro.Schema s,
                                                      int index)
```
    get field schema given index number
  - getPaths
```
public static Set<org.apache.hadoop.fs.Path> getPaths(String pathString,
                                                      org.apache.hadoop.conf.Configuration conf,
                                                      boolean failIfNotFound)
                                               throws IOException
```
    Gets the list of paths from the pathString specified which may contain comma-separated paths and glob style path
    
    Throws:
    
    IOException
  - getAllFilesRecursively
```
public static Set<org.apache.hadoop.fs.Path> getAllFilesRecursively(Set<org.apache.hadoop.fs.Path> basePaths,
                                                                    org.apache.hadoop.conf.Configuration conf)
                                                             throws IOException
```
    Returns all non-hidden files recursively inside the base paths given
    
    Throws:
    
    IOException
  - noDir
```
public static boolean noDir(org.apache.hadoop.fs.FileStatus[] ss)
```
    check whether there is NO directory in the input file (status) list
  - getLast
```
public static org.apache.hadoop.fs.Path getLast(org.apache.hadoop.fs.Path path,
                                                org.apache.hadoop.fs.FileSystem fs)
                                         throws IOException
```
    get last file of a hdfs path if it is a directory; or return the file itself if path is a file
    
    Throws:
    
    IOException
  - mergeSchema
```
public static org.apache.avro.Schema mergeSchema(org.apache.avro.Schema x,
                                                 org.apache.avro.Schema y)
                                          throws IOException
```
    This method merges two avro schemas into one. Note that not every avro schema can be merged. For complex types to be merged, they must be the same type. For primitive types to be merged, they must meet certain conditions. For schemas that cannot be merged, an exception is thrown.
    
    Parameters:
    
    x - first avro schema to merge
    
    y - second avro schema to merge
    
    Returns:
    
    merged avro schema
    
    Throws:
    
    IOException
  - getSchemaToMergedSchemaMap
```
public static Map<org.apache.hadoop.fs.Path,Map<Integer,Integer>> getSchemaToMergedSchemaMap(org.apache.avro.Schema mergedSchema,
                                                                                             Map<org.apache.hadoop.fs.Path,org.apache.avro.Schema> mergedFiles)
                                                                                      throws IOException
```
    When merging multiple avro record schemas, we build a map (schemaToMergedSchemaMap) to associate each input record with a remapping of its fields relative to the merged schema. Take the following two schemas for example: // path1 { "type": "record", "name": "x", "fields": [ { "name": "xField", "type": "string" } ] } // path2 { "type": "record", "name": "y", "fields": [ { "name": "yField", "type": "string" } ] } The merged schema will be something like this: // merged { "type": "record", "name": "merged", "fields": [ { "name": "xField", "type": "string" }, { "name": "yField", "type": "string" } ] } The schemaToMergedSchemaMap will look like this: // schemaToMergedSchemaMap { path1 : { 0 : 0 }, path2 : { 0 : 1 } } The meaning of the map is: - The field at index '0' of 'path1' is moved to index '0' in merged schema. - The field at index '0' of 'path2' is moved to index '1' in merged schema. With this map, we can now remap the field position of the original schema to that of the merged schema. This is necessary because in the backend, we don't use the merged avro schema but embedded avro schemas of input files to load them. Therefore, we must relocate each field from old positions in the original schema to new positions in the merged schema.
    
    Parameters:
    
    mergedSchema - new schema generated from multiple input schemas
    
    mergedFiles - input avro files that are merged
    
    Returns:
    
    schemaToMergedSchemaMap that maps old position of each field in the original schema to new position in the new schema
    
    Throws:
    
    IOException
  - wrapAsUnion
```
public static org.apache.avro.Schema wrapAsUnion(org.apache.avro.Schema schema,
                                                 boolean nullable)
```
    Wrap an avro schema as a nullable union if needed. For instance, wrap schema "int" as ["null", "int"]
  - containsRecursiveRecord
```
public static boolean containsRecursiveRecord(org.apache.avro.Schema s)
```
    determine whether the input schema contains recursive records
  - containsRecursiveRecord
```
protected static boolean containsRecursiveRecord(org.apache.avro.Schema s,
                                                 Set<String> definedRecordNames)
```
    Called by containsRecursiveRecord(Schema) and it recursively checks whether the input schema contains recursive records.
  - containsGenericUnion
```
public static boolean containsGenericUnion(org.apache.avro.Schema s)
```
    determine whether the input schema contains generic unions
  - containsGenericUnion
```
protected static boolean containsGenericUnion(org.apache.avro.Schema s,
                                              Set<org.apache.avro.Schema> visitedRecords)
```
    Called by containsGenericUnion(Schema) and it recursively checks whether the input schema contains generic unions.
  - isAcceptableUnion
```
public static boolean isAcceptableUnion(org.apache.avro.Schema in)
```
    determine whether a union is a nullable union; note that this function doesn't check containing types of the input union recursively.
  - wrapAsTuple
```
public static ResourceSchema.ResourceFieldSchema wrapAsTuple(ResourceSchema.ResourceFieldSchema subFieldSchema)
                                                      throws IOException
```
    wrap a pig schema as tuple
    
    Throws:
    
    IOException
  - isTupleWrapper
```
public static boolean isTupleWrapper(ResourceSchema.ResourceFieldSchema pigSchema)
```
    check whether it is just a wrapped tuple
  - getAcceptedType
```
public static org.apache.avro.Schema getAcceptedType(org.apache.avro.Schema in)
```
    extract schema from a nullable union
  - getSchema
```
public static org.apache.avro.Schema getSchema(org.apache.hadoop.fs.Path path,
                                               org.apache.hadoop.fs.FileSystem fs)
                                        throws IOException
```
    This method is called by #getAvroSchema. The default implementation returns the schema of an avro file; or the schema of the last file in a first-level directory (it does not contain sub-directories).
    
    Parameters:
    
    path - path of a file or first level directory
    
    fs - file system
    
    Returns:
    
    avro schema
    
    Throws:
    
    IOException

Class AvroStorageUtils

Field Summary

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Field Detail

BooleanSchema

LongSchema

FloatSchema

DoubleSchema

IntSchema

StringSchema

BytesSchema

NullSchema

PATH_FILTER

Constructor Detail

AvroStorageUtils

Method Detail

createUDField

createUDPartialRecordSchema

isUDPartialRecordSchema

getUDField

getPaths

getAllFilesRecursively

noDir

getLast

mergeSchema

getSchemaToMergedSchemaMap

wrapAsUnion

containsRecursiveRecord

containsRecursiveRecord

containsGenericUnion

containsGenericUnion

isAcceptableUnion

wrapAsTuple

isTupleWrapper

getAcceptedType

getSchema