Pig 0.13.0 API

Pig is a platform for data flow programming on large data sets in a parallel environment.


pig
org.apache.pig Public interfaces and classes for Pig.
org.apache.pig.backend  
org.apache.pig.backend.datastorage  
org.apache.pig.backend.executionengine  
org.apache.pig.backend.hadoop  
org.apache.pig.backend.hadoop.accumulo  
org.apache.pig.backend.hadoop.datastorage  
org.apache.pig.backend.hadoop.executionengine  
org.apache.pig.backend.hadoop.executionengine.fetch  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans  
org.apache.pig.backend.hadoop.executionengine.physicalLayer Implementation of physical operators that use Hadoop as the execution engine and data storage.
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators  
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.regex  
org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans  
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators  
org.apache.pig.backend.hadoop.executionengine.physicalLayer.util  
org.apache.pig.backend.hadoop.executionengine.util  
org.apache.pig.backend.hadoop.hbase  
org.apache.pig.backend.hadoop.streaming  
org.apache.pig.builtin This package contains builtin Pig UDFs.
org.apache.pig.builtin.mock  
org.apache.pig.classification  
org.apache.pig.data This package contains implementations of Pig specific data types as well as support functions for reading, writing, and using all Pig data types.
org.apache.pig.data.utils  
org.apache.pig.impl  
org.apache.pig.impl.builtin  
org.apache.pig.impl.io  
org.apache.pig.impl.logicalLayer The logical operators that represent a Pig script and tools for manipulating those operators.
org.apache.pig.impl.logicalLayer.schema  
org.apache.pig.impl.logicalLayer.validators  
org.apache.pig.impl.plan  
org.apache.pig.impl.plan.optimizer  
org.apache.pig.impl.streaming  
org.apache.pig.impl.util  
org.apache.pig.impl.util.avro  
org.apache.pig.newplan  
org.apache.pig.newplan.logical  
org.apache.pig.newplan.logical.expression  
org.apache.pig.newplan.logical.optimizer  
org.apache.pig.newplan.logical.relational  
org.apache.pig.newplan.logical.rules  
org.apache.pig.newplan.logical.visitor  
org.apache.pig.newplan.optimizer  
org.apache.pig.parser  
org.apache.pig.pen  
org.apache.pig.pen.util  
org.apache.pig.scripting  
org.apache.pig.scripting.groovy  
org.apache.pig.scripting.jruby  
org.apache.pig.scripting.js  
org.apache.pig.scripting.jython  
org.apache.pig.scripting.streaming.python  
org.apache.pig.tools  
org.apache.pig.tools.cmdline  
org.apache.pig.tools.counters  
org.apache.pig.tools.grunt  
org.apache.pig.tools.parameters  
org.apache.pig.tools.pigstats  
org.apache.pig.tools.pigstats.mapreduce  
org.apache.pig.tools.streams  
org.apache.pig.tools.timer  
org.apache.pig.validator  

 

contrib: Piggybank
org.apache.pig.piggybank.evaluation  
org.apache.pig.piggybank.evaluation.datetime  
org.apache.pig.piggybank.evaluation.datetime.convert  
org.apache.pig.piggybank.evaluation.datetime.diff  
org.apache.pig.piggybank.evaluation.datetime.truncate  
org.apache.pig.piggybank.evaluation.decode  
org.apache.pig.piggybank.evaluation.math  
org.apache.pig.piggybank.evaluation.stats  
org.apache.pig.piggybank.evaluation.string  
org.apache.pig.piggybank.evaluation.util  
org.apache.pig.piggybank.evaluation.util.apachelogparser  
org.apache.pig.piggybank.evaluation.xml  
org.apache.pig.piggybank.storage  
org.apache.pig.piggybank.storage.allloader  
org.apache.pig.piggybank.storage.apachelog  
org.apache.pig.piggybank.storage.avro  
org.apache.pig.piggybank.storage.hiverc  
org.apache.pig.piggybank.storage.partition  

 

Pig is a platform for data flow programming on large data sets in a parallel environment. It consists of Pig Latin, a language for specifying these programs; a compiler for that language; and an execution engine that runs the programs.

Pig runs on Hadoop MapReduce, reading data from and writing data to HDFS, and processing it with one or more MapReduce jobs.

Design

This section gives a very high-level overview of the design of the Pig system. Elsewhere in this documentation, you can find the design of a particular package or class by looking for the Design heading in its documentation.

Overview

Pig's design is guided by the Pig philosophy.

Pig shares many similarities with a traditional RDBMS design. It has a parser, type checker, optimizer, and operators that perform the data processing. However, there are some significant differences: Pig does not have a data catalog, has no transactions, does not directly manage data storage, and does not implement the execution framework.

High Level Architecture

Pig is split between the front end and the back end of the engine. In the front end, the parser transforms a Pig Latin script into a Logical Plan. Semantic checks (such as type checking) and some optimizations (such as determining which fields in the data need to be read to satisfy the script) are done on this Logical Plan. The Logical Plan is then transformed into a PhysicalPlan, which contains the operators that will be applied to the data. The MRCompiler divides the Physical Plan into a set of MapReduce jobs, represented as an MROperPlan. This MROperPlan (also known as the map reduce plan) is then optimized: for example, the combiner is used where possible, and jobs that scan the same input data are combined where possible. Finally, the JobControlCompiler generates a set of MapReduce jobs, which are submitted to Hadoop and monitored by the MapReduceLauncher.
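The order of the front-end stages described above can be sketched in plain Java. This is only a toy model under stated assumptions: FrontEndSketch, PhysicalOp, NEEDS_SHUFFLE, and every method name are hypothetical and are not Pig classes, the "parser" only extracts the operator keyword of each statement, and the rule that a job holds at most one shuffle operator is a simplification rather than the actual MRCompiler logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class FrontEndSketch {
    // Assumption for this sketch: these operators need a shuffle,
    // so at most one of them fits into a single MapReduce job.
    static final Set<String> NEEDS_SHUFFLE =
            new HashSet<>(Arrays.asList("GROUP", "JOIN", "ORDER"));

    // Stand-in for the parser: one logical operator name per statement.
    static List<String> toLogicalPlan(String script) {
        List<String> ops = new ArrayList<>();
        for (String statement : script.split(";")) {
            String s = statement.trim();
            if (s.isEmpty()) {
                continue;
            }
            // Drop the "alias =" prefix when the statement has one.
            int eq = s.indexOf('=');
            String body = (eq >= 0 ? s.substring(eq + 1) : s).trim();
            ops.add(body.split("\\s+")[0].toUpperCase(Locale.ROOT));
        }
        return ops;
    }

    // Stand-in for a physical operator: a name plus whether it needs a shuffle.
    static final class PhysicalOp {
        final String name;
        final boolean needsShuffle;

        PhysicalOp(String name) {
            this.name = name;
            this.needsShuffle = NEEDS_SHUFFLE.contains(name);
        }
    }

    // Logical plan -> physical plan (one physical operator per logical one here).
    static List<PhysicalOp> toPhysicalPlan(List<String> logicalPlan) {
        List<PhysicalOp> plan = new ArrayList<>();
        for (String op : logicalPlan) {
            plan.add(new PhysicalOp(op));
        }
        return plan;
    }

    // Physical plan -> list of jobs; a new job starts at the second shuffle operator.
    static List<List<String>> toMapReducePlan(List<PhysicalOp> physicalPlan) {
        List<List<String>> jobs = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean shuffled = false;
        for (PhysicalOp op : physicalPlan) {
            if (op.needsShuffle && shuffled) {
                jobs.add(current);
                current = new ArrayList<>();
                shuffled = false;
            }
            current.add(op.name);
            shuffled = shuffled || op.needsShuffle;
        }
        if (!current.isEmpty()) {
            jobs.add(current);
        }
        return jobs;
    }

    // Each stage consumes the output of the previous one.
    static List<List<String>> compile(String script) {
        return toMapReducePlan(toPhysicalPlan(toLogicalPlan(script)));
    }

    public static void main(String[] args) {
        String script = "a = LOAD 'in'; b = GROUP a BY $0; "
                + "c = FOREACH b GENERATE group; d = ORDER c BY $0; STORE d INTO 'out';";
        System.out.println(compile(script));
    }
}
```

For the script in main, the sketch prints [[LOAD, GROUP, FOREACH], [ORDER, STORE]], that is, two jobs. The real compiler may produce a different job layout for the same script; the point is only that each plan is derived from the one before it.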

On the back end, each PigGenericMapReduce.Map, PigCombiner.Combine, and PigGenericMapReduce.Reduce uses the pipeline of physical operators constructed in the front end to load, process, and store data.

Programmatic Interface

In addition to the command line and grunt interfaces, users can connect to PigServer from a Java program.

Pig makes it easy for users to extend its functionality by implementing User Defined Functions (UDFs). There are interfaces for defining functions that load data (LoadFunc), store data (StoreFunc), evaluate fields (EvalFunc; this includes collections of data, so user-defined aggregates are possible), and filter data (FilterFunc).
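The general shape of an evaluation function and a filter function can be illustrated without the Pig jar on the classpath. The sketch below is an assumption-laden stand-in: SimpleEvalFunc, Upper, and NonEmpty are hypothetical names, and a List<Object> takes the place of Pig's tuple type. A real UDF would extend the EvalFunc or FilterFunc class named above instead of this interface.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class UdfShapeSketch {
    // Hypothetical stand-in for the eval-function contract; this is not Pig's EvalFunc.
    // The input list plays the role of the fields passed to the function.
    interface SimpleEvalFunc<T> {
        T exec(List<Object> input);
    }

    // Upper-cases the first field; yields null when there is nothing to convert.
    static final class Upper implements SimpleEvalFunc<String> {
        @Override
        public String exec(List<Object> input) {
            if (input == null || input.isEmpty() || input.get(0) == null) {
                return null;
            }
            return input.get(0).toString().toUpperCase(Locale.ROOT);
        }
    }

    // A filter is modelled here as an eval function that returns a Boolean.
    static final class NonEmpty implements SimpleEvalFunc<Boolean> {
        @Override
        public Boolean exec(List<Object> input) {
            return input != null
                    && !input.isEmpty()
                    && input.get(0) != null
                    && !input.get(0).toString().isEmpty();
        }
    }

    public static void main(String[] args) {
        List<Object> row = Arrays.<Object>asList("pig", 13);
        System.out.println(new Upper().exec(row));
        System.out.println(new NonEmpty().exec(row));
    }
}
```

Returning null for missing input, rather than throwing, is a design choice of this sketch that lets a single bad record pass through without failing the whole run.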



Copyright © 2007-2012 The Apache Software Foundation