Overview (Pig 0.18.0 API)

pig
Package	Description
org.apache.pig	Public interfaces and classes for Pig.
org.apache.pig.backend
org.apache.pig.backend.datastorage
org.apache.pig.backend.executionengine
org.apache.pig.backend.hadoop
org.apache.pig.backend.hadoop.accumulo
org.apache.pig.backend.hadoop.datastorage
org.apache.pig.backend.hadoop.executionengine
org.apache.pig.backend.hadoop.executionengine.fetch
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans
org.apache.pig.backend.hadoop.executionengine.optimizer
org.apache.pig.backend.hadoop.executionengine.physicalLayer	Implementation of physical operators that use hadoop as the execution engine and data storage.
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.regex
org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators
org.apache.pig.backend.hadoop.executionengine.physicalLayer.util
org.apache.pig.backend.hadoop.executionengine.spark
org.apache.pig.backend.hadoop.executionengine.spark.converter
org.apache.pig.backend.hadoop.executionengine.spark.operator
org.apache.pig.backend.hadoop.executionengine.spark.optimizer
org.apache.pig.backend.hadoop.executionengine.spark.plan
org.apache.pig.backend.hadoop.executionengine.spark.running
org.apache.pig.backend.hadoop.executionengine.spark.streaming
org.apache.pig.backend.hadoop.executionengine.tez
org.apache.pig.backend.hadoop.executionengine.tez.plan
org.apache.pig.backend.hadoop.executionengine.tez.plan.operator
org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer
org.apache.pig.backend.hadoop.executionengine.tez.plan.udf
org.apache.pig.backend.hadoop.executionengine.tez.runtime
org.apache.pig.backend.hadoop.executionengine.tez.util
org.apache.pig.backend.hadoop.executionengine.util
org.apache.pig.backend.hadoop.hbase
org.apache.pig.backend.hadoop.streaming
org.apache.pig.builtin	This package contains builtin Pig UDFs.
org.apache.pig.builtin.mock
org.apache.pig.classification
org.apache.pig.data	This package contains implementations of Pig specific data types as well as support functions for reading, writing, and using all Pig data types.
org.apache.pig.data.utils
org.apache.pig.impl
org.apache.pig.impl.bloom
org.apache.pig.impl.builtin
org.apache.pig.impl.io
org.apache.pig.impl.io.compress
org.apache.pig.impl.logicalLayer	The logical operators that represent a pig script and tools for manipulating those operators.
org.apache.pig.impl.logicalLayer.schema
org.apache.pig.impl.logicalLayer.validators
org.apache.pig.impl.plan
org.apache.pig.impl.plan.optimizer
org.apache.pig.impl.streaming
org.apache.pig.impl.util
org.apache.pig.impl.util.avro
org.apache.pig.impl.util.hive
org.apache.pig.newplan
org.apache.pig.newplan.logical
org.apache.pig.newplan.logical.expression
org.apache.pig.newplan.logical.optimizer
org.apache.pig.newplan.logical.relational
org.apache.pig.newplan.logical.rules
org.apache.pig.newplan.logical.visitor
org.apache.pig.newplan.optimizer
org.apache.pig.parser
org.apache.pig.pen
org.apache.pig.pen.util
org.apache.pig.scripting
org.apache.pig.scripting.groovy
org.apache.pig.scripting.jruby
org.apache.pig.scripting.js
org.apache.pig.scripting.jython
org.apache.pig.scripting.streaming.python
org.apache.pig.tools
org.apache.pig.tools.cmdline
org.apache.pig.tools.counters
org.apache.pig.tools.grunt
org.apache.pig.tools.parameters
org.apache.pig.tools.pigstats
org.apache.pig.tools.pigstats.mapreduce
org.apache.pig.tools.pigstats.spark
org.apache.pig.tools.pigstats.tez
org.apache.pig.tools.streams
org.apache.pig.tools.timer
org.apache.pig.validator

Pig is a platform for data flow programming on large data sets in a parallel environment. It consists of a language for specifying these programs (Pig Latin), a compiler for that language, and an execution engine that runs the programs.

Pig runs on Hadoop MapReduce, reading data from and writing data to HDFS, and processing it in one or more MapReduce jobs.

Design

This section gives a very high-level overview of the design of the Pig system. Throughout the documentation, the design notes for a given package or class can be found under its Design heading.

Overview

Pig's design is guided by our Pig philosophy.

Pig shares many similarities with a traditional RDBMS design. It has a parser, type checker, optimizer, and operators that perform the data processing. However, there are some significant differences: Pig has no data catalog and no transactions, it does not directly manage data storage, and it does not implement the execution framework.

High Level Architecture

Pig is split between the front and back ends of the engine. In the front end, the parser transforms a Pig Latin script into a logical plan. Semantic checks (such as type checking) and some optimizations (such as determining which fields in the data need to be read to satisfy the script) are done on this logical plan. The logical plan is then transformed into a PhysicalPlan, which contains the operators that will be applied to the data. The MRCompiler then divides the physical plan into a set of MapReduce jobs, represented as an MROperPlan. This MROperPlan (also called the map reduce plan) is then optimized: for example, the combiner is used where possible, and jobs that scan the same input data are combined where possible. Finally, a set of MapReduce jobs is generated by the JobControlCompiler. These jobs are submitted to Hadoop and monitored by the MapReduceLauncher.

On the back end, each of PigGenericMapReduce.Map, PigCombiner.Combine, and PigGenericMapReduce.Reduce uses the pipeline of physical operators constructed in the front end to load, process, and store data.
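
The order of the front-end stages can be pictured as a chain of plan transformations. The following is a minimal, self-contained sketch in plain Java. It does not use Pig's classes: every name in it (PipelineSketch, the stage methods, the "PO_" prefix, the list of shuffle operators, and the rule of at most one shuffle per job) is a hypothetical simplification chosen only to mirror the sequence script, logical plan, physical plan, MapReduce jobs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical model, NOT Pig's API: a "plan" here is just a list of operator names.
final class PipelineSketch {

    // Assumption: operators that force a shuffle and therefore need a reduce phase.
    private static final Set<String> SHUFFLE_OPS =
            new HashSet<>(Arrays.asList("GROUP", "COGROUP", "JOIN", "ORDER", "DISTINCT"));

    // Parser stand-in: one logical operator per statement, named by its keyword.
    // Naive on purpose: it splits on ';' and on the first '=' without real parsing.
    static List<String> toLogicalPlan(String script) {
        List<String> plan = new ArrayList<>();
        for (String statement : script.split(";")) {
            String s = statement.trim();
            if (s.isEmpty()) {
                continue;
            }
            int eq = s.indexOf('=');
            String body = (eq >= 0 ? s.substring(eq + 1) : s).trim();
            plan.add(body.split("\\s+")[0].toUpperCase(Locale.ROOT));
        }
        return plan;
    }

    // Logical-to-physical stand-in: one physical operator per logical operator.
    static List<String> toPhysicalPlan(List<String> logicalPlan) {
        List<String> plan = new ArrayList<>();
        for (String op : logicalPlan) {
            plan.add("PO_" + op);
        }
        return plan;
    }

    // Job-splitting stand-in: cut the physical plan so each job holds at most one shuffle.
    static List<List<String>> toMapReduceJobs(List<String> physicalPlan) {
        List<List<String>> jobs = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean hasShuffle = false;
        for (String op : physicalPlan) {
            boolean shuffle = SHUFFLE_OPS.contains(op.substring("PO_".length()));
            if (shuffle && hasShuffle) {
                // A second shuffle cannot fit in the same job, so start a new one.
                jobs.add(current);
                current = new ArrayList<>();
                hasShuffle = false;
            }
            current.add(op);
            hasShuffle = hasShuffle || shuffle;
        }
        if (!current.isEmpty()) {
            jobs.add(current);
        }
        return jobs;
    }

    public static void main(String[] args) {
        String script = "a = LOAD 'in'; b = GROUP a BY $0; c = ORDER b BY $0; STORE c INTO 'out';";
        System.out.println(toMapReduceJobs(toPhysicalPlan(toLogicalPlan(script))));
    }
}
```

For the script in main, the sketch produces two jobs: the first holds the load and the group, the second holds the order and the store. The real compiler works on operator graphs rather than flat lists and applies further optimizations, so this only illustrates the direction of the transformations.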

Programmatic Interface

In addition to the command-line and Grunt interfaces, users can connect to PigServer from a Java program.

Pig makes it easy for users to extend its functionality by implementing User Defined Functions (UDFs). There are interfaces for loading data (LoadFunc), storing data (StoreFunc), evaluating fields (EvalFunc), and filtering data (FilterFunc). Evaluation functions can operate on collections of data, so user-defined aggregates are possible.
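
To make the shape of an evaluation UDF concrete, here is a minimal sketch in plain Java. It deliberately avoids Pig's own classes so that it runs on its own: SimpleEvalFunc, Upper, SumLongs, and IsPositive are hypothetical names, and a List<Object> stands in for an input tuple. It shows a scalar evaluation, an aggregate over a collection field, and a filter modelled as an evaluation that returns a Boolean.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins, NOT Pig's API: a List<Object> plays the role of an input tuple.
final class UdfSketch {

    // Stand-in for an evaluation function: maps one input tuple to one value.
    interface SimpleEvalFunc<T> {
        T exec(List<Object> input);
    }

    // Scalar evaluation: upper-cases the first field, returning null for missing input.
    static final class Upper implements SimpleEvalFunc<String> {
        @Override
        public String exec(List<Object> input) {
            if (input == null || input.isEmpty() || input.get(0) == null) {
                return null;
            }
            return input.get(0).toString().toUpperCase(Locale.ROOT);
        }
    }

    // Aggregate evaluation: the first field is a collection of numbers, the result is their sum.
    static final class SumLongs implements SimpleEvalFunc<Long> {
        @Override
        public Long exec(List<Object> input) {
            if (input == null || input.isEmpty() || input.get(0) == null) {
                return null;
            }
            long total = 0;
            for (Object value : (Iterable<?>) input.get(0)) {
                if (value != null) {
                    total += ((Number) value).longValue();
                }
            }
            return total;
        }
    }

    // Filter: an evaluation whose result decides whether a record is kept.
    static final class IsPositive implements SimpleEvalFunc<Boolean> {
        @Override
        public Boolean exec(List<Object> input) {
            if (input == null || input.isEmpty() || input.get(0) == null) {
                return Boolean.FALSE;
            }
            return ((Number) input.get(0)).longValue() > 0;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Upper().exec(Arrays.<Object>asList("pig")));
        System.out.println(new SumLongs().exec(Arrays.<Object>asList(Arrays.asList(1L, 2L, 3L))));
        System.out.println(new IsPositive().exec(Arrays.<Object>asList(-5)));
    }
}
```

A real UDF would extend the corresponding Pig class and receive Pig's tuple type instead of a list. The null checks reflect a common defensive choice for functions that may see missing fields, not a documented requirement.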