Apache > Hadoop > Pig
 

Testing and Diagnostics

Diagnostic Operators

DESCRIBE

Returns the schema of a relation.

Syntax

DESCRIBE alias;       

Terms

alias

The name of a relation.

Usage

Use the DESCRIBE operator to view the schema of a relation. You can view outer relations as well as relations defined in a nested FOREACH statement.

Example

In this example a schema is specified using the AS clause. If all data conforms to the schema, Pig will use the assigned types.

A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

B = FILTER A BY name matches 'J.+';

C = GROUP B BY name;

D = FOREACH C GENERATE COUNT(B.age);

DESCRIBE A;
A: {name: chararray,age: int,gpa: float}

DESCRIBE B;
B: {name: chararray,age: int,gpa: float}

DESCRIBE C;
C: {group: chararray,B: {(name: chararray,age: int,gpa: float)}}

DESCRIBE D;
D: {long}

In this example no schema is specified. All fields default to type bytearray or long (see Data Types).

a = LOAD 'student';

b = FILTER a BY $0 matches 'J.+';

c = GROUP b BY $0;

d = FOREACH c GENERATE COUNT(b.$1);

DESCRIBE a;
Schema for a unknown.

DESCRIBE b;
2008-12-05 01:17:15,316 [main] WARN  org.apache.pig.PigServer - bytearray is implicitly cast to chararray under LORegexp Operator
Schema for b unknown.

DESCRIBE c;
2008-12-05 01:17:23,343 [main] WARN  org.apache.pig.PigServer - bytearray is implicitly caste to chararray under LORegexp Operator
c: {group: bytearray,b: {null}}

DESCRIBE d;
2008-12-05 03:04:30,076 [main] WARN  org.apache.pig.PigServer - bytearray is implicitly caste to chararray under LORegexp Operator
d: {long}

This example shows how to view the schema of a nested relation using the :: operator.

A = LOAD 'studentab10k' AS (name, age, gpa); 
B = GROUP A BY name; 
C = FOREACH B { 
     D = DISTINCT A.age; 
     GENERATE COUNT(D), group;} 

DESCRIBE C::D; 
D: {age: bytearray} 

DUMP

Dumps or displays results to screen.

Syntax

DUMP alias;       

Terms

alias

The name of a relation.

Usage

Use the DUMP operator to run (execute) Pig Latin statements and display the results to your screen. DUMP is meant for interactive mode; statements are executed immediately and the results are not saved (persisted). You can use DUMP as a debugging device to make sure that the results you are expecting are actually generated.

Note that production scripts SHOULD NOT use DUMP as it will disable multi-query optimizations and is likely to slow down execution (see Store vs. Dump).

Example

In this example a dump is performed after each statement.

A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

DUMP A;
(John,18,4.0F)
(Mary,19,3.7F)
(Bill,20,3.9F)
(Joe,22,3.8F)
(Jill,20,4.0F)

B = FILTER A BY name matches 'J.+';

DUMP B;
(John,18,4.0F)
(Joe,22,3.8F)
(Jill,20,4.0F)

EXPLAIN

Displays execution plans.

Syntax

EXPLAIN [–script pigscript] [–out path] [–brief] [–dot] [-xml] [–param param_name = param_value] [–param_file file_name] alias; 

Terms

–script

Use to specify a Pig script.

–out

Use to specify the output path (directory).

Will generate a logical_plan[.txt|.dot], physical_plan[.text|.dot], exec_plan[.text|.dot] file in the specified path.

Default (no path specified): Stdout

–brief

Does not expand nested plans (presenting a smaller graph for overview).

–dot, -xml

Text mode (default): multiple output (split) will be broken out in sections.

Dot mode: outputs a format that can be passed to the dot utility for graphical display – will generate a directed-acyclic-graph (DAG) of the plans in any supported format (.gif, .jpg ...).

Xml mode: outputs a xml which represent the plan (only logical plan is shown currently).

–param param_name = param_value

See Parameter Substitution.

–param_file file_name

See Parameter Substitution.

alias

The name of a relation.

Usage

Use the EXPLAIN operator to review the logical, physical, and map reduce execution plans that are used to compute the specified relationship.

If no script is given:

  • The logical plan shows a pipeline of operators to be executed to build the relation. Type checking and backend-independent optimizations (such as applying filters early on) also apply.

  • The physical plan shows how the logical operators are translated to backend-specific physical operators. Some backend optimizations also apply.

  • The mapreduce plan shows how the physical operators are grouped into map reduce jobs.

If a script without an alias is specified, it will output the entire execution graph (logical, physical, or map reduce).

If a script with a alias is specified, it will output the plan for the given alias.

Example

In this example the EXPLAIN operator produces all three plans. (Note that only a portion of the output is shown in this example.)

A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

B = GROUP A BY name;

C = FOREACH B GENERATE COUNT(A.age);

EXPLAIN C;
-----------------------------------------------
Logical Plan:
-----------------------------------------------
Store xxx-Fri Dec 05 19:42:29 UTC 2008-23 Schema: {long} Type: Unknown
|
|---ForEach xxx-Fri Dec 05 19:42:29 UTC 2008-15 Schema: {long} Type: bag
 etc ...  

-----------------------------------------------
Physical Plan:
-----------------------------------------------
Store(fakefile:org.apache.pig.builtin.PigStorage) - xxx-Fri Dec 05 19:42:29 UTC 2008-40
|
|---New For Each(false)[bag] - xxx-Fri Dec 05 19:42:29 UTC 2008-39
    |   |
    |   POUserFunc(org.apache.pig.builtin.COUNT)[long] - xxx-Fri Dec 05 
 etc ...  

--------------------------------------------------
| Map Reduce Plan                               
-------------------------------------------------
MapReduce node xxx-Fri Dec 05 19:42:29 UTC 2008-41
Map Plan
Local Rearrange[tuple]{chararray}(false) - xxx-Fri Dec 05 19:42:29 UTC 2008-34
|   |
|   Project[chararray][0] - xxx-Fri Dec 05 19:42:29 UTC 2008-35
 etc ...  

If you are running in Tez mode, Map Reduce Plan will be replaced with Tez Plan:

#--------------------------------------------------
# There are 1 DAGs in the session
#--------------------------------------------------
#--------------------------------------------------
# TEZ DAG plan: PigLatin:185.pig-0_scope-0
#--------------------------------------------------
Tez vertex scope-21	->	Tez vertex scope-22,
Tez vertex scope-22

Tez vertex scope-21
# Plan on vertex
B: Local Rearrange[tuple]{chararray}(false) - scope-35	->	 scope-22
 etc ...  

ILLUSTRATE

Displays a step-by-step execution of a sequence of statements.

Syntax

ILLUSTRATE {alias | -script scriptfile}; 

Terms

alias

The name of a relation.

-script scriptfile

The script keyword followed by the name of a Pig script (for example, myscript.pig).

The script file should not contain an ILLUSTRATE statement.

Usage

Use the ILLUSTRATE operator to review how data is transformed through a sequence of Pig Latin statements. ILLUSTRATE allows you to test your programs on small datasets and get faster turnaround times.

ILLUSTRATE is based on an example generator (see Generating Example Data for Dataflow Programs). The algorithm works by retrieving a small sample of the input data and then propagating this data through the pipeline. However, some operators, such as JOIN and FILTER, can eliminate tuples from the data - and this could result in no data following through the pipeline. To address this issue, the algorithm will automatically generate example data, in near real-time. Thus, you might see data propagating through the pipeline that was not found in the original input data, but this data changes nothing and ensures that you will be able to examine the semantics of your Pig Latin statements.

As shown in the examples below, you can use ILLUSTRATE to review a relation or an entire Pig script.

Example - Relation

This example demonstrates how to use ILLUSTRATE with a relation. Note that the LOAD statement must include a schema (the AS clause).

grunt> visits = LOAD 'visits.txt' AS (user:chararray, url:chararray, timestamp:chararray);
grunt> DUMP visits;

(Amy,yahoo.com,19990421)
(Fred,harvard.edu,19991104)
(Amy,cnn.com,20070218)
(Frank,nba.com,20070305)
(Fred,berkeley.edu,20071204)
(Fred,stanford.edu,20071206)

grunt> recent_visits = FILTER visits BY timestamp >= '20071201';
grunt> user_visits = GROUP recent_visits BY user;
grunt> num_user_visits = FOREACH user_visits GENERATE group, COUNT(recent_visits);
grunt> DUMP num_user_visits;

(Fred,2)

grunt> ILLUSTRATE num_user_visits;
------------------------------------------------------------------------
| visits     | user: chararray | url: chararray | timestamp: chararray |
------------------------------------------------------------------------
|            | Fred            | berkeley.edu   | 20071204             |
|            | Fred            | stanford.edu   | 20071206             |
|            | Frank           | nba.com        | 20070305             |
------------------------------------------------------------------------
-------------------------------------------------------------------------------
| recent_visits     | user: chararray | url: chararray | timestamp: chararray |
-------------------------------------------------------------------------------
|                   | Fred            | berkeley.edu   | 20071204             |
|                   | Fred            | stanford.edu   | 20071206             |
-------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------
| user_visits     | group: chararray | recent_visits: bag({user: chararray,url: chararray,timestamp: chararray}) |
------------------------------------------------------------------------------------------------------------------
|                 | Fred             | {(Fred, berkeley.edu, 20071204), (Fred, stanford.edu, 20071206)}          |
------------------------------------------------------------------------------------------------------------------
--------------------------------------------------
| num_user_visits     | group: chararray | long  |
--------------------------------------------------
|                     | Fred             | 2     |
--------------------------------------------------

Example - Script

This example demonstrates how to use ILLUSTRATE with a Pig script. Note that the script itself should not contain an ILLUSTRATE statement.

grunt> cat visits.txt
Amy     yahoo.com       19990421
Fred    harvard.edu     19991104
Amy     cnn.com 20070218
Frank   nba.com 20070305
Fred    berkeley.edu    20071204
Fred    stanford.edu    20071206

grunt> cat visits.pig
visits = LOAD 'visits.txt' AS (user, url, timestamp);
recent_visits = FILTER visits BY timestamp >= '20071201';
historical_visits = FILTER visits BY timestamp <= '20000101';
DUMP recent_visits;
DUMP historical_visits;
STORE recent_visits INTO 'recent';
STORE historical_visits INTO 'historical';

grunt> exec visits.pig

(Fred,berkeley.edu,20071204)
(Fred,stanford.edu,20071206)

(Amy,yahoo.com,19990421)
(Fred,harvard.edu,19991104)


grunt> illustrate -script visits.pig

------------------------------------------------------------------------
| visits     | user: bytearray | url: bytearray | timestamp: bytearray |
------------------------------------------------------------------------
|            | Amy             | yahoo.com      | 19990421             |
|            | Fred            | stanford.edu   | 20071206             |
------------------------------------------------------------------------
-------------------------------------------------------------------------------
| recent_visits     | user: bytearray | url: bytearray | timestamp: bytearray |
-------------------------------------------------------------------------------
|                   | Fred            | stanford.edu   | 20071206             |
-------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
| Store : recent_visits     | user: bytearray | url: bytearray | timestamp: bytearray |
---------------------------------------------------------------------------------------
|                           | Fred            | stanford.edu   | 20071206             |
---------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
| historical_visits     | user: bytearray | url: bytearray | timestamp: bytearray |
-----------------------------------------------------------------------------------
|                       | Amy             | yahoo.com      | 19990421             |
-----------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------
| Store : historical_visits     | user: bytearray | url: bytearray | timestamp: bytearray |
-------------------------------------------------------------------------------------------
|                               | Amy             | yahoo.com      | 19990421             |
-------------------------------------------------------------------------------------------

Pig Scripts and MapReduce Job IDs (MapReduce mode only)

Complex Pig scripts often generate many MapReduce jobs. To help you debug a script, Pig prints a summary of the execution that shows which relations (aliases) are mapped to each MapReduce job.

JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime 
    MinReduceTime AvgReduceTime Alias Feature Outputs
job_201004271216_12712 1 1 3 3 3 12 12 12 B,C GROUP_BY,COMBINER
job_201004271216_12713 1 1 3 3 3 12 12 12 D SAMPLER
job_201004271216_12714 1 1 3 3 3 12 12 12 D ORDER_BY,COMBINER 
    hdfs://mymachine.com:9020/tmp/temp743703298/tmp-2019944040,

Pig Statistics

Pig Statistics is a framework for collecting and storing script-level statistics for Pig Latin. Characteristics of Pig Latin scripts and the resulting MapReduce jobs are collected while the script is executed. These statistics are then available for Pig users and tools using Pig (such as Oozie) to retrieve after the job is done.

The new Pig statistics and the existing Hadoop statistics can also be accessed via the Hadoop job history file (and job xml file). Piggybank has a HadoopJobHistoryLoader which acts as an example of using Pig itself to query these statistics (the loader can be used as a reference implementation but is NOT supported for production use).

Java API

Several new public classes make it easier for external tools such as Oozie to integrate with Pig statistics.

The Pig statistics are available here: http://pig.apache.org/docs/r0.14.0/api/

The stats classes are in the package: org.apache.pig.tools.pigstats

  • PigStats
  • SimplePigStats
  • EmbeddedPigStats
  • JobStats
  • TezPigScriptStats
  • TezDAGStats
  • TezVertexStats
  • OutputStats
  • InputStats

The PigRunner class mimics the behavior of the Main class but gives users a statistics object back. Optionally, you can call the API with an implementation of progress listener which will be invoked by Pig runtime during the execution.

package org.apache.pig;

public abstract class PigRunner {
    public static PigStats run(String[] args, PigProgressNotificationListener listener)
}

public interface PigProgressNotificationListener extends java.util.EventListener {
    // just before the launch of MR jobs for the script
    public void LaunchStartedNotification(int numJobsToLaunch);
    // number of jobs submitted in a batch
    public void jobsSubmittedNotification(int numJobsSubmitted);
    // a job is started
    public void jobStartedNotification(String assignedJobId);
    // a job is completed successfully
    public void jobFinishedNotification(JobStats jobStats);
    // a job is failed
    public void jobFailedNotification(JobStats jobStats);
    // a user output is completed successfully
    public void outputCompletedNotification(OutputStats outputStats);
    // updates the progress as percentage
    public void progressUpdatedNotification(int progress);
    // the script execution is done
    public void launchCompletedNotification(int numJobsSucceeded);
}

Depends on the type of the pig script, PigRunner.run() returns a particular subclass of PigStats: SimplePigStats(MapReduce/local mode), TezPigScriptStats(Tez/Tez local mode) or EmbeddedPigStats(embedded script). SimplePigStats contains a map of JobStats which capture the stats for each MapReduce job of the Pig script. TezPigScriptStats contains a map of TezDAGStats which capture the stats for each Tez DAG of the Pig script, and TezDAGStats contains a map of TezVertexStats which capture the stats for each vertex within the Tez DAG. Depending on the execution type, EmbeddedPigStats contains a map of SimplePigStats or TezPigScriptStats, which captures the Pig job launched in the embeded script.

If one is running Pig in Tez mode (or both Tez/MapReduce mode), should pass PigTezProgressNotificationListener which extends PigProgressNotificationListener to PigRunner.run() to make sure to get notification in both Tez mode or MapReduce mode.

Job XML

The following entries are included in job conf:

Pig Statistic

Description

pig.script.id

The UUID for the script. All jobs spawned by the script have the same script ID.

pig.script

The base64 encoded script text.

pig.command.line

The command line used to invoke the script.

pig.hadoop.version

The Hadoop version installed.

pig.version

The Pig version used.

pig.input.dirs

A comma-separated list of input directories for the job.

pig.map.output.dirs

A comma-separated list of output directories in the map phase of the job.

pig.reduce.output.dirs

A comma-separated list of output directories in the reduce phase of the job.

pig.parent.jobid

A comma-separated list of parent job ids.

pig.script.features

A list of Pig features used in the script.

pig.job.feature

A list of Pig features used in the job.

pig.alias

The alias associated with the job.

Hadoop Job History Loader

The HadoopJobHistoryLoader in Piggybank loads Hadoop job history files and job xml files from file system. For each MapReduce job, the loader produces a tuple with schema (j:map[], m:map[], r:map[]). The first map in the schema contains job-related entries. Here are some of important key names in the map:

PIG_SCRIPT_ID

CLUSTER

QUEUE_NAME

JOBID

JOBNAME

STATUS

USER

HADOOP_VERSION

PIG_VERSION

PIG_JOB_FEATURE

PIG_JOB_ALIAS

PIG_JOB_PARENTS

SUBMIT_TIME

LAUNCH_TIME

FINISH_TIME

TOTAL_MAPS

TOTAL_REDUCES

Examples that use the loader to query Pig statistics are shown below.

Examples

Find scripts that generate more then three MapReduce jobs:

a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = group a by (j#'PIG_SCRIPT_ID', j#'USER', j#'JOBNAME');
c = foreach b generate group.$1, group.$2, COUNT(a);
d = filter c by $2 > 3;
dump d;

Find the running time of each script (in seconds):

a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, 
         (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end;
c = group b by (id, user, script_name)
d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start)/1000;
dump d;

Find the number of scripts run by user and queue on a cluster:

a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue;
c = group b by (id, user, queue) parallel 10;
d = foreach c generate group.user, group.queue, COUNT(b);
dump d;

Find scripts that have failed jobs:

a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate (Chararray) j#'STATUS' as status, j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, j#'JOBID' as job;
c = filter b by status != 'SUCCESS';
dump c;

Find scripts that use only the default parallelism:

a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
c = group b by (id, user, script_name) parallel 10;
d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces;
e = filter d by max_reduces == 1;
dump e;

Pig Progress Notification Listener

Pig provides the ability to register a listener to receive event notifications during the execution of a script. Events include MapReduce plan creation, script launch, script progress, script completion, job submit, job start, job completion and job failure.

To register a listener, set the pig.notification.listener parameter to the fully qualified class name of an implementation of org.apache.pig.tools.pigstats.PigProgressNotificationListener. The class must exist on the classpath of the process submitting the Pig job. If the pig.notification.listener.arg parameter is set, the value will be passed to a constructor of the implementing class that takes a single String.

PigUnit

PigUnit is a simple xUnit framework that enables you to easily test your Pig scripts. With PigUnit you can perform unit testing, regression testing, and rapid prototyping. No cluster set up is required if you run Pig in local mode.

Build PigUnit

To compile PigUnit run the command shown below from the Pig trunk. The compile will create the pigunit.jar file.

$pig_trunk ant pigunit-jar   

Run PigUnit

You can run PigUnit using Pig's local mode or mapreduce mode.

Local Mode

PigUnit runs in Pig's local mode by default. Local mode is fast and enables you to use your local file system as the HDFS cluster. Local mode does not require a real cluster but a new local one is created each time.

Other Modes

PigUnit also runs in Pig's mapreduce/tez/tez_local mode. Mapreduce/Tez mode requires you to use a Hadoop cluster and HDFS installation. It is enabled when the Java system property pigunit.exectype is set to specific values (mr/tez/tez_local): e.g. -Dpigunit.exectype=mr or System.getProperties().setProperty("pigunit.exectype", "mr"), which means PigUnit will run in mr mode. The cluster you select to run mr/tez test must be specified in the CLASSPATH (similar to the HADOOP_CONF_DIR variable).

PigUnit Example

Many PigUnit examples are available in the PigUnit tests.

The example included here computes the top N of the most common queries. The Pig script, top_queries.pig, is similar to the Query Phrase Popularity in the Pig tutorial. It expects an input a file of queries and a parameter n (n is 2 in our case in order to do a top 2).

Setting up a test for this script is easy because the argument and the input data are specified by two text arrays. It is the same for the expected output of the script that will be compared to the actual result of the execution of the Pig script.

Java Test

  @Test
  public void testTop2Queries() {
    String[] args = {
        "n=2",
        };
 
    PigTest test = new PigTest("top_queries.pig", args);
 
    String[] input = {
        "yahoo",
        "yahoo",
        "yahoo",
        "twitter",
        "facebook",
        "facebook",
        "linkedin",
    };
 
    String[] output = {
        "(yahoo,3)",
        "(facebook,2)",
    };
 
    test.assertOutput("data", input, "queries_limit", output);
  }

top_queries.pig

data =
    LOAD 'input'
    AS (query:CHARARRAY);
     
queries_group =
    GROUP data
    BY query; 
    
queries_count = 
    FOREACH queries_group 
    GENERATE 
        group AS query, 
        COUNT(data) AS total;
        
queries_ordered =
    ORDER queries_count
    BY total DESC, query;
            
queries_limit =
    LIMIT queries_ordered $n;

STORE queries_limit INTO 'output';

Run

The test can be executed by JUnit (or any other Java testing framework). It requires:

  1. pig.jar
  2. pigunit.jar

The test takes about 25s to run and should pass. In case of error (for example change the parameter n to n=3), the diff of output is displayed:

junit.framework.ComparisonFailure: null expected:<...ahoo,3)
(facebook,2)[]> but was:<...ahoo,3)
(facebook,2)[
(linkedin,1)]>
        at junit.framework.Assert.assertEquals(Assert.java:81)
        at junit.framework.Assert.assertEquals(Assert.java:87)
        at org.apache.pig.pigunit.PigTest.assertEquals(PigTest.java:272)

Troubleshooting Tips

Common problems you may encounter are discussed below.

Classpath in Mapreduce Mode

When using PigUnit in mapreduce mode, be sure to include the $HADOOP_CONF_DIR of the cluster in your CLASSPATH.

MiniCluster generates one in build/classes.

org.apache.pig.backend.executionengine.ExecException: 
ERROR 4010: Cannot find hadoop configurations in classpath 
(neither hadoop-site.xml nor core-site.xml was found in the classpath).
If you plan to use local mode, please put -x local option in command line

UDF jars Not Found

This error means that you are missing some jars in your test environment.

WARN util.JarManager: Couldn't find the jar for 
org.apache.pig.piggybank.evaluation.string.LOWER, skip it

Storing Data

Pig currently drops all STORE and DUMP commands. You can tell PigUnit to keep the commands and execute the script:

test = new PigTest(PIG_SCRIPT, args);   
test.unoverride("STORE");
test.runScript();

Cache Archive

For cache archive to work, your test environment needs to have the cache archive options specified by Java properties or in an additional XML configuration in its CLASSPATH.

If you use a local cluster, you need to set the required environment variables before starting it:

export LD_LIBRARY_PATH=/home/path/to/lib

Future Enhancements

Improvements and other components based on PigUnit that could be built later.

For example, we could build a PigTestCase and PigTestSuite on top of PigTest to:

  1. Add the notion of workspaces for each test.
  2. Remove the boiler plate code appearing when there is more than one test methods.
  3. Add a standalone utility that reads test configurations and generates a test report.