org.apache.pig
Class PigServer

java.lang.Object
  extended by org.apache.pig.PigServer
Direct Known Subclasses:
ToolsPigServer

@InterfaceAudience.Public
@InterfaceStability.Stable
public class PigServer
extends Object

A class for Java programs to connect to Pig. Typically a program will create a PigServer instance. The programmer then registers queries using registerQuery() and retrieves results using openIterator() or store(). After doing so, the shutdown() method should be called to free any resources used by the current PigServer instance. Not doing so could result in a memory leak.


Nested Class Summary
protected  class PigServer.Graph
           
 
Field Summary
protected  Deque<PigServer.Graph> graphs
           
protected  org.apache.commons.logging.Log log
           
protected  PigContext pigContext
           
static String PRETTY_PRINT_SCHEMA_PROPERTY
           
protected  String scope
           
 
Constructor Summary
PigServer(ExecType execType)
           
PigServer(ExecType execType, Properties properties)
           
PigServer(PigContext context)
           
PigServer(PigContext context, boolean connect)
           
PigServer(String execTypeString)
           
 
Method Summary
 void addPathToSkip(String path)
          Add a path to be skipped while automatically shipping binaries for streaming.
 long capacity()
          Returns the unused byte capacity of an HDFS filesystem.
 void debugOff()
          Set the logging level to the default.
 void debugOn()
          Set the logging level to DEBUG.
 boolean deleteFile(String filename)
          Delete a file.
 void discardBatch()
          Discards a batch of Pig commands.
protected  String doParamSubstitution(InputStream in, Map<String,String> params, List<String> paramsFiles)
          Do parameter substitution.
 Schema dumpSchema(String alias)
          Write the schema for an alias to System.out.
 Schema dumpSchemaNested(String alias, String nestedAlias)
          Write the schema for a nestedAlias to System.out.
 List<ExecJob> executeBatch()
          Submits a batch of Pig commands for execution.
 boolean existsFile(String filename)
          Test whether a file exists.
 void explain(String alias, PrintStream stream)
          Provide information on how a pig query will be executed.
 void explain(String alias, String format, boolean verbose, boolean markAsExecute, PrintStream lps, PrintStream pps, PrintStream eps)
          Provide information on how a pig query will be executed.
 long fileSize(String filename)
          Returns the length of a file in bytes which exists in the HDFS (accounts for replication).
 Map<String,LogicalPlan> getAliases()
          Return a map containing the logical plan associated with each alias.
 Set<String> getAliasKeySet()
          Get the set of all current aliases.
protected  PigServer.Graph getClonedGraph()
          Creates a clone of the current DAG
 Map<Operator,DataBag> getExamples(String alias)
           
protected  List<ExecJob> getJobs(PigStats stats)
          Retrieves a list of Job objects from the PigStats object
 PigContext getPigContext()
           
 boolean isBatchEmpty()
          Returns whether there is anything to process in the current batch.
 boolean isBatchOn()
          Retrieve the current execution mode.
protected  PigStats launchPlan(PhysicalPlan pp, String jobName)
          A common method for launching the jobs according to the physical plan
 String[] listPaths(String dir)
          List the contents of a directory.
 boolean mkdirs(String dirs)
          Make a directory.
 Iterator<Tuple> openIterator(String id)
          Executes a Pig Latin script up to and including indicated alias.
 void printAliases()
          Intended to be used by unit tests only.
 void printHistory(boolean withNumbers)
           
 void registerCode(String path, String scriptingLang, String namespace)
          Universal Scripting Language Support, see PIG-928
 void registerFunction(String function, FuncSpec funcSpec)
          Defines an alias for the given function spec.
 void registerJar(String name)
          Registers a jar file.
 void registerQuery(String query)
          Register a query with the Pig runtime.
 void registerQuery(String query, int startLine)
          Register a query with the Pig runtime.
 void registerScript(InputStream in)
          Register a pig script from InputStream source which is more general and extensible the pig script can be from local file, then you can use FileInputStream.
 void registerScript(InputStream in, List<String> paramsFiles)
          Register a pig script from InputStream source which is more general and extensible the pig script can be from local file, then you can use FileInputStream.
 void registerScript(InputStream in, Map<String,String> params)
          Register a pig script from InputStream source which is more general and extensible the pig script can be from local file, then you can use FileInputStream.
 void registerScript(InputStream in, Map<String,String> params, List<String> paramsFiles)
          Register a pig script from InputStream.
The pig script can be from local file, then you can use FileInputStream.
 void registerScript(String fileName)
          Register a query with the Pig runtime.
 void registerScript(String fileName, List<String> paramsFiles)
          Register a pig script file.
 void registerScript(String fileName, Map<String,String> params)
          Register a pig script file.
 void registerScript(String fileName, Map<String,String> params, List<String> paramsFiles)
          Register a pig script file.
 void registerStreamingCommand(String commandAlias, StreamingCommand command)
          Defines an alias for the given streaming command.
 boolean renameFile(String source, String target)
          Rename a file.
 void setBatchOn()
          Starts batch execution mode.
 void setDefaultParallel(int p)
          Set the default parallelism for this job
 void setJobName(String name)
          Set the name of the job.
 void setJobPriority(String priority)
          Set Hadoop job priority.
 void setValidateEachStatement(boolean validateEachStatement)
          This can be called to indicate if the query is being parsed/compiled in a mode that expects each statement to be validated as it is entered, instead of just doing it once for whole script.
 void shutdown()
          Reclaims resources used by this instance of PigServer.
 ExecJob store(String id, String filename)
          Executes a Pig Latin script up to and including indicated alias and stores the resulting records into a file.
 ExecJob store(String id, String filename, String func)
          Executes a Pig Latin script up to and including indicated alias and stores the resulting records into a file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

log

protected final org.apache.commons.logging.Log log

PRETTY_PRINT_SCHEMA_PROPERTY

public static final String PRETTY_PRINT_SCHEMA_PROPERTY
See Also:
Constant Field Values

graphs

protected final Deque<PigServer.Graph> graphs

pigContext

protected final PigContext pigContext

scope

protected final String scope
Constructor Detail

PigServer

public PigServer(String execTypeString)
          throws ExecException,
                 IOException
Parameters:
execTypeString - can be 'mapreduce' or 'local'. Local mode will use Hadoop's local job runner to execute the job on the local machine. Mapreduce mode will connect to a cluster to execute the job.
Throws:
ExecException
IOException

PigServer

public PigServer(ExecType execType)
          throws ExecException
Parameters:
execType - execution type to start the engine. Local mode will use Hadoop's local job runner to execute the job on the local machine. Mapreduce mode will connect to a cluster to execute the job.
Throws:
ExecException

PigServer

public PigServer(ExecType execType,
                 Properties properties)
          throws ExecException
Throws:
ExecException

PigServer

public PigServer(PigContext context)
          throws ExecException
Throws:
ExecException

PigServer

public PigServer(PigContext context,
                 boolean connect)
          throws ExecException
Throws:
ExecException
Method Detail

getPigContext

public PigContext getPigContext()

debugOn

public void debugOn()
Set the logging level to DEBUG.


debugOff

public void debugOff()
Set the logging level to the default.


setDefaultParallel

public void setDefaultParallel(int p)
Set the default parallelism for this job

Parameters:
p - default number of reducers to use for this job.

setBatchOn

public void setBatchOn()
Starts batch execution mode.


isBatchOn

public boolean isBatchOn()
Retrieve the current execution mode.

Returns:
true if the execution mode is batch; false otherwise.

isBatchEmpty

public boolean isBatchEmpty()
                     throws FrontendException
Returns whether there is anything to process in the current batch.

Returns:
true if there are no stores to process in the current batch, false otherwise.
Throws:
FrontendException

executeBatch

public List<ExecJob> executeBatch()
                           throws IOException
Submits a batch of Pig commands for execution.

Returns:
list of jobs being executed
Throws:
IOException

getJobs

protected List<ExecJob> getJobs(PigStats stats)
Retrieves a list of Job objects from the PigStats object

Parameters:
stats -
Returns:
A list of ExecJob objects

discardBatch

public void discardBatch()
                  throws FrontendException
Discards a batch of Pig commands.

Throws:
FrontendException

addPathToSkip

public void addPathToSkip(String path)
Add a path to be skipped while automatically shipping binaries for streaming.

Parameters:
path - path to be skipped

registerFunction

public void registerFunction(String function,
                             FuncSpec funcSpec)
Defines an alias for the given function spec. This is useful for functions that require arguments to the constructor.

Parameters:
function - - the new function alias to define.
funcSpec - - the FuncSpec object representing the name of the function class and any arguments to constructor.

registerStreamingCommand

public void registerStreamingCommand(String commandAlias,
                                     StreamingCommand command)
Defines an alias for the given streaming command.

Parameters:
commandAlias - - the new command alias to define
command - - streaming command to be executed

registerJar

public void registerJar(String name)
                 throws IOException
Registers a jar file. Name of the jar file can be an absolute or relative path. If multiple resources are found with the specified name, the first one is registered as returned by getSystemResources. A warning is issued to inform the user.

Parameters:
name - of the jar file to register
Throws:
IOException

registerCode

public void registerCode(String path,
                         String scriptingLang,
                         String namespace)
                  throws IOException
Universal Scripting Language Support, see PIG-928

Parameters:
path - path of the script file
scriptingLang - language keyword or scriptingEngine used to interpret the script
namespace - namespace defined for functions of this script
Throws:
IOException

registerQuery

public void registerQuery(String query,
                          int startLine)
                   throws IOException
Register a query with the Pig runtime. The query is parsed and registered, but it is not executed until it is needed.

Parameters:
query - a Pig Latin expression to be evaluated.
startLine - line number of the query within the whole script
Throws:
IOException

registerQuery

public void registerQuery(String query)
                   throws IOException
Register a query with the Pig runtime. The query is parsed and registered, but it is not executed until it is needed. Equivalent to calling registerQuery(String, int) with startLine set to 1.

Parameters:
query - a Pig Latin expression to be evaluated.
Throws:
IOException

registerScript

public void registerScript(InputStream in)
                    throws IOException
Register a pig script from InputStream source which is more general and extensible the pig script can be from local file, then you can use FileInputStream. or pig script can be in memory which you build it dynamically, the you can use ByteArrayInputStream even pig script can be in remote machine, which you get wrap it as SocketInputStream

Parameters:
in -
Throws:
IOException

registerScript

public void registerScript(InputStream in,
                           Map<String,String> params)
                    throws IOException
Register a pig script from InputStream source which is more general and extensible the pig script can be from local file, then you can use FileInputStream. or pig script can be in memory which you build it dynamically, the you can use ByteArrayInputStream even pig script can be in remote machine, which you get wrap it as SocketInputStream. The parameters in the pig script will be substituted with the values in params

Parameters:
in -
params - the key is the parameter name, and the value is the parameter value
Throws:
IOException

registerScript

public void registerScript(InputStream in,
                           List<String> paramsFiles)
                    throws IOException
Register a pig script from InputStream source which is more general and extensible the pig script can be from local file, then you can use FileInputStream. or pig script can be in memory which you build it dynamically, the you can use ByteArrayInputStream even pig script can be in remote machine, which you get wrap it as SocketInputStream The parameters in the pig script will be substituted with the values in the parameter files

Parameters:
in -
paramsFiles - files which have the parameter setting
Throws:
IOException

registerScript

public void registerScript(InputStream in,
                           Map<String,String> params,
                           List<String> paramsFiles)
                    throws IOException
Register a pig script from InputStream.
The pig script can be from local file, then you can use FileInputStream. Or pig script can be in memory which you build it dynamically, the you can use ByteArrayInputStream Pig script can even be in remote machine, which you get wrap it as SocketInputStream.
The parameters in the pig script will be substituted with the values in the map and the parameter files. The values in params Map will override the value in parameter file if they have the same parameter

Parameters:
in -
params - the key is the parameter name, and the value is the parameter value
paramsFiles - files which have the parameter setting
Throws:
IOException

doParamSubstitution

protected String doParamSubstitution(InputStream in,
                                     Map<String,String> params,
                                     List<String> paramsFiles)
                              throws IOException
Do parameter substitution.

Parameters:
in - The InputStream of file containing Pig Latin to do substitution on.
params - Parameters to use to substitute
paramsFiles - Files to use to do substitution.
Returns:
String containing Pig Latin with substitutions done
Throws:
IOException

getClonedGraph

protected PigServer.Graph getClonedGraph()
                                  throws IOException
Creates a clone of the current DAG

Returns:
A Graph object which is a clone of the current DAG
Throws:
IOException

registerScript

public void registerScript(String fileName)
                    throws IOException
Register a query with the Pig runtime. The query will be read from the indicated file.

Parameters:
fileName - file to read query from.
Throws:
IOException

registerScript

public void registerScript(String fileName,
                           Map<String,String> params)
                    throws IOException
Register a pig script file. The parameters in the file will be substituted with the values in params

Parameters:
fileName - pig script file
params - the key is the parameter name, and the value is the parameter value
Throws:
IOException

registerScript

public void registerScript(String fileName,
                           List<String> paramsFiles)
                    throws IOException
Register a pig script file. The parameters in the file will be substituted with the values in the parameter files

Parameters:
fileName - pig script file
paramsFiles - files which have the parameter setting
Throws:
IOException

registerScript

public void registerScript(String fileName,
                           Map<String,String> params,
                           List<String> paramsFiles)
                    throws IOException
Register a pig script file. The parameters in the file will be substituted with the values in the map and the parameter files The values in params Map will override the value in parameter file if they have the same parameter

Parameters:
fileName - pig script
params - the key is the parameter name, and the value is the parameter value
paramsFiles - files which have the parameter setting
Throws:
IOException

printAliases

public void printAliases()
                  throws FrontendException
Intended to be used by unit tests only. Print a list of all aliases in in the current Pig Latin script. Output is written to System.out.

Throws:
FrontendException

dumpSchema

public Schema dumpSchema(String alias)
                  throws IOException
Write the schema for an alias to System.out.

Parameters:
alias - Alias whose schema will be written out
Returns:
Schema of alias dumped
Throws:
IOException

dumpSchemaNested

public Schema dumpSchemaNested(String alias,
                               String nestedAlias)
                        throws IOException
Write the schema for a nestedAlias to System.out. Denoted by alias::nestedAlias.

Parameters:
alias - Alias whose schema has nestedAlias
nestedAlias - Alias whose schema will be written out
Returns:
Schema of alias dumped
Throws:
IOException

setJobName

public void setJobName(String name)
Set the name of the job. This name will get translated to mapred.job.name.

Parameters:
name - of job

setJobPriority

public void setJobPriority(String priority)
Set Hadoop job priority. This value will get translated to mapred.job.priority.

Parameters:
priority - valid values are found in JobPriority

openIterator

public Iterator<Tuple> openIterator(String id)
                             throws IOException
Executes a Pig Latin script up to and including indicated alias. That is, if a user does:
 PigServer server = new PigServer();
 server.registerQuery("A = load 'foo';");
 server.registerQuery("B = filter A by $0 > 0;");
 server.registerQuery("C = order B by $1;");
 
Then
 server.openIterator("B");
 
filtered but unsorted data will be returned. If instead a user does
 server.openIterator("C");
 
filtered and sorted data will be returned.

Parameters:
id - Alias to open iterator for
Returns:
iterator of tuples returned from the script
Throws:
IOException

store

public ExecJob store(String id,
                     String filename)
              throws IOException
Executes a Pig Latin script up to and including indicated alias and stores the resulting records into a file. That is, if a user does:
 PigServer server = new PigServer();
 server.registerQuery("A = load 'foo';");
 server.registerQuery("B = filter A by $0 > 0;");
 server.registerQuery("C = order B by $1;");
 
Then
 server.store("B", "bar");
 
filtered but unsorted data will be stored to the file bar. If instead a user does
 server.store("C", "bar");
 
filtered and sorted data will be stored to the file bar. Equivalent to calling store(String, String, String) with org.apache.pig.PigStorage as the store function.

Parameters:
id - The alias to store
filename - The file to which to store to
Returns:
ExecJob containing information about this job
Throws:
IOException

store

public ExecJob store(String id,
                     String filename,
                     String func)
              throws IOException
Executes a Pig Latin script up to and including indicated alias and stores the resulting records into a file. That is, if a user does:
 PigServer server = new PigServer();
 server.registerQuery("A = load 'foo';");
 server.registerQuery("B = filter A by $0 > 0;");
 server.registerQuery("C = order B by $1;");
 
Then
 server.store("B", "bar", "mystorefunc");
 
filtered but unsorted data will be stored to the file bar using mystorefunc. If instead a user does
 server.store("C", "bar", "mystorefunc");
 
filtered and sorted data will be stored to the file bar using mystorefunc.

Parameters:
id - The alias to store
filename - The file to which to store to
func - store function to use
Returns:
ExecJob containing information about this job
Throws:
IOException

explain

public void explain(String alias,
                    PrintStream stream)
             throws IOException
Provide information on how a pig query will be executed. For now this information is very developer focussed, and probably not very useful to the average user.

Parameters:
alias - Name of alias to explain.
stream - PrintStream to write explanation to.
Throws:
IOException - if the requested alias cannot be found.

explain

public void explain(String alias,
                    String format,
                    boolean verbose,
                    boolean markAsExecute,
                    PrintStream lps,
                    PrintStream pps,
                    PrintStream eps)
             throws IOException
Provide information on how a pig query will be executed.

Parameters:
alias - Name of alias to explain.
format - Format in which the explain should be printed. If text, then the plan will be printed in plain text. Otherwise, the execution plan will be printed in DOT format.
verbose - Controls the amount of information printed
markAsExecute - When set will treat the explain like a call to execute in the respoect that all the pending stores are marked as complete.
lps - Stream to print the logical tree
pps - Stream to print the physical tree
eps - Stream to print the execution tree
Throws:
IOException - if the requested alias cannot be found.

capacity

public long capacity()
              throws IOException
Returns the unused byte capacity of an HDFS filesystem. This value does not take into account a replication factor, as that can vary from file to file. Thus if you are using this to determine if you data set will fit in the HDFS, you need to divide the result of this call by your specific replication setting.

Returns:
unused byte capacity of the file system.
Throws:
IOException

fileSize

public long fileSize(String filename)
              throws IOException
Returns the length of a file in bytes which exists in the HDFS (accounts for replication).

Parameters:
filename -
Returns:
length of the file in bytes
Throws:
IOException

existsFile

public boolean existsFile(String filename)
                   throws IOException
Test whether a file exists.

Parameters:
filename - to test
Returns:
true if file exists, false otherwise
Throws:
IOException

deleteFile

public boolean deleteFile(String filename)
                   throws IOException
Delete a file.

Parameters:
filename - to delete
Returns:
true
Throws:
IOException

renameFile

public boolean renameFile(String source,
                          String target)
                   throws IOException
Rename a file.

Parameters:
source - file to rename
target - new file name
Returns:
true
Throws:
IOException

mkdirs

public boolean mkdirs(String dirs)
               throws IOException
Make a directory.

Parameters:
dirs - directory to make
Returns:
true
Throws:
IOException

listPaths

public String[] listPaths(String dir)
                   throws IOException
List the contents of a directory.

Parameters:
dir - name of directory to list
Returns:
array of strings, one for each file name
Throws:
IOException

getAliases

public Map<String,LogicalPlan> getAliases()
Return a map containing the logical plan associated with each alias.

Returns:
map

shutdown

public void shutdown()
Reclaims resources used by this instance of PigServer. This method deletes all temporary files generated by the current thread while executing Pig commands.


getAliasKeySet

public Set<String> getAliasKeySet()
Get the set of all current aliases.

Returns:
set

getExamples

public Map<Operator,DataBag> getExamples(String alias)
                                  throws IOException
Throws:
IOException

printHistory

public void printHistory(boolean withNumbers)

launchPlan

protected PigStats launchPlan(PhysicalPlan pp,
                              String jobName)
                       throws ExecException,
                              FrontendException
A common method for launching the jobs according to the physical plan

Parameters:
pp - The physical plan
jobName - A String containing the job name to be used
Returns:
The PigStats object
Throws:
ExecException
FrontendException

setValidateEachStatement

public void setValidateEachStatement(boolean validateEachStatement)
This can be called to indicate if the query is being parsed/compiled in a mode that expects each statement to be validated as it is entered, instead of just doing it once for whole script.

Parameters:
validateEachStatement -


Copyright © 2007-2012 The Apache Software Foundation