org.apache.pig
Class PigServer

java.lang.Object
  extended by org.apache.pig.PigServer
Direct Known Subclasses:
ToolsPigServer

@InterfaceAudience.Public
@InterfaceStability.Stable
public class PigServer
extends Object

A class for Java programs to connect to Pig. Typically a program will create a PigServer instance. The programmer then registers queries using registerQuery() and retrieves results using openIterator() or store(). After doing so, the shutdown() method should be called to free any resources used by the current PigServer instance. Not doing so could result in a memory leak.


Nested Class Summary
protected  class PigServer.Graph
           
 
Field Summary
protected  Deque<PigServer.Graph> graphs
           
protected  org.apache.commons.logging.Log log
           
protected  PigContext pigContext
           
static String PRETTY_PRINT_SCHEMA_PROPERTY
           
protected  String scope
           
 
Constructor Summary
PigServer(ExecType execType)
           
PigServer(ExecType execType, org.apache.hadoop.conf.Configuration conf)
           
PigServer(ExecType execType, Properties properties)
           
PigServer(PigContext context)
           
PigServer(PigContext context, boolean connect)
           
PigServer(Properties properties)
           
PigServer(String execTypeString)
           
 
Method Summary
 void addPathToSkip(String path)
          Add a path to be skipped while automatically shipping binaries for streaming.
 long capacity()
          Returns the unused byte capacity of an HDFS filesystem.
 void debugOff()
          Set the logging level to the default.
 void debugOn()
          Set the logging level to DEBUG.
 boolean deleteFile(String filename)
          Delete a file.
 void discardBatch()
          Discards a batch of Pig commands.
 Schema dumpSchema(String alias)
          Write the schema for an alias to System.out.
 Schema dumpSchemaNested(String alias, String nestedAlias)
          Write the schema for a nestedAlias to System.out.
 List<ExecJob> executeBatch()
          Submits a batch of Pig commands for execution.
 List<ExecJob> executeBatch(boolean parseAndBuild)
          Submits a batch of Pig commands for execution.
 boolean existsFile(String filename)
          Test whether a file exists.
 void explain(String alias, PrintStream stream)
          Provide information on how a pig query will be executed.
 void explain(String alias, String format, boolean verbose, boolean markAsExecute, PrintStream lps, PrintStream eps, File dir, String suffix)
          Provide information on how a pig query will be executed.
 long fileSize(String filename)
          Returns the length of a file in bytes which exists in the HDFS (accounts for replication).
 Map<String,LogicalPlan> getAliases()
          Return a map containing the logical plan associated with each alias.
 Set<String> getAliasKeySet()
          Get the set of all current aliases.
protected  PigServer.Graph getClonedGraph()
          Creates a clone of the current DAG
 PigServer.Graph getCurrentDAG()
          Current DAG
 Map<Operator,DataBag> getExamples(String alias)
           
protected  List<ExecJob> getJobs(PigStats stats)
          Retrieves a list of Job objects from the PigStats object
 String getLastRel()
           
 LogicalPlanData getLogicalPlanData()
          Returns data associated with LogicalPlan.
 PigContext getPigContext()
           
 boolean isBatchEmpty()
          Returns whether there is anything to process in the current batch.
 boolean isBatchOn()
          Retrieve the current execution mode.
protected  PigStats launchPlan(LogicalPlan lp, String jobName)
          A common method for launching the jobs according to the logical plan
 String[] listPaths(String dir)
          List the contents of a directory.
 boolean mkdirs(String dirs)
          Make a directory.
 Iterator<Tuple> openIterator(String id)
          Executes a Pig Latin script up to and including indicated alias.
protected  List<String> paramMapToList(Map<String,String> params)
           
 void parseAndBuild()
          This method parses the scripts and builds the LogicalPlan.
 void printAliases()
          Intended to be used by unit tests only.
 void printHistory(boolean withNumbers)
           
 void registerCode(String path, String scriptingLang, String namespace)
          Universal Scripting Language Support, see PIG-928
 void registerFunction(String function, FuncSpec funcSpec)
          Defines an alias for the given function spec.
 void registerJar(String name)
          Registers a jar file.
 void registerQuery(String query)
          Register a query with the Pig runtime.
 void registerQuery(String query, int startLine)
          Register a query with the Pig runtime.
 void registerScript(InputStream in)
          Register a pig script from InputStream source which is more general and extensible the pig script can be from local file, then you can use FileInputStream.
 void registerScript(InputStream in, List<String> paramsFiles)
          Register a pig script from InputStream source which is more general and extensible the pig script can be from local file, then you can use FileInputStream.
 void registerScript(InputStream in, Map<String,String> params)
          Register a pig script from InputStream source which is more general and extensible the pig script can be from local file, then you can use FileInputStream.
 void registerScript(InputStream in, Map<String,String> params, List<String> paramsFiles)
          Register a pig script from InputStream.
The pig script can be from local file, then you can use FileInputStream.
 void registerScript(String fileName)
          Register a query with the Pig runtime.
 void registerScript(String fileName, List<String> paramsFiles)
          Register a pig script file.
 void registerScript(String fileName, Map<String,String> params)
          Register a pig script file.
 void registerScript(String fileName, Map<String,String> params, List<String> paramsFiles)
          Register a pig script file.
 void registerStreamingCommand(String commandAlias, StreamingCommand command)
          Defines an alias for the given streaming command.
 boolean renameFile(String source, String target)
          Rename a file.
 void setBatchOn()
          Starts batch execution mode.
 void setDefaultParallel(int p)
          Set the default parallelism for this job
 void setJobName(String name)
          Set the name of the job.
 void setJobPriority(String priority)
          Set Hadoop job priority.
 void setSkipParseInRegisterForBatch(boolean skipParseInRegisterForBatch)
          Set whether to skip parsing while registering the query in batch mode
 void setValidateEachStatement(boolean validateEachStatement)
          This can be called to indicate if the query is being parsed/compiled in a mode that expects each statement to be validated as it is entered, instead of just doing it once for whole script.
 void shutdown()
          Reclaims resources used by this instance of PigServer.
 ExecJob store(String id, String filename)
          Executes a Pig Latin script up to and including indicated alias and stores the resulting records into a file.
 ExecJob store(String id, String filename, String func)
          Executes a Pig Latin script up to and including indicated alias and stores the resulting records into a file.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

log

protected final org.apache.commons.logging.Log log

PRETTY_PRINT_SCHEMA_PROPERTY

public static final String PRETTY_PRINT_SCHEMA_PROPERTY
See Also:
Constant Field Values

graphs

protected final Deque<PigServer.Graph> graphs

pigContext

protected final PigContext pigContext

scope

protected final String scope
Constructor Detail

PigServer

public PigServer(String execTypeString)
          throws ExecException,
                 IOException
Parameters:
execTypeString - can be 'mapreduce' or 'local'. Local mode will use Hadoop's local job runner to execute the job on the local machine. Mapreduce mode will connect to a cluster to execute the job. If execTypeString is not one of these two, Pig will deduce the ExecutionEngine if it is on the classpath and use it for the backend execution.
Throws:
ExecException
IOException

PigServer

public PigServer(Properties properties)
          throws ExecException,
                 IOException
Throws:
ExecException
IOException

PigServer

public PigServer(ExecType execType)
          throws ExecException
Parameters:
execType - execution type to start the engine. Local mode will use Hadoop's local job runner to execute the job on the local machine. Mapreduce mode will connect to a cluster to execute the job.
Throws:
ExecException

PigServer

public PigServer(ExecType execType,
                 Properties properties)
          throws ExecException
Throws:
ExecException

PigServer

public PigServer(ExecType execType,
                 org.apache.hadoop.conf.Configuration conf)
          throws ExecException
Throws:
ExecException

PigServer

public PigServer(PigContext context)
          throws ExecException
Throws:
ExecException

PigServer

public PigServer(PigContext context,
                 boolean connect)
          throws ExecException
Throws:
ExecException
Method Detail

getPigContext

public PigContext getPigContext()

getCurrentDAG

public PigServer.Graph getCurrentDAG()
Current DAG

Returns:

debugOn

public void debugOn()
Set the logging level to DEBUG.


debugOff

public void debugOff()
Set the logging level to the default.


setDefaultParallel

public void setDefaultParallel(int p)
Set the default parallelism for this job

Parameters:
p - default number of reducers to use for this job.

setBatchOn

public void setBatchOn()
Starts batch execution mode.


isBatchOn

public boolean isBatchOn()
Retrieve the current execution mode.

Returns:
true if the execution mode is batch; false otherwise.

isBatchEmpty

public boolean isBatchEmpty()
                     throws FrontendException
Returns whether there is anything to process in the current batch.

Returns:
true if there are no stores to process in the current batch, false otherwise.
Throws:
FrontendException

parseAndBuild

public void parseAndBuild()
                   throws IOException
This method parses the scripts and builds the LogicalPlan. This method should be followed by executeBatch(boolean) with argument as false. Do Not use executeBatch() after calling this method as that will re-parse and build the script.

Throws:
IOException

executeBatch

public List<ExecJob> executeBatch()
                           throws IOException
Submits a batch of Pig commands for execution.

Returns:
list of jobs being executed
Throws:
IOException

executeBatch

public List<ExecJob> executeBatch(boolean parseAndBuild)
                           throws IOException
Submits a batch of Pig commands for execution. Parse and build of script should be skipped if user called parseAndBuild() before. Pass false as an argument in which case.

Parameters:
parseAndBuild -
Returns:
Throws:
IOException

getJobs

protected List<ExecJob> getJobs(PigStats stats)
Retrieves a list of Job objects from the PigStats object

Parameters:
stats -
Returns:
A list of ExecJob objects

discardBatch

public void discardBatch()
                  throws FrontendException
Discards a batch of Pig commands.

Throws:
FrontendException

addPathToSkip

public void addPathToSkip(String path)
Add a path to be skipped while automatically shipping binaries for streaming.

Parameters:
path - path to be skipped

registerFunction

public void registerFunction(String function,
                             FuncSpec funcSpec)
Defines an alias for the given function spec. This is useful for functions that require arguments to the constructor.

Parameters:
function - - the new function alias to define.
funcSpec - - the FuncSpec object representing the name of the function class and any arguments to constructor.

registerStreamingCommand

public void registerStreamingCommand(String commandAlias,
                                     StreamingCommand command)
Defines an alias for the given streaming command.

Parameters:
commandAlias - - the new command alias to define
command - - streaming command to be executed

registerJar

public void registerJar(String name)
                 throws IOException
Registers a jar file. Name of the jar file can be an absolute or relative path. If multiple resources are found with the specified name, the first one is registered as returned by getSystemResources. A warning is issued to inform the user.

Parameters:
name - of the jar file to register
Throws:
IOException

registerCode

public void registerCode(String path,
                         String scriptingLang,
                         String namespace)
                  throws IOException
Universal Scripting Language Support, see PIG-928

Parameters:
path - path of the script file
scriptingLang - language keyword or scriptingEngine used to interpret the script
namespace - namespace defined for functions of this script
Throws:
IOException

registerQuery

public void registerQuery(String query,
                          int startLine)
                   throws IOException
Register a query with the Pig runtime. The query is parsed and registered, but it is not executed until it is needed.

Parameters:
query - a Pig Latin expression to be evaluated.
startLine - line number of the query within the whole script
Throws:
IOException

registerQuery

public void registerQuery(String query)
                   throws IOException
Register a query with the Pig runtime. The query is parsed and registered, but it is not executed until it is needed. Equivalent to calling registerQuery(String, int) with startLine set to 1.

Parameters:
query - a Pig Latin expression to be evaluated.
Throws:
IOException

registerScript

public void registerScript(InputStream in)
                    throws IOException
Register a pig script from InputStream source which is more general and extensible the pig script can be from local file, then you can use FileInputStream. or pig script can be in memory which you build it dynamically, the you can use ByteArrayInputStream even pig script can be in remote machine, which you get wrap it as SocketInputStream

Parameters:
in -
Throws:
IOException

registerScript

public void registerScript(InputStream in,
                           Map<String,String> params)
                    throws IOException
Register a pig script from InputStream source which is more general and extensible the pig script can be from local file, then you can use FileInputStream. or pig script can be in memory which you build it dynamically, the you can use ByteArrayInputStream even pig script can be in remote machine, which you get wrap it as SocketInputStream. The parameters in the pig script will be substituted with the values in params

Parameters:
in -
params - the key is the parameter name, and the value is the parameter value
Throws:
IOException

registerScript

public void registerScript(InputStream in,
                           List<String> paramsFiles)
                    throws IOException
Register a pig script from InputStream source which is more general and extensible the pig script can be from local file, then you can use FileInputStream. or pig script can be in memory which you build it dynamically, the you can use ByteArrayInputStream even pig script can be in remote machine, which you get wrap it as SocketInputStream The parameters in the pig script will be substituted with the values in the parameter files

Parameters:
in -
paramsFiles - files which have the parameter setting
Throws:
IOException

registerScript

public void registerScript(InputStream in,
                           Map<String,String> params,
                           List<String> paramsFiles)
                    throws IOException
Register a pig script from InputStream.
The pig script can be from local file, then you can use FileInputStream. Or pig script can be in memory which you build it dynamically, the you can use ByteArrayInputStream Pig script can even be in remote machine, which you get wrap it as SocketInputStream.
The parameters in the pig script will be substituted with the values in the map and the parameter files. The values in params Map will override the value in parameter file if they have the same parameter

Parameters:
in -
params - the key is the parameter name, and the value is the parameter value
paramsFiles - files which have the parameter setting
Throws:
IOException

paramMapToList

protected List<String> paramMapToList(Map<String,String> params)

getClonedGraph

protected PigServer.Graph getClonedGraph()
                                  throws IOException
Creates a clone of the current DAG

Returns:
A Graph object which is a clone of the current DAG
Throws:
IOException

registerScript

public void registerScript(String fileName)
                    throws IOException
Register a query with the Pig runtime. The query will be read from the indicated file.

Parameters:
fileName - file to read query from.
Throws:
IOException

registerScript

public void registerScript(String fileName,
                           Map<String,String> params)
                    throws IOException
Register a pig script file. The parameters in the file will be substituted with the values in params

Parameters:
fileName - pig script file
params - the key is the parameter name, and the value is the parameter value
Throws:
IOException

registerScript

public void registerScript(String fileName,
                           List<String> paramsFiles)
                    throws IOException
Register a pig script file. The parameters in the file will be substituted with the values in the parameter files

Parameters:
fileName - pig script file
paramsFiles - files which have the parameter setting
Throws:
IOException

registerScript

public void registerScript(String fileName,
                           Map<String,String> params,
                           List<String> paramsFiles)
                    throws IOException
Register a pig script file. The parameters in the file will be substituted with the values in the map and the parameter files The values in params Map will override the value in parameter file if they have the same parameter

Parameters:
fileName - pig script
params - the key is the parameter name, and the value is the parameter value
paramsFiles - files which have the parameter setting
Throws:
IOException

printAliases

public void printAliases()
                  throws FrontendException
Intended to be used by unit tests only. Print a list of all aliases in in the current Pig Latin script. Output is written to System.out.

Throws:
FrontendException

dumpSchema

public Schema dumpSchema(String alias)
                  throws IOException
Write the schema for an alias to System.out.

Parameters:
alias - Alias whose schema will be written out
Returns:
Schema of alias dumped
Throws:
IOException

dumpSchemaNested

public Schema dumpSchemaNested(String alias,
                               String nestedAlias)
                        throws IOException
Write the schema for a nestedAlias to System.out. Denoted by alias::nestedAlias.

Parameters:
alias - Alias whose schema has nestedAlias
nestedAlias - Alias whose schema will be written out
Returns:
Schema of alias dumped
Throws:
IOException

setJobName

public void setJobName(String name)
Set the name of the job. This name will get translated to mapred.job.name.

Parameters:
name - of job

setJobPriority

public void setJobPriority(String priority)
Set Hadoop job priority. This value will get translated to mapred.job.priority.

Parameters:
priority - valid values are found in JobPriority

openIterator

public Iterator<Tuple> openIterator(String id)
                             throws IOException
Executes a Pig Latin script up to and including indicated alias. That is, if a user does:
 PigServer server = new PigServer();
 server.registerQuery("A = load 'foo';");
 server.registerQuery("B = filter A by $0 > 0;");
 server.registerQuery("C = order B by $1;");
 
Then
 server.openIterator("B");
 
filtered but unsorted data will be returned. If instead a user does
 server.openIterator("C");
 
filtered and sorted data will be returned.

Parameters:
id - Alias to open iterator for
Returns:
iterator of tuples returned from the script
Throws:
IOException

store

public ExecJob store(String id,
                     String filename)
              throws IOException
Executes a Pig Latin script up to and including indicated alias and stores the resulting records into a file. That is, if a user does:
 PigServer server = new PigServer();
 server.registerQuery("A = load 'foo';");
 server.registerQuery("B = filter A by $0 > 0;");
 server.registerQuery("C = order B by $1;");
 
Then
 server.store("B", "bar");
 
filtered but unsorted data will be stored to the file bar. If instead a user does
 server.store("C", "bar");
 
filtered and sorted data will be stored to the file bar. Equivalent to calling store(String, String, String) with org.apache.pig.PigStorage as the store function.

Parameters:
id - The alias to store
filename - The file to which to store to
Returns:
ExecJob containing information about this job
Throws:
IOException

store

public ExecJob store(String id,
                     String filename,
                     String func)
              throws IOException
Executes a Pig Latin script up to and including indicated alias and stores the resulting records into a file. That is, if a user does:
 PigServer server = new PigServer();
 server.registerQuery("A = load 'foo';");
 server.registerQuery("B = filter A by $0 > 0;");
 server.registerQuery("C = order B by $1;");
 
Then
 server.store("B", "bar", "mystorefunc");
 
filtered but unsorted data will be stored to the file bar using mystorefunc. If instead a user does
 server.store("C", "bar", "mystorefunc");
 
filtered and sorted data will be stored to the file bar using mystorefunc.

Parameters:
id - The alias to store
filename - The file to which to store to
func - store function to use
Returns:
ExecJob containing information about this job
Throws:
IOException

explain

public void explain(String alias,
                    PrintStream stream)
             throws IOException
Provide information on how a pig query will be executed. For now this information is very developer focussed, and probably not very useful to the average user.

Parameters:
alias - Name of alias to explain.
stream - PrintStream to write explanation to.
Throws:
IOException - if the requested alias cannot be found.

explain

public void explain(String alias,
                    String format,
                    boolean verbose,
                    boolean markAsExecute,
                    PrintStream lps,
                    PrintStream eps,
                    File dir,
                    String suffix)
             throws IOException
Provide information on how a pig query will be executed.

Parameters:
alias - Name of alias to explain.
format - Format in which the explain should be printed. If text, then the plan will be printed in plain text. Otherwise, the execution plan will be printed in DOT format.
verbose - Controls the amount of information printed
markAsExecute - When set will treat the explain like a call to execute in the respoect that all the pending stores are marked as complete.
lps - Stream to print the logical tree
eps - Stream to print the ExecutionEngine trees. If null, then will print to files
dir - Directory to print ExecutionEngine trees. If null, will use eps
suffix - Suffix of file names
Throws:
IOException - if the requested alias cannot be found.

capacity

public long capacity()
              throws IOException
Returns the unused byte capacity of an HDFS filesystem. This value does not take into account a replication factor, as that can vary from file to file. Thus if you are using this to determine if you data set will fit in the HDFS, you need to divide the result of this call by your specific replication setting.

Returns:
unused byte capacity of the file system.
Throws:
IOException

fileSize

public long fileSize(String filename)
              throws IOException
Returns the length of a file in bytes which exists in the HDFS (accounts for replication).

Parameters:
filename -
Returns:
length of the file in bytes
Throws:
IOException

existsFile

public boolean existsFile(String filename)
                   throws IOException
Test whether a file exists.

Parameters:
filename - to test
Returns:
true if file exists, false otherwise
Throws:
IOException

deleteFile

public boolean deleteFile(String filename)
                   throws IOException
Delete a file.

Parameters:
filename - to delete
Returns:
true
Throws:
IOException

renameFile

public boolean renameFile(String source,
                          String target)
                   throws IOException
Rename a file.

Parameters:
source - file to rename
target - new file name
Returns:
true
Throws:
IOException

mkdirs

public boolean mkdirs(String dirs)
               throws IOException
Make a directory.

Parameters:
dirs - directory to make
Returns:
true
Throws:
IOException

listPaths

public String[] listPaths(String dir)
                   throws IOException
List the contents of a directory.

Parameters:
dir - name of directory to list
Returns:
array of strings, one for each file name
Throws:
IOException

getAliases

public Map<String,LogicalPlan> getAliases()
Return a map containing the logical plan associated with each alias.

Returns:
map

shutdown

public void shutdown()
Reclaims resources used by this instance of PigServer. This method deletes all temporary files generated by the current thread while executing Pig commands.


getAliasKeySet

public Set<String> getAliasKeySet()
Get the set of all current aliases.

Returns:
set

getExamples

public Map<Operator,DataBag> getExamples(String alias)
                                  throws IOException
Throws:
IOException

printHistory

public void printHistory(boolean withNumbers)

launchPlan

protected PigStats launchPlan(LogicalPlan lp,
                              String jobName)
                       throws ExecException,
                              FrontendException
A common method for launching the jobs according to the logical plan

Parameters:
lp - The logical plan
jobName - A String containing the job name to be used
Returns:
The PigStats object
Throws:
ExecException
FrontendException

getLogicalPlanData

public LogicalPlanData getLogicalPlanData()
Returns data associated with LogicalPlan. It makes sense to call this method only after a query/script has been registered with one of the registerQuery(String) or registerScript(InputStream) methods.

Returns:
LogicalPlanData

setValidateEachStatement

public void setValidateEachStatement(boolean validateEachStatement)
This can be called to indicate if the query is being parsed/compiled in a mode that expects each statement to be validated as it is entered, instead of just doing it once for whole script.

Parameters:
validateEachStatement -

setSkipParseInRegisterForBatch

public void setSkipParseInRegisterForBatch(boolean skipParseInRegisterForBatch)
Set whether to skip parsing while registering the query in batch mode

Parameters:
skipParseInRegisterForBatch -

getLastRel

public String getLastRel()


Copyright © 2007-2012 The Apache Software Foundation