org.apache.pig.builtin
Class Bloom

java.lang.Object
  extended by org.apache.pig.EvalFunc<Boolean>
      extended by org.apache.pig.FilterFunc
          extended by org.apache.pig.builtin.Bloom

public class Bloom
extends FilterFunc

Uses a Bloom filter previously built by BuildBloom. You would first build the Bloom filter in a group all job. For example:

    define bb BuildBloom('jenkins', '100', '0.1');
    A = load 'foo' as (x, y);
    B = group A all;
    C = foreach B generate bb(A.x);
    store C into 'mybloom';

The Bloom filter can cover multiple keys by passing more than one field (or the entire bag) to BuildBloom. The resulting file can then be used in a Bloom filter as:

    define bloom Bloom('mybloom');
    A = load 'foo' as (x, y);
    B = load 'bar' as (z);
    C = filter B by bloom(z);
    D = join C by z, A by x;

The implementation uses org.apache.hadoop.util.bloom.BloomFilter.
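To make the filtering step more concrete, the following is a minimal sketch of a Bloom filter in plain Java. It is not the Hadoop BloomFilter class that Bloom relies on: the class name SimpleBloom, its methods, the FNV-1a based hashing, and the sizing formulas are all illustrative assumptions. It also assumes that the BuildBloom arguments '100' and '0.1' denote the expected number of elements and the desired false positive rate.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical stand-in for illustration only; not Hadoop's BloomFilter. */
public class SimpleBloom {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    /** Sizes the filter with the textbook formulas (an assumption, not Pig's code). */
    public SimpleBloom(int expectedElements, double falsePositiveRate) {
        double ln2 = Math.log(2);
        // m = -n * ln(p) / (ln 2)^2
        this.numBits = (int) Math.ceil(
                -expectedElements * Math.log(falsePositiveRate) / (ln2 * ln2));
        // k = (m / n) * ln 2, with at least one hash function
        this.numHashes = Math.max(1,
                (int) Math.round((double) numBits / expectedElements * ln2));
        this.bits = new BitSet(numBits);
    }

    public int numBits() { return numBits; }

    public int numHashes() { return numHashes; }

    /** Sets one bit per hash function for the given key. */
    public void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    /** Returns false only if the key was definitely never added. */
    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int index(String key, int i) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // force an odd step
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // FNV-1a style hash with a caller-supplied starting value.
    private static int fnv1a(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }
}
```

A filter of this kind never reports a key it has seen as absent, but it may report an unseen key as present. That is why the example still joins C with A after filtering: the filter only discards rows of B that cannot match, and the join removes any false positives.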


Field Summary
 org.apache.hadoop.util.bloom.BloomFilter filter
           
 
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
 
Constructor Summary
Bloom(String filename)
           
 
Method Summary
 Boolean exec(Tuple input)
          This callback method must be implemented by all subclasses.
 List<String> getCacheFiles()
          Allow a UDF to specify a list of files it would like placed in the distributed cache.
 void setFilter(DataByteArray dba)
          For testing only; do not use directly.
 
Methods inherited from class org.apache.pig.FilterFunc
finish
 
Methods inherited from class org.apache.pig.EvalFunc
getArgToFuncMapping, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, outputSchema, progress, setPigLogger, setReporter, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

filter

public org.apache.hadoop.util.bloom.BloomFilter filter
Constructor Detail

Bloom

public Bloom(String filename)
Parameters:
filename - file containing the serialized Bloom filter
Method Detail

exec

public Boolean exec(Tuple input)
             throws IOException
Description copied from class: EvalFunc
This callback method must be implemented by all subclasses. This is the method that will be invoked on every Tuple of a given dataset. Since the dataset may be divided up in a variety of ways the programmer should not make assumptions about state that is maintained between invocations of this method.

Specified by:
exec in class EvalFunc<Boolean>
Parameters:
input - the Tuple to be processed.
Returns:
result, of type T (Boolean for this class).
Throws:
IOException

getCacheFiles

public List<String> getCacheFiles()
Description copied from class: EvalFunc
Allow a UDF to specify a list of files it would like placed in the distributed cache. These files will be put in the cache for every job the UDF is used in. The default implementation returns null.

Overrides:
getCacheFiles in class EvalFunc<Boolean>
Returns:
A list of files

setFilter

public void setFilter(DataByteArray dba)
               throws IOException
For testing only; do not use directly.

Throws:
IOException


Copyright © 2007-2012 The Apache Software Foundation