org.apache.pig.data
Class InternalDistinctBag

java.lang.Object
  extended by org.apache.pig.data.DefaultAbstractBag
      extended by org.apache.pig.data.SelfSpillBag
          extended by org.apache.pig.data.SortedSpillBag
              extended by org.apache.pig.data.InternalDistinctBag
All Implemented Interfaces:
Serializable, Comparable, Iterable<Tuple>, org.apache.hadoop.io.Writable, org.apache.hadoop.io.WritableComparable, DataBag, Spillable

@InterfaceAudience.Private
@InterfaceStability.Evolving
public class InternalDistinctBag
extends SortedSpillBag

An unordered collection of Tuples with no duplicates. Duplicates are dropped as data comes in. The data is stored in a HashSet; when it is time to spill, it is copied into an ArrayList, sorted, and written to disk. Despite all these machinations, this was found to be faster than storing the data in a TreeSet. This bag spills proactively when the number of tuples in memory reaches a limit.
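
The store-then-sort strategy described above can be sketched with plain Java collections. This is a minimal illustration and not Pig's implementation: the class DistinctSortSketch and its method names are hypothetical, and Strings stand in for Tuples.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not part of Pig: Strings stand in for Tuples.
public class DistinctSortSketch {
    // A HashSet drops duplicates on arrival and keeps no particular order.
    private final Set<String> contents = new HashSet<>();

    public void add(String t) {
        contents.add(t);
    }

    public int size() {
        return contents.size();
    }

    // At spill time, copy the set into an ArrayList and sort it once,
    // instead of paying for ordered insertion on every add (as a TreeSet would).
    public List<String> sortedSnapshot() {
        List<String> sorted = new ArrayList<>(contents);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        DistinctSortSketch bag = new DistinctSortSketch();
        for (String s : new String[] {"b", "a", "b", "c", "a"}) {
            bag.add(s);
        }
        System.out.println(bag.sortedSnapshot()); // prints [a, b, c]
    }
}
```

The sort cost is paid only when a spill actually happens, which is one plausible reason this approach could outperform a TreeSet that keeps its elements ordered on every insertion.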

See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.pig.data.SelfSpillBag
SelfSpillBag.MemoryLimits
 
Nested classes/interfaces inherited from class org.apache.pig.data.DefaultAbstractBag
DefaultAbstractBag.BagDelimiterTuple, DefaultAbstractBag.EndBag, DefaultAbstractBag.StartBag
 
Field Summary
 
Fields inherited from class org.apache.pig.data.SelfSpillBag
memLimit
 
Fields inherited from class org.apache.pig.data.DefaultAbstractBag
endBag, MAX_SPILL_FILES, mContents, mSize, mSpillFiles, startBag
 
Constructor Summary
InternalDistinctBag()
           
InternalDistinctBag(int bagCount)
           
InternalDistinctBag(int bagCount, float percent)
           
 
Method Summary
 void add(Tuple t)
          Add a tuple to the bag.
 boolean isDistinct()
          Find out if the bag is distinct.
 boolean isSorted()
          Find out if the bag is sorted.
 Iterator<Tuple> iterator()
          Get an iterator to the bag.
 long size()
          Get the number of elements in the bag, both in memory and on disk.
 long spill()
          Instructs an object to spill whatever it can to disk and release references to any data structures it spills.
 
Methods inherited from class org.apache.pig.data.SortedSpillBag
proactive_spill
 
Methods inherited from class org.apache.pig.data.DefaultAbstractBag
addAll, addAll, addAll, clear, compareTo, equals, getMemorySize, getSpillFile, hashCode, incSpillCount, incSpillCount, markSpillableIfNecessary, markStale, readFields, reportProgress, sampleContents, toString, warn, write
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

InternalDistinctBag

public InternalDistinctBag()

InternalDistinctBag

public InternalDistinctBag(int bagCount)

InternalDistinctBag

public InternalDistinctBag(int bagCount,
                           float percent)

Method Detail

isSorted

public boolean isSorted()
Description copied from interface: DataBag
Find out if the bag is sorted.

Returns:
true if this is a sorted data bag, false otherwise.

isDistinct

public boolean isDistinct()
Description copied from interface: DataBag
Find out if the bag is distinct.

Returns:
true if the bag is a distinct bag, false otherwise.

size

public long size()
Description copied from class: DefaultAbstractBag
Get the number of elements in the bag, both in memory and on disk.

Specified by:
size in interface DataBag
Overrides:
size in class DefaultAbstractBag
Returns:
number of elements in the bag

iterator

public Iterator<Tuple> iterator()
Description copied from interface: DataBag
Get an iterator to the bag. For default and distinct bags, no particular order is guaranteed. For sorted bags the order is guaranteed to be sorted according to the provided comparator.

Returns:
tuple iterator

add

public void add(Tuple t)
Description copied from class: DefaultAbstractBag
Add a tuple to the bag.

Specified by:
add in interface DataBag
Overrides:
add in class DefaultAbstractBag
Parameters:
t - tuple to add.

spill

public long spill()
Description copied from interface: Spillable
Instructs an object to spill whatever it can to disk and release references to any data structures it spills.

Returns:
number of objects spilled.
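
The spill contract, together with the proactive spilling mentioned in the class description, can be sketched as follows. This is an illustration under stated assumptions, not Pig's code: the class SpillSketch, its limit parameter, and its accessor methods are hypothetical, Strings stand in for Tuples, and the sketch does not remove duplicates that land in different spill files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not part of Pig: Strings stand in for Tuples.
public class SpillSketch {
    private final Set<String> inMemory = new HashSet<>();
    private final List<Path> spillFiles = new ArrayList<>();
    private final int limit; // assumed cap on the number of in-memory elements

    public SpillSketch(int limit) {
        this.limit = limit;
    }

    // Spill proactively as soon as the in-memory count reaches the limit.
    public void add(String t) {
        inMemory.add(t);
        if (inMemory.size() >= limit) {
            spill();
        }
    }

    // Sort the in-memory data, write it to a temporary file, release the
    // in-memory references, and return the number of objects spilled.
    public long spill() {
        if (inMemory.isEmpty()) {
            return 0;
        }
        List<String> sorted = new ArrayList<>(inMemory);
        Collections.sort(sorted);
        try {
            Path file = Files.createTempFile("spill-sketch", ".txt");
            file.toFile().deleteOnExit();
            Files.write(file, sorted); // one element per line, UTF-8
            spillFiles.add(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        inMemory.clear();
        return sorted.size();
    }

    public int inMemoryCount() {
        return inMemory.size();
    }

    public int spillFileCount() {
        return spillFiles.size();
    }

    // Read one spill file back as a list of lines, in the order written.
    public List<String> readSpill(int index) {
        try {
            return Files.readAllLines(spillFiles.get(index));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because each spill file is written in sorted order, a reader could merge the files and skip equal neighbours to restore distinctness across spills; that step is left out of the sketch.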


Copyright © 2007-2012 The Apache Software Foundation