org.apache.pig.data
Interface DataBag

All Superinterfaces:
Comparable, Iterable<Tuple>, Serializable, Spillable, org.apache.hadoop.io.Writable, org.apache.hadoop.io.WritableComparable
All Known Implementing Classes:
AccumulativeBag, DefaultAbstractBag, DefaultDataBag, DistinctDataBag, InternalCachedBag, InternalDistinctBag, InternalSortedBag, NonSpillableDataBag, ReadOnceBag, SingleTupleBag, SortedDataBag, SortedSpillBag

@InterfaceAudience.Public
@InterfaceStability.Stable
public interface DataBag
extends Spillable, org.apache.hadoop.io.WritableComparable, Iterable<Tuple>, Serializable

A collection of Tuples. A DataBag may or may not fit into memory. DataBag extends Spillable, which means that it registers with a memory manager. By default, it attempts to keep all of its contents in memory. If the memory manager asks it to spill to disk (by calling spill()), it opens a spill file and writes out whatever it currently holds in memory. This may happen multiple times. The bag tracks all of the files it has spilled to.
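The spill cycle described above can be sketched in plain Java. This is a simplified model under stated assumptions, not Pig's implementation: the class name SpillableBagSketch, its methods, and the use of strings in place of Tuples are all hypothetical, and records are assumed to contain no line breaks.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a spillable bag; not Pig's implementation.
class SpillableBagSketch {
    private final List<String> inMemory = new ArrayList<>();  // stand-in for tuples held in memory
    private final List<Path> spillFiles = new ArrayList<>();  // every file this bag has spilled to
    private long spilledCount = 0;                            // records currently on disk

    void add(String record) {
        inMemory.add(record);
    }

    // Writes the in-memory contents to a new spill file and releases the memory.
    // Returns the number of records spilled by this call.
    long spill() {
        if (inMemory.isEmpty()) {
            return 0;
        }
        try {
            Path file = Files.createTempFile("bag-spill", ".txt");
            file.toFile().deleteOnExit();
            Files.write(file, inMemory, StandardCharsets.UTF_8); // one record per line
            spillFiles.add(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        long n = inMemory.size();
        spilledCount += n;
        inMemory.clear();
        return n;
    }

    // Counts records both on disk and in memory.
    long size() {
        return spilledCount + inMemory.size();
    }

    int spillFileCount() {
        return spillFiles.size();
    }

    // Reads the spill files in the order they were written, then the in-memory records.
    List<String> readAll() {
        List<String> out = new ArrayList<>();
        try {
            for (Path file : spillFiles) {
                out.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        out.addAll(inMemory);
        return out;
    }

    // Discards everything, both on disk and in memory.
    void clear() {
        try {
            for (Path file : spillFiles) {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        spillFiles.clear();
        spilledCount = 0;
        inMemory.clear();
    }

    public static void main(String[] args) {
        SpillableBagSketch bag = new SpillableBagSketch();
        bag.add("a");
        bag.add("b");
        bag.spill();           // first spill file
        bag.add("c");
        bag.spill();           // second spill file
        bag.add("d");          // stays in memory
        System.out.println(bag.size() + " records, " + bag.spillFileCount() + " spill files");
        System.out.println(bag.readAll());
        bag.clear();
    }
}
```

In this sketch the caller decides when to spill; in Pig that decision belongs to the memory manager.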

DataBag provides an Iterator interface that allows callers to read through the contents. The iterators are aware of data spilling: they must handle reading from files, as well as the possibility that data they were reading from memory has been spilled to disk underneath them.

The DataBag interface assumes that all data is written before any is read. That is, a DataBag cannot be used as a queue. If data is written after data is read, the results are undefined. This condition is not checked on each add or read, for reasons of speed. Caveat emptor.
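DataBag performs no such check itself. Purely as an illustration of the contract, the following sketch shows a hypothetical wrapper (WriteThenReadBagSketch, not part of Pig) that rejects writes once reading has begun, which may be useful in tests or while debugging. Strings stand in for Tuples.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of Pig: makes the "write everything, then read"
// contract explicit by failing fast on a write that follows a read.
class WriteThenReadBagSketch implements Iterable<String> {
    private final List<String> contents = new ArrayList<>();
    private boolean readStarted = false;

    void add(String record) {
        if (readStarted) {
            throw new IllegalStateException("add() called after reading started");
        }
        contents.add(record);
    }

    @Override
    public Iterator<String> iterator() {
        readStarted = true; // from here on, the bag is treated as read-only
        return contents.iterator();
    }

    public static void main(String[] args) {
        WriteThenReadBagSketch bag = new WriteThenReadBagSketch();
        bag.add("x");
        bag.add("y");
        for (String record : bag) {
            System.out.println(record);
        }
        try {
            bag.add("z"); // violates the contract
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```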

Since spills are asynchronous (the memory manager requesting a spill runs in a separate thread), all operations dealing with the mContents Collection (which is the collection of tuples contained in the bag) have to be synchronized. This means that reading from a DataBag is currently serialized. This is acceptable for the moment because Pig execution is currently single threaded. A ReadWriteLock was experimented with, but it was found to be about 10x slower than using the synchronized keyword. If Pig changes its execution model to be multithreaded, we may need to return to this issue, as synchronizing reads will most likely defeat the purpose of multithreaded execution.
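A minimal sketch of this locking scheme, assuming a hypothetical class SynchronizedBagSketch in which integers stand in for Tuples and a counter stands in for the spill files. Every method that touches the contents list is synchronized, so a spill requested from another thread cannot interleave with an add.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Pig's implementation: all access to the contents
// list goes through synchronized methods on the bag.
class SynchronizedBagSketch {
    private final List<Integer> contents = new ArrayList<>(); // plays the role of mContents
    private long spilled = 0;                                 // records moved "to disk"

    synchronized void add(int value) {
        contents.add(value);
    }

    // Called from the thread that plays the memory manager.
    synchronized long spill() {
        long n = contents.size();
        spilled += n;       // a real bag would write these records to a file here
        contents.clear();
        return n;
    }

    synchronized long size() {
        return spilled + contents.size();
    }

    // Adds `total` records while a second thread spills continuously,
    // then returns the final size of the bag.
    static long runDemo(int total) {
        SynchronizedBagSketch bag = new SynchronizedBagSketch();
        Thread spiller = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                bag.spill();
                Thread.yield();
            }
        });
        spiller.setDaemon(true);
        spiller.start();
        for (int i = 0; i < total; i++) {
            bag.add(i);
        }
        spiller.interrupt();
        try {
            spiller.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return bag.size();
    }

    public static void main(String[] args) {
        // No record is lost, however the adds and spills interleave.
        System.out.println(runDemo(100_000));
    }
}
```

Because the lock is exclusive, concurrent readers would also wait on each other, which is the cost the paragraph above refers to.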

DataBags come in several types: default, sorted, and distinct. The type must be chosen up front; there is no way to convert a bag on the fly. Default data bags do not guarantee any particular order of retrieval for the tuples and may contain duplicate tuples. Sorted data bags guarantee that tuples will be retrieved in order, where "in order" is defined either by the default comparator for Tuple or by the comparator provided by the caller when the bag was created. Sorted bags may contain duplicates. Distinct bags do not guarantee any particular order of retrieval, but do guarantee that they will not contain duplicate tuples.
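The three sets of guarantees can be illustrated with standard Java collections. This is only an analogy under stated assumptions: BagTypesSketch and its methods are hypothetical, integers stand in for Tuples, and nothing here reflects how Pig's bag classes are implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;

// Hypothetical analogy using JDK collections; not Pig's bag classes.
class BagTypesSketch {
    // "Default": duplicates are kept, and callers should not rely on any order.
    static Collection<Integer> defaultBag(List<Integer> input) {
        return new ArrayList<>(input);
    }

    // "Sorted": duplicates are kept, and iteration follows the comparator.
    static List<Integer> sortedBag(List<Integer> input, Comparator<Integer> comparator) {
        List<Integer> out = new ArrayList<>(input);
        out.sort(comparator);
        return out;
    }

    // "Distinct": duplicates are dropped, and callers should not rely on any order.
    static Collection<Integer> distinctBag(List<Integer> input) {
        return new HashSet<>(input);
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(3, 1, 2, 3);
        System.out.println(defaultBag(input).size());                          // 4
        System.out.println(sortedBag(input, Comparator.naturalOrder()));       // [1, 2, 3, 3]
        System.out.println(sortedBag(input, Comparator.reverseOrder()));       // [3, 3, 2, 1]
        System.out.println(distinctBag(input).size());                         // 3
    }
}
```

As with the bags themselves, the choice is made when the collection is created; each helper above builds a new collection rather than converting an existing one in place.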


Method Summary
 void add(Tuple t)
          Add a tuple to the bag.
 void addAll(DataBag b)
          Add contents of a bag to the bag.
 void clear()
          Clear out the contents of the bag, both on disk and in memory.
 boolean isDistinct()
          Find out if the bag is distinct.
 boolean isSorted()
          Find out if the bag is sorted.
 Iterator<Tuple> iterator()
          Get an iterator to the bag.
 void markStale(boolean stale)
          This is used by FuncEvalSpec.FakeDataBag.
 long size()
          Get the number of elements in the bag, both in memory and on disk.
 
Methods inherited from interface org.apache.pig.impl.util.Spillable
getMemorySize, spill
 
Methods inherited from interface org.apache.hadoop.io.Writable
readFields, write
 
Methods inherited from interface java.lang.Comparable
compareTo
 

Method Detail

size

long size()
Get the number of elements in the bag, both in memory and on disk.

Returns:
number of elements in the bag

isSorted

boolean isSorted()
Find out if the bag is sorted.

Returns:
true if this is a sorted data bag, false otherwise.

isDistinct

boolean isDistinct()
Find out if the bag is distinct.

Returns:
true if the bag is a distinct bag, false otherwise.

iterator

Iterator<Tuple> iterator()
Get an iterator to the bag. For default and distinct bags, no particular order is guaranteed. For sorted bags, tuples are returned in the order defined by the comparator provided when the bag was created, or by the default comparator for Tuple if none was provided.

Specified by:
iterator in interface Iterable<Tuple>
Returns:
tuple iterator

add

void add(Tuple t)
Add a tuple to the bag.

Parameters:
t - tuple to add.

addAll

void addAll(DataBag b)
Add contents of a bag to the bag.

Parameters:
b - bag to add contents of.

clear

void clear()
Clear out the contents of the bag, both on disk and in memory. Any attempts to read after this is called will produce undefined results.


markStale

@InterfaceAudience.Private
void markStale(boolean stale)
This is used by FuncEvalSpec.FakeDataBag.

Parameters:
stale - Set stale state.


Copyright © ${year} The Apache Software Foundation