@InterfaceAudience.Public @InterfaceStability.Stable public interface DataBag extends Spillable, org.apache.hadoop.io.WritableComparable, Iterable<Tuple>, Serializable
DataBag provides an Iterator interface that allows callers to read through the contents. The iterators are aware of data spilling: they must be able to read from files, and to cope with data they were reading from memory being spilled to disk underneath them.
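As a rough illustration of what "aware of spilling" can mean, here is a minimal sketch in plain Java. It is not Pig's implementation: `SpillAwareBagSketch`, its two lists, and the use of `String` in place of `Tuple` are all assumptions made for this example, and an in-memory list stands in for the spill file. The iterator tracks a logical position rather than a position in the in-memory list, so it keeps working if the data moves underneath it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a spillable bag; not part of Pig.
final class SpillAwareBagSketch implements Iterable<String> {
    private final List<String> inMemory = new ArrayList<>();
    private final List<String> onDisk = new ArrayList<>(); // stands in for a spill file

    synchronized void add(String value) {
        inMemory.add(value);
    }

    // Move everything held in memory to the end of the "disk" store.
    synchronized void spill() {
        onDisk.addAll(inMemory);
        inMemory.clear();
    }

    private synchronized int total() {
        return onDisk.size() + inMemory.size();
    }

    // Logical order is: spilled elements first, then in-memory elements.
    private synchronized String get(int pos) {
        return pos < onDisk.size() ? onDisk.get(pos) : inMemory.get(pos - onDisk.size());
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int pos = 0; // logical position, unaffected by spills

            @Override
            public boolean hasNext() {
                return pos < total();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return get(pos++);
            }
        };
    }

    public static void main(String[] args) {
        SpillAwareBagSketch bag = new SpillAwareBagSketch();
        bag.add("a");
        bag.add("b");
        bag.add("c");
        Iterator<String> it = bag.iterator();
        String first = it.next(); // served from memory
        bag.spill();              // data moves underneath the iterator
        System.out.println(first + it.next() + it.next()); // prints "abc"
    }
}
```

The design choice in this sketch is that a spill never changes the logical order of elements, which is what lets a position-based iterator survive it.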
The DataBag interface assumes that all data is written before any is read. That is, a DataBag cannot be used as a queue. If data is written after data is read, the results are undefined. This condition is not checked on each add or read, for reasons of speed. Caveat emptor.
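The write-then-read rule can be sketched with a plain `ArrayList` standing in for a bag. This is only an analogy: `WriteOrderSketch` and its methods are hypothetical names, and `String` replaces `Tuple`. Note one difference from the text above: `ArrayList` happens to detect a write after a read has started and fails fast, whereas DataBag, as stated, performs no such check and leaves the result undefined.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration using java.util collections, not Pig classes.
final class WriteOrderSketch {

    // Supported pattern: finish all writes, then read.
    static int writeThenRead() {
        List<String> bag = new ArrayList<>();
        bag.add("a");
        bag.add("b");
        bag.add("c");
        int seen = 0;
        for (String ignored : bag) {
            seen++;
        }
        return seen;
    }

    // Unsupported pattern: write after reading has begun.
    // ArrayList's iterator detects this; a DataBag would not.
    static boolean writeAfterReadIsDetected() {
        List<String> bag = new ArrayList<>();
        bag.add("a");
        bag.add("b");
        Iterator<String> it = bag.iterator();
        it.next();    // reading has started
        bag.add("c"); // late write
        try {
            it.next();
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(writeThenRead());
        System.out.println(writeAfterReadIsDetected());
    }
}
```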
Since spills are asynchronous (the memory manager requesting a spill runs in a separate thread), all operations dealing with the mContents Collection (which is the collection of tuples contained in the bag) have to be synchronized. This means that reading from a DataBag is currently serialized. This is acceptable for the moment because Pig execution is currently single-threaded. A ReadWriteLock was experimented with, but it was found to be about 10x slower than using the synchronized keyword. If Pig changes its execution model to be multithreaded, we may need to return to this issue, as synchronizing reads will most likely defeat the purpose of multithreaded execution.
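The locking scheme described above can be sketched as follows. This is an assumption-laden illustration rather than Pig's code: `SynchronizedBagSketch` is a hypothetical class, `contents` plays the role of mContents, a second list stands in for spill files, and a busy-looping thread stands in for the memory manager. Every method that touches `contents` is `synchronized`, so a spill can never interleave with an add.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a bag whose collection is guarded by synchronized methods.
final class SynchronizedBagSketch {
    private final List<Integer> contents = new ArrayList<>(); // analogous to mContents
    private final List<Integer> spilled = new ArrayList<>();  // stands in for spill files

    synchronized void add(int value) {
        contents.add(value);
    }

    // Called from another thread; moves in-memory data to the "disk" store.
    synchronized long spill() {
        long moved = contents.size();
        spilled.addAll(contents);
        contents.clear();
        return moved;
    }

    synchronized long size() {
        return (long) contents.size() + spilled.size();
    }

    // Adds 'count' elements while a second thread spills continuously,
    // then reports how many elements the bag holds in total.
    static long fillWhileSpilling(int count) {
        SynchronizedBagSketch bag = new SynchronizedBagSketch();
        Thread spiller = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                bag.spill();
            }
        });
        spiller.start();
        for (int i = 0; i < count; i++) {
            bag.add(i);
        }
        spiller.interrupt();
        try {
            spiller.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return bag.size();
    }

    public static void main(String[] args) {
        System.out.println(fillWhileSpilling(10_000)); // prints 10000
    }
}
```

Because all three methods lock the same object, readers are serialized along with writers, which is the cost the paragraph above refers to.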
DataBags come in several types: default, sorted, and distinct. The type must be chosen up front; there is no way to convert a bag on the fly. Default data bags do not guarantee any particular order of retrieval for the tuples and may contain duplicate tuples. Sorted data bags guarantee that tuples will be retrieved in order, where "in order" is defined either by the default comparator for Tuple or by the comparator provided by the caller when the bag was created. Sorted bags may contain duplicates. Distinct bags do not guarantee any particular order of retrieval, but do guarantee that they will not contain duplicate tuples.
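The three contracts can be approximated with standard Java collections. The sketch below is an analogy only: `BagKindsSketch` and its method names are hypothetical, `String` replaces `Tuple`, and the choice of `ArrayList` and `HashSet` is an assumption for illustration, not a description of how Pig stores bags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for the three bag types; not Pig classes.
final class BagKindsSketch {

    // Default: duplicates kept, callers must not rely on any order.
    static List<String> defaultBag(List<String> input) {
        return new ArrayList<>(input);
    }

    // Sorted: duplicates kept, retrieval follows the comparator
    // (a null comparator means natural ordering).
    static List<String> sortedBag(List<String> input, Comparator<String> order) {
        List<String> out = new ArrayList<>(input);
        out.sort(order);
        return out;
    }

    // Distinct: duplicates dropped, no order guaranteed.
    static Set<String> distinctBag(List<String> input) {
        return new HashSet<>(input);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("b", "a", "b");
        System.out.println(defaultBag(input).size());   // prints 3
        System.out.println(sortedBag(input, null));     // prints [a, b, b]
        System.out.println(distinctBag(input).size());  // prints 2
    }
}
```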
Modifier and Type | Method and Description |
---|---|
void | add(Tuple t): Add a tuple to the bag. |
void | addAll(DataBag b): Add contents of a bag to the bag. |
void | clear(): Clear out the contents of the bag, both on disk and in memory. |
boolean | isDistinct(): Find out if the bag is distinct. |
boolean | isSorted(): Find out if the bag is sorted. |
Iterator<Tuple> | iterator(): Get an iterator to the bag. |
void | markStale(boolean stale): This is used by FuncEvalSpec.FakeDataBag. |
long | size(): Get the number of elements in the bag, both in memory and on disk. |
Methods inherited from interface Spillable: getMemorySize, spill
Methods inherited from interface java.lang.Comparable: compareTo
long size()
boolean isSorted()
boolean isDistinct()
Iterator<Tuple> iterator()
void add(Tuple t)
t - tuple to add.
void addAll(DataBag b)
b - bag to add contents of.
void clear()
@InterfaceAudience.Private void markStale(boolean stale)
stale - Set stale state.
Copyright © 2007-2012 The Apache Software Foundation