Thursday, April 28, 2011
Speaker: Rodrigo Rodrigues
Venue: IST Austria
As online data sets grow over time, computations that process bulk data become increasingly more expensive. Furthermore, often the same computation runs on evolving data sets repeatedly, which implies that consecutive runs share a large fraction of their input. This motivates an incremental approach where results are incrementally updated as data evolves, instead of being recomputed from scratch. To achieve this, new systems for incremental bulk data processing have been proposed, such as Google’s Percolator or Yahoo!’s CBP. However, using these systems comes at a price of losing compatibility with the interface offered by systems like MapReduce, and more importantly, requiring the programmer to implement application-specific dynamic algorithms, which the literature shows to be difficult to develop and implement even for simple problems.
In this talk I will describe the architecture, implementation, and evaluation of incoop, a generic MapReduce framework for incremental computations. incoop detects changes to the inputs to computations and enables the automatic update of the outputs by employing an efficient, fine-grained propagation mechanism. By integrating the proposed approach into the MapReduce framework, we transparently benefit a large class of existing applications. Furthermore, by adopting principles from algorithms and programming languages research, we are able to systematically identify the shortcomings of an initial, task memoization-based approach, and address them using several novel techniques such as a new storage system to store the input of consecutive runs, a new contraction phase that make the incremental computation of the reduce tasks more efficient, and a new scheduling algorithm for Hadoop that is aware of the location of previously computed results.