Online Testing of Deployed Federated and Heterogeneous Distributed Systems

Date: Thursday, October 20, 2011
Speaker: Dejan Kostic
Venue: IST Austria

It is notoriously difficult to make distributed systems reliable. This becomes even harder in the case of the widely-deployed systems that are heterogeneous (multiple implementations) and federated (multiple administrative entities). The set of routers in charge of the Internet’s inter-domain routing is a prime example of such a system.
We argue that a key step in making these systems reliable is to automatically explore the system behavior to check for potential faults. In this talk, I will describe the design and implementation of DiCE, a system for online testing of heterogeneous and federated distributed systems. DiCE runs concurrently with the production system by leveraging distributed checkpoints and isolated communication channels. DiCE orchestrates the exploration of relevant system states by controlling the inputs that drive system actions. While respecting privacy among different administrative entities, DiCE detects faults by checking for violations of properties that capture the desired system behavior. We demonstrate the ease of integrating DiCE with a BGP router and a DNS server, the building blocks of two vital services in the Internet. Our evaluation in the testbed shows that DiCE quickly and successfully detects three important classes of faults, resulting from configuration mistakes, policy conflicts, and programming errors.
Joint work with Marco Canini, Vojin Jovanovic, Daniele Venzano, Gautam Kumar, Dejan Novakovic, Boris Spasojevic, and Olivier Crameri.


