Really Big Data: Analytics on Graphs with Trillions of Edges
Big graphs occur naturally in many applications, most obviously in social networks, but also in many other areas such as biology and forensics. Current approaches to processing large graphs use either supercomputers or very large clusters. In both cases the entire graph must reside in memory before it can be processed. We are pursuing an alternative approach, processing graphs from secondary storage. While this comes with a performance penalty, it makes analytics on very large graphs feasible on a small number of commodity machines.
We have developed two systems, one for a single machine and one for a cluster of machines. X-Stream, the single-machine solution, aims to make all secondary storage access sequential. It uses two techniques to achieve this goal: edge-centric processing and streaming partitions. Chaos, the cluster solution, starts from the observation that there is little benefit to locality when accessing data from secondary storage over a high-speed network. As a result, Chaos spreads graph data uniformly at random over storage devices, and uses randomized access to achieve I/O balance. Chaos furthermore uses work stealing to achieve computational load balance. By using these techniques, it avoids the need for expensive partitioning during pre-processing, while still achieving good scaling behavior. With Chaos we have been able to process an 8-trillion-edge graph on 32 machines, a new milestone for graph size on a small cluster. I will describe both systems and their performance on a number of benchmarks and in comparison to state-of-the-art alternatives.
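To make the edge-centric idea concrete, here is a minimal single-machine sketch in Python (the names are illustrative, not X-Stream's actual API): the scatter phase streams sequentially over the edge list and emits updates, and the gather phase applies them to vertex state, so the edge data is never accessed randomly.

```python
# Minimal sketch of edge-centric scatter-gather (illustrative only,
# not X-Stream's actual API). Vertex state is updated by streaming
# over the edge list sequentially instead of by random vertex access.

def edge_centric_cc(num_vertices, edges):
    """Connected components via label propagation over an edge stream."""
    label = list(range(num_vertices))  # each vertex starts in its own component
    changed = True
    while changed:
        changed = False
        # Scatter phase: stream all edges sequentially, emit updates.
        updates = []
        for (src, dst) in edges:
            if label[src] < label[dst]:
                updates.append((dst, label[src]))
            elif label[dst] < label[src]:
                updates.append((src, label[dst]))
        # Gather phase: apply the collected updates to vertex state.
        for (v, new_label) in updates:
            if new_label < label[v]:
                label[v] = new_label
                changed = True
    return label

# Two components: {0, 1, 2} and {3, 4}.
print(edge_centric_cc(5, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3]
```

X-Stream additionally groups vertices into streaming partitions so that the vertex state touched by each chunk of edges fits in memory; that refinement is omitted in this sketch.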
This is joint work with Laurent Bindschaedler (EPFL), Jasmina Malicevic (EPFL) and Amitabha Roy (Intel Labs).
Willy Zwaenepoel (School of Computer and Communication Sciences, EPFL)
Willy Zwaenepoel received his BS/MS from the University of Gent, Belgium, and his PhD from Stanford University. He is currently a Professor of Computer Science at EPFL. Previously, he held appointments as Professor of Computer Science and Electrical Engineering at Rice University, and as Dean of the School of Computer and Communication Sciences at EPFL. His interests are in operating systems and distributed systems. He is a Fellow of the ACM and the IEEE, has received the IEEE Kanai Award and several best-paper awards, and is a member of the Belgian and European Academies. He has also been involved in a number of startups, including BugBuster (acquired by AppDynamics), iMimic (acquired by Ironport/Cisco), Midokura and Nutanix.
I will present an extensive formalization of Markov chains (MCs) and Markov
decision processes (MDPs), with discrete time and (possibly
infinite) discrete state-spaces. The formalization takes a coalgebraic
view on the transition systems representing MCs and constructs their
trace spaces. On these trace spaces properties like fairness,
reachability, and stationary distributions are formalized. Similar to
MCs, MDPs are represented as transition systems with a construction for
trace spaces. These trace spaces provide maximal and minimal
expectation over all possible non-deterministic decisions. As
applications we provide a certifier for finite reachability problems
and we relate the denotational semantics and operational semantics of
the probabilistic guarded command language (pGCL).
A distinctive feature of our formalization is the order-theoretic and
coalgebraic view on our concepts: we view transition systems as
coalgebras, we view traces as coinductive streams, we provide iterative
computation rules for expectations, and we define many properties on
traces as least or greatest fixed points.
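As one example of this fixed-point view (the notation here is illustrative, not necessarily the formalization's own), the probability of reaching a target set T in an MC with transition probabilities p can be characterized as a least fixed point:

\[
\mathrm{reach}_T(s) \;=\; \mu X.\;
\begin{cases}
1 & \text{if } s \in T,\\
\sum_{t} p(s,t)\, X(t) & \text{otherwise,}
\end{cases}
\]

i.e. the least solution of this equation system over $[0,1]$, which is exactly the shape of definition the order-theoretic view supports.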
Interference is perhaps most apparent in concurrent programs that use
shared variables. The search for tractable (i.e.
compositional) ways to reason about interference has an interesting
history which this talk will briefly review. Coming up to date,
current research on “Rely/Guarantee thinking” will be described and
will include some interesting connections with “Separation Logic”
which focusses on showing when interference is absent. (Although
interference is also troublesome for communication-based concurrency
and related ideas have been shown to be effective for processes, the
talk will confine itself to the shared-variable case.)
For many distributed computing problems there exists a characterizing
combinatorial structure. In other words, the existence of an algorithm
for solving the problem on any graph G in time T(G) implies the
existence of the structure in G with some parameter related to T(G),
and vice versa. Such relations go back to classic examples, such as
synchronizers and sparse spanners, and continue to emerge in recent
studies of gossip algorithms and multicast algorithms in different
communication settings. In this talk I will give an overview of both
old and recent results that illustrate these connections. I will show
how finding distributed algorithms as well as proving lower bounds can
be reduced to studying combinatorial graph structures.
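A classic example of such a structure is a sparse spanner: an $\alpha$-spanner of a graph $G$ is a subgraph $H$ satisfying

\[
d_H(u,v) \;\le\; \alpha \cdot d_G(u,v) \qquad \text{for all } u, v \in V(G),
\]

so that $H$ preserves all distances up to the stretch factor $\alpha$ while, in the sparse case, containing far fewer edges than $G$.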
No previous knowledge in distributed computing will be assumed.
In this talk we consider the problem of synthesising strategies which ensure minimal regret for a player in a Discounted-Sum Game. Such strategies minimise the difference between the payoff obtained by the player and the payoff he could have achieved if he had known the strategy of his adversary in advance. We give algorithms for the general problem of computing the minimal regret of a player as well as several variants depending on which strategies the adversary is permitted to use. We also consider the special case of synthesising regret-free strategies in the same scenarios.
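In a common formulation (the notation here is illustrative), the regret of a player strategy $\sigma$ against the adversary's strategies $\tau$ is

\[
\mathrm{Reg}(\sigma) \;=\; \sup_{\tau} \Bigl( \sup_{\sigma'} \mathrm{val}(\sigma', \tau) \;-\; \mathrm{val}(\sigma, \tau) \Bigr),
\]

where $\mathrm{val}$ is the discounted sum $\sum_{i \ge 0} \lambda^i w_i$ of the weights along the resulting play, for a discount factor $0 < \lambda < 1$; the synthesis problem then asks for a strategy minimizing $\mathrm{Reg}(\sigma)$.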
Patients in intensive care often have a dozen or more medical devices
and sensors attached to them. Each is a self-contained system that
operates in ignorance of the others, and their operation as an
integrated system of systems that delivers coherent therapy is
arranged and managed by doctors and nurses. But we can easily imagine
a scenario where the devices recognize each other and self-integrate
(perhaps under the guidance of a suitable “therapy app”) into a
unified system. Similar scenarios can be (and are) envisaged for
vehicles and roads, for the devices and services in a home, and for
the general “Internet of Things.”
These self-integrating systems promise significant benefit, but also
have the potential for harm, so as they integrate they should adapt
and configure themselves appropriately and should construct an
“assurance case” for the utility and safety of the resulting system.
Thus, trustworthy self-integration requires autonomous adaptation,
synthesis, and verification at integration time, and this means that
embedded automated deduction (i.e., theorem provers) will be the
engine of integration in the Internet of Things.
Shape analysis is a static program-analysis technique that discovers and verifies properties of a program’s dynamically allocated data structures. For example, shape analysis can infer that a linked list is acyclic, and prove that a program cannot free an element more than once. More generally, shape analysis provides a method to establish properties of systems whose states can be modeled as relations that evolve over time. A shape analyzer discovers quantified invariants of the elements of such systems.
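As an illustration (using standard shape-analysis notation, not necessarily the talk's own), acyclicity of a list headed at $x$ with next-field $n$ can be expressed as the quantified invariant

\[
\forall v.\; n^{*}(x, v) \;\rightarrow\; \neg\, n^{+}(v, v),
\]

where $n^{*}$ and $n^{+}$ denote the reflexive-transitive and transitive closures of the next relation: no cell reachable from the head can reach itself again.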
In this talk, I will describe the road from analyzing dynamically allocated data structures to analyzing network protocols and other distributed systems. The survey includes both sound techniques and complete techniques. Some of these fundamental techniques inspired tools that are deployed in industry.
Mooly Sagiv is a full professor in the School of Computer Sciences at Tel-Aviv University. He is a leading researcher in the area of large scale (inter-procedural) program analysis, and one of the key contributors to shape analysis. His fields of interest include programming languages, compilers, abstract interpretation, profiling, pointer analysis, shape analysis, inter-procedural dataflow analysis, program slicing, and language-based programming environments. Sagiv is a recipient of a 2013 senior ERC research grant for Verifying and Synthesizing Software Composition. Prof. Sagiv served as Member of the Advisory Board of Panaya Inc. He received best-paper awards at PLDI’11 and PLDI’12 for his work on composing concurrent data structures and an ACM SIGSOFT Retrospective Impact Paper Award (2011) for program slicing. He is an ACM Fellow.
The transient evolution of the stochastic process induced by a biochemical system is typically analysed by solving the Chemical Master Equation (CME). However, solving the CME is often infeasible, because it requires solving a number of differential equations equal to the number of reachable states, which can be huge or even infinite. An interesting alternative is to consider continuous stochastic approximations of the discrete stochastic process. In this talk, I will introduce the Linear Noise Approximation (LNA) of the CME, and show how it enables scalable model checking of formulae of Stochastic Evolution Logic (SEL), a logic for probabilistic analysis of biochemical systems. I will demonstrate that the LNA achieves good accuracy for a large class of systems.
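In its standard form (van Kampen's system-size expansion; the notation below is illustrative), the LNA writes the scaled population process as a deterministic mean plus Gaussian fluctuations:

\[
\frac{X(t)}{N} \;\approx\; \phi(t) + \frac{1}{\sqrt{N}}\,\xi(t),
\qquad
\frac{d\phi}{dt} = S\, f(\phi),
\]
\[
d\xi(t) \;=\; J(\phi(t))\,\xi(t)\,dt + B(\phi(t))\,dW(t),
\qquad
B B^{\top} = S\,\mathrm{diag}(f(\phi))\,S^{\top},
\]

where $N$ is the system size, $S$ the stoichiometry matrix, $f$ the propensity functions, and $J$ the Jacobian of $S f$. Since $\xi$ is Gaussian, transient probabilities reduce to ODEs for its mean and covariance, which is what makes the analysis scale.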
The LNA is known to be accurate if certain conditions on the propensity rates of the reactions are satisfied. These conditions always hold in the limit of high population. However, in biochemical systems it is common that they hold only for a subset of species and reactions (i.e. stiff systems). This leads us to the derivation of a stochastic hybrid system in which some species are treated with the LNA and others with the CME. I will present formal equations for the transient evolution of the resulting stochastic hybrid process. Finally, I will show how, for a large class of systems, this approach improves the accuracy of the LNA while still mitigating state-space explosion.
Concurrent memory reclamation, that is, the problem of reclaiming
memory in a multi-threaded environment, is one of the main problems in
concurrent data structure design. Almost all known solutions either
induce high overhead or must be customized to the specific data
structure by hand. In this talk, I will present two new solutions to
this problem. The
first is StackTrack, a memory reclamation algorithm which can be
applied automatically by a compiler, and still be efficient. The key
idea behind StackTrack is to eliminate most of the bookkeeping
overheads by using the hardware transactional memory (HTM) extensions
offered by new Intel processors.
The second solution, called ThreadScan, is an extension of the
framework which removes the need for hardware TM support by relaxing
progress guarantees, and exploiting the operating system signaling
mechanisms. Empirical results show that both solutions match or
improve upon the performance of previous, non-automatic solutions for
concurrent data structures.
My talk will also touch upon some technical and algorithmic issues
arising when using HTM for designing fast concurrent data structures.
This talk is based on joint work with Patrick Eugster, Maurice
Herlihy, Will Leiserson, Alexander Matveev, and Nir Shavit, which
appeared in EuroSys 2014 and SPAA 2015.
Dan Alistarh is currently a researcher at Microsoft Research
Cambridge. Before MSR, Dan was a post-doc at MIT, working with Nir
Shavit and Nancy Lynch. He obtained his PhD from EPFL, under the
guidance of Rachid Guerraoui. Dan’s research focuses on building
scalable algorithms, and combines algorithmic ideas with practical
systems considerations.