ceClub: Predicting Execution Bottlenecks in Map-Reduce Clusters

Edward Bortnikov (Yahoo! Labs Israel)
Wednesday, 19.12.2012, 11:30
EE Meyer Building 861

Extremely slow tasks, known as stragglers, are a major performance bottleneck in map-reduce systems. The Hadoop infrastructure tries both to avoid them (by minimizing remote data accesses) and to handle them at runtime (through speculative execution). However, the mechanisms in place neither guarantee that task scheduling avoids performance hotspots nor provide an easy way to tune the timely detection of stragglers.

We suggest a machine-learning approach to address these problems, and introduce a slowdown predictor - an oracle that forecasts how much slower a task will run on a given node, compared to similar tasks. Slowdown predictors can be embedded in the map-reduce infrastructure to improve the agility and timeliness of scheduling decisions. We provide an initial evaluation to demonstrate the viability of our approach, and discuss the use cases for the new paradigm.

Bio: Edward Bortnikov is a Principal Research Engineer at Yahoo! Labs. His interests broadly span large-scale distributed systems, search, big data analytics, and networking technologies. He has published more than 20 scientific papers in these areas. He received his PhD in Electrical Engineering from the Technion - Israel Institute of Technology in 2008. Prior to that, he worked in technical leadership positions at IBM Research, Mellanox Technologies, SANgate Systems, and HP Labs.
