A Distributed Dynamic Load Balancer for Iterative Applications
International Conference for High Performance Computing, Networking, Storage and Analysis (SC) 2013
Publication Type: Paper
For many applications, computation load varies over time. Such applications require dynamic load balancing to im- prove the performance. Centralized load balancing schemes, which collect global information and compute the decisions at a central location, are not scalable. In contrast, fully dis- tributed strategies are scalable but typically do not produce a balanced work distribution as they tend to consider only local information. This paper describes a fully distributed algorithm for load balancing that uses partial information about the global state of the system to make good load balancing decisions. This algorithm, referred to as Grapevine, consists of two stages: global information propagation using a lightweight algorithm inspired by epidemic  algorithms, and work unit transfer using a randomized algorithm. The proposed algorithm is scalable and can be tuned to optimize for ei- ther cost or performance. We substantiate our claims by providing analysis of the algorithm, detailed simulation and performance comparison on adaptive mesh refinement and molecular dynamics applications. We demonstrate the ef- fectiveness of Grapevine for adaptive mesh refinement and molecular dynamics on up to 131,072 cores of BlueGene/Q.