Parallel Programming Laboratory

Optimizing Non-Commutative Allreduce over Virtualized, Migratable MPI Ranks

Sam White | Laxmikant Kale

Workshop on Advances in Parallel and Distributed Computing Models at IPDPS (APDCM) 2022

Publication Type: Paper

Abstract

Dynamic load balancing can be difficult for MPI-based applications. Application logic and algorithms are often rewritten to enable dynamic repartitioning of the domain. An alternative approach is to virtualize the MPI ranks as threads, instead of operating system processes, and to migrate those threads around the system to balance the computational load. Adaptive MPI is one such implementation: it supports virtualization of MPI ranks as migratable user-level threads. However, this migratability can itself introduce new performance overheads to applications. In this paper, we identify non-commutative reduction operations as problematic for any runtime that supports either user-defined initial mapping of ranks or dynamic migration of ranks among the cores or nodes of a machine. We investigate the challenges of supporting efficient non-commutative reduction operations and implement algorithmic alternatives in Adaptive MPI, such as recursive doubling with adaptive message combining. We explore the tradeoffs among these algorithms for different message sizes and mappings of ranks to cores, and demonstrate our performance improvements using microbenchmarks.
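
To illustrate why non-commutative reductions are sensitive to how contributions are combined, the following is a minimal Python sketch that simulates ranks in a single process. It is not Adaptive MPI's implementation and does not model message combining or migration. The function names, the power-of-two restriction, and the use of string concatenation as the non-commutative operator are all assumptions made for illustration.

```python
from functools import reduce


def recursive_doubling_allreduce(values, op):
    """Simulate an order-preserving recursive-doubling allreduce.

    values[r] is the contribution of rank r. This sketch assumes the
    number of ranks is a power of two (a simplification, not a
    requirement of real MPI implementations).
    """
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two number of ranks")

    current = list(values)
    distance = 1
    while distance < p:
        updated = []
        for rank in range(p):
            partner = rank ^ distance
            # Keep the lower-ranked block on the left so that the final
            # result is op(v0, op(v1, ...)) in rank order on every rank.
            if partner < rank:
                updated.append(op(current[partner], current[rank]))
            else:
                updated.append(op(current[rank], current[partner]))
        current = updated
        distance *= 2
    return current


def combine_in_order(values, order, op):
    """Combine contributions in an arbitrary order, e.g. the order in
    which ranks happen to be placed on cores (hypothetical scenario)."""
    return reduce(op, (values[r] for r in order))


def concat(left, right):
    """String concatenation: associative but not commutative."""
    return left + right


if __name__ == "__main__":
    contributions = ["a", "b", "c", "d"]
    print(recursive_doubling_allreduce(contributions, concat))
    print(combine_in_order(contributions, [2, 3, 0, 1], concat))
```

In this sketch, every simulated rank ends with the contributions combined in rank order, whereas combining them in placement order (here, ranks 2 and 3 before ranks 0 and 1) yields a different value. That difference is the kind of ordering constraint a runtime must respect when ranks are remapped or migrated.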