Parallel Programming Laboratory

Achieving Computation-Communication Overlap with Overdecomposition on GPU Systems

Jaemin Choi | David Richards | Laxmikant Kale

Workshop on Extreme Scale Programming Models and Middleware (ESPM2) 2020

Publication Type: Paper

Abstract

The landscape of high-performance computing is shifting towards a collection of multi-GPU nodes, widening the gap between on-node compute and off-node communication capabilities. Consequently, the ability to tolerate communication latencies and maximize utilization of the compute hardware is becoming increasingly important in achieving high performance. Overdecomposition, which enables a logical decomposition of the problem domain without being constrained by the number of processors, has been successfully adopted on traditional CPU-based systems to achieve computation-communication overlap, significantly reducing the impact of communication on performance. However, it has been unclear whether overdecomposition can provide the same benefits on modern GPU systems, especially given the perceived overheads associated with the smaller kernels that overdecomposition entails. In this work, we address the challenges in applying overdecomposition to GPU-accelerated applications and in ensuring asynchronous progress of GPU operations using the Charm++ parallel programming system. By combining prioritization of communication in the application with support for asynchronous progress in the runtime system, we obtain improvements in overall performance of up to 50% and 47% with the proxy applications Jacobi3D and MiniMD, respectively.
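
The trade-off described in the abstract, overlap gained from overdecomposition versus the overhead of smaller kernels, can be illustrated with a toy timing model. The sketch below is not taken from the paper and does not use Charm++. It assumes a simple two-stage pipeline in which each chunk is first computed and then communicated, with equal-sized chunks and a fixed per-chunk overhead. The function name `pipelined_time` and all of its parameters are hypothetical.

```python
def pipelined_time(compute, comm, chunks, per_chunk_overhead=0.0):
    """Makespan of a toy two-stage pipeline (compute, then communicate).

    Assumptions (illustrative only, not from the paper):
      - total compute and communication time split evenly across chunks
      - each chunk pays a fixed overhead (e.g. kernel launch) in its compute stage
      - communication of chunk i overlaps with computation of chunk i+1
    """
    if chunks < 1:
        raise ValueError("chunks must be >= 1")
    a = compute / chunks + per_chunk_overhead  # compute stage time per chunk
    b = comm / chunks                          # communication stage time per chunk
    # The first chunk passes through both stages; each remaining chunk adds
    # only the time of the slower stage.
    return a + b + (chunks - 1) * max(a, b)


if __name__ == "__main__":
    # Hypothetical numbers: 10 time units of compute, 6 of communication.
    for k in (1, 2, 4, 8, 64):
        print(k, pipelined_time(10.0, 6.0, k, per_chunk_overhead=0.05))
```

With a single chunk the model reduces to compute plus communication with no overlap. Increasing the chunk count hides more of the communication until the per-chunk overhead dominates, which is the tension the abstract refers to.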
