15th Annual Charm++ Workshop

Extending ParaTreeT, A Framework For Spatial Tree-Based Algorithms, With GPU Kernels and Load Balancers
Thesis 2022
Publication Type: MS Thesis
Repository URL:
Improving the performance of iterative, computationally heavy applications with frequent memory access is a challenging problem. This thesis presents performance improvement efforts for the ParaTreeT library. ParaTreeT is a parallel tree toolkit inspired by the N-body simulation problem, which models the dynamic motion of astronomical bodies from a set of initial conditions. ParaTreeT provides a generic parallel tree traversal framework targeting high scalability and programmability. User input is partitioned and decomposed into the leaf nodes of a chosen tree structure, and interactions among particles are computed through traversals of a global tree. Users supply custom structs of user data and choose both the tree type into which the data is partitioned and the partitioning algorithm used; the library is also extensible with custom traversal algorithms. ParaTreeT achieves better central processing unit (CPU) performance than its predecessor ChaNGa by serving tree data through a shared-memory cache model and by separating the data computation functionality from the spatial tree representation. ParaTreeT demonstrated a 2-3x speedup over ChaNGa on up to 256 nodes (21,504 threads) of the POWER9-based Summit supercomputer.

This thesis focuses on improving the performance of the ParaTreeT framework through two approaches. The first is to implement load balancing strategies that speed up iterative code whose load imbalance grows over time. The thesis presents several load balancing algorithms and the efforts made to make them scalable; through theoretical runtime analysis of each algorithm, together with scaling experiments, it identifies the scaling bottleneck and scalability of each algorithm. The second approach is to add general-purpose computing on graphics processing units (GPGPU) kernels that offload highly parallel, computationally intensive work from CPU hosts to graphics processing units (GPUs).
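As a hedged illustration of the kind of centralized strategy a load balancer might use (the abstract does not specify ParaTreeT's actual algorithms), the classic greedy longest-processing-time (LPT) rule assigns the heaviest remaining object to the least-loaded processing element. All names here are illustrative, not ParaTreeT's API:

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Illustrative greedy LPT assignment: sort object loads in descending
// order, then repeatedly place the heaviest remaining object on the
// currently least-loaded PE. Returns, for each object in sorted order,
// the PE index it was assigned to.
std::vector<int> lptAssign(std::vector<double> loads, int numPes) {
    std::sort(loads.begin(), loads.end(), std::greater<double>());
    // Min-heap of (current total load, PE index).
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pes;
    for (int p = 0; p < numPes; ++p) pes.push({0.0, p});
    std::vector<int> assignment(loads.size());
    for (std::size_t i = 0; i < loads.size(); ++i) {
        auto [load, pe] = pes.top();
        pes.pop();
        assignment[i] = pe;                 // heaviest remaining -> least-loaded PE
        pes.push({load + loads[i], pe});
    }
    return assignment;
}
```

Sorting makes this O(n log n) in the number of objects plus O(n log p) heap operations, which hints at why centralized strategies become a scaling bottleneck at large PE counts, one of the concerns the thesis's runtime analysis addresses.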
This thesis identifies the computationally heavy blocks of the CPU implementation and proposes multiple GPU kernels to speed up that computation. It also analyzes the overhead of the GPU implementation and discusses plans for future improvements.
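To give a hedged sense of the kind of computationally heavy, highly parallel block such kernels target, the sketch below is a plain-C++ reference for a softened all-pairs gravitational interaction, a core operation in N-body codes. Each particle's accumulation is independent, so the outer loop maps naturally onto one GPU thread per particle; the struct and function names are hypothetical, not ParaTreeT code:

```cpp
#include <array>
#include <cmath>
#include <vector>

struct Particle { double x, y, z, mass; };

// CPU reference for the all-pairs force computation that is a natural
// candidate for GPU offload. Computes each particle's acceleration from
// every other particle, with Plummer-style softening `soft` to avoid
// singularities at small separations.
std::vector<std::array<double, 3>> allPairsAccel(
        const std::vector<Particle>& p, double G, double soft) {
    std::vector<std::array<double, 3>> acc(p.size(), {0.0, 0.0, 0.0});
    for (std::size_t i = 0; i < p.size(); ++i) {   // on a GPU: one thread per i
        for (std::size_t j = 0; j < p.size(); ++j) {
            if (i == j) continue;
            double dx = p[j].x - p[i].x;
            double dy = p[j].y - p[i].y;
            double dz = p[j].z - p[i].z;
            double r2 = dx * dx + dy * dy + dz * dz + soft * soft;
            double inv = 1.0 / (r2 * std::sqrt(r2));  // 1 / r^3 (softened)
            acc[i][0] += G * p[j].mass * dx * inv;
            acc[i][1] += G * p[j].mass * dy * inv;
            acc[i][2] += G * p[j].mass * dz * inv;
        }
    }
    return acc;
}
```

The O(n^2) arithmetic with a regular access pattern is exactly the profile that favors GPUs, while the host-device data transfer around such a kernel is the kind of overhead the thesis analyzes.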
Research Areas