Power, Reliability, Performance: One System to Rule Them All
IEEE Computer Journal Special Issue on Energy-Efficient Computing (Computer) 2016
Publication Type: Paper
Traditionally, High Performance Computing (HPC) data centers and applications have emphasized performance. However, future-generation supercomputing systems are expected to face major challenges in reliability, power management, and thermal variation. Disruptive solutions are required to optimize performance in the presence of these challenges. We believe that a smart parallel runtime system that is part of each job, and that interacts with an adaptive resource manager for the whole machine, is key to overcoming the challenges of next-generation supercomputing data centers.