GPUs are crucial components in many leading High-Performance Computing systems, which rely on them for tasks such as machine learning, deep learning, and large scientific simulations. To use GPUs effectively, we need to understand how well current workloads utilize them. However, despite the critical role of GPUs, a significant gap remains in understanding their utilization. Previous approaches have analyzed system-wide monitoring data to understand the utilization of various resources such as CPU, GPU, I/O, memory, and power, but have not comprehensively explored a variety of GPU-specific hardware counters. This study fills that gap by examining previously unavailable GPU hardware counters collected via the Lightweight Distributed Metric Service (LDMS) on Perlmutter, including counters for memory copy operations, precision types, streaming multiprocessors, framebuffers, and NVLink interconnects. To the best of our knowledge, this is the first study to conduct an extensive analysis of how workloads utilize the GPUs on Perlmutter. By analyzing these hardware counters, we provide insights into current GPU utilization patterns and identify opportunities for architectural improvements.
Onur Cankur is a PhD student at the University of Maryland, working on high-performance computing (HPC) in the Parallel Software and Systems Group with Abhinav Bhatele. His main interest is in performance analysis and modeling of large-scale parallel applications. He is also interested in resource usage analysis and modeling of HPC systems.