SC19 Schedule

The Case of Performance Variability on Dragonfly-based Systems

10:10 Wednesday, November 20

The performance of a parallel code running on a large supercomputer can vary significantly from one run to another, even when the executable and input parameters are left unchanged. Such variability can be caused by perturbation of the computation, the communication, or both. In this paper, we investigate performance variability arising from network effects on supercomputers that use a dragonfly topology, specifically Cray XC systems equipped with the Aries interconnect. We perform post-mortem analysis of network hardware counters, profiling output, job queue logs, and placement information, all gathered from periodic runs of representative applications. We investigate the causes of performance variability using deviation prediction and recursive feature elimination. Additionally, we use machine learning to build models from time-stepped performance data of individual applications that can forecast the execution time of future time steps.

Speaker Bio - Abhinav Bhatele

Abhinav Bhatele is an assistant professor in the department of computer science at the University of Maryland, College Park. Previously, he was a senior computer scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His research interests are broadly in systems and networks, with a focus on parallel computing and big data analytics. He has published research on programming models and runtimes, network design and simulation, applications of machine learning to parallel systems, and the analysis, modeling, and optimization of the performance of parallel software and systems.

Abhinav received a B.Tech. degree in Computer Science and Engineering from I.I.T. Kanpur, India, in May 2005, and M.S. and Ph.D. degrees in Computer Science from the University of Illinois at Urbana-Champaign in 2007 and 2010, respectively. Abhinav was an ACM-IEEE CS George Michael Memorial HPC Fellow in 2009. He has received best paper awards at Euro-Par 2009, IPDPS 2013, and IPDPS 2016. Abhinav was selected as a recipient of the IEEE TCSC Young Achievers in Scalable Computing award in 2014 and the LLNL Early and Mid-Career Recognition award in 2018.





