Tensors, which are matrices extended to three or more dimensions, can naturally model large and sparse multi-way data. Tensor factorization methods such as Canonical Decomposition/Parallel Factorization with Alternating Least Squares (CP-ALS) are becoming increasingly popular tools in large-scale data analytics. Prior work in this area has focused on decreasing the run time of the critical routines in CP-ALS. However, as the computation time for these routines decreases, communication becomes a significant performance bottleneck. In this work, we enhance ReFacTo, a distributed heterogeneous CP-ALS implementation, to reduce its communication costs. We evaluate the performance of ReFacTo on a standard cluster with a single GPU per node and on the NVIDIA DGX-1, a single-node system with 8 GPUs connected by the NVLink interconnect. On each system, we compare the communication performance of MPI, CUDA-aware MPI, and NVIDIA's Collective Communications Library (NCCL). Our results show that for large tensors, NCCL performs better than MPI and CUDA-aware MPI on both systems. We find that the communication time on the DGX-1 when using NCCL is up to 10.9x faster than MPI and 2.85x faster than CUDA-aware MPI, leading to as much as a 64% decrease in overall CP-ALS application runtime. On the cluster, we find that NCCL is up to 2x faster than MPI and 1.5x faster than CUDA-aware MPI.
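
As a rough illustration of the three communication paths being compared, the sketch below issues the same collective over a dense device buffer through plain MPI (staged via host memory), CUDA-aware MPI (device pointers passed directly), and NCCL (a GPU collective issued on a CUDA stream). This is a minimal, hypothetical example: the use of an all-reduce, the buffer names and sizes, and the one-GPU-per-rank setup are assumptions for illustration only and are not taken from ReFacTo's source.

// Hypothetical sketch: three ways to synchronize `count` doubles across ranks.
// Not ReFacTo's actual communication pattern; buffers and sizes are illustrative.
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

// (1) Plain MPI: stage device data through a host buffer on each side of the collective.
void allreduce_plain_mpi(double* d_buf, double* h_buf, size_t count) {
    cudaMemcpy(h_buf, d_buf, count * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Allreduce(MPI_IN_PLACE, h_buf, (int)count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    cudaMemcpy(d_buf, h_buf, count * sizeof(double), cudaMemcpyHostToDevice);
}

// (2) CUDA-aware MPI: hand the device pointer to MPI directly (requires a CUDA-aware MPI build).
void allreduce_cuda_aware_mpi(double* d_buf, size_t count) {
    MPI_Allreduce(MPI_IN_PLACE, d_buf, (int)count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}

// (3) NCCL: GPU-to-GPU collective on a CUDA stream; can route over NVLink where available.
void allreduce_nccl(double* d_buf, size_t count, ncclComm_t comm, cudaStream_t stream) {
    ncclAllReduce(d_buf, d_buf, count, ncclDouble, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    cudaSetDevice(rank);  // simplification: one GPU per rank on a single node

    // Bootstrap the NCCL communicator over MPI.
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    const size_t count = 1 << 20;  // illustrative buffer size, not a ReFacTo parameter
    double *d_buf, *h_buf;
    cudaMalloc((void**)&d_buf, count * sizeof(double));
    cudaMemset(d_buf, 0, count * sizeof(double));
    cudaMallocHost((void**)&h_buf, count * sizeof(double));

    allreduce_plain_mpi(d_buf, h_buf, count);
    allreduce_cuda_aware_mpi(d_buf, count);
    allreduce_nccl(d_buf, count, comm, stream);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    cudaStreamDestroy(stream);
    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}

The plain MPI path pays for two host-device copies per collective, while the CUDA-aware MPI and NCCL paths operate on device memory directly; on NVLink-connected systems such as the DGX-1, NCCL can additionally exploit direct GPU-to-GPU links.
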
Thomas B. Rolinger is a researcher at the Laboratory for Physical Sciences at the University of Maryland. He received a B.S. in Computer Science from the University of West Florida and an M.S. in Computer Science from Florida State University. He is currently a Ph.D. student at the University of Maryland, College Park. His research interests include evaluating the performance of irregular algorithms and distributed multi-GPU applications.