Isolating GPU Architectural Features Using Parallelism-Aware Microbenchmarks

Abstract

GPUs develop at a rapid pace, with new architectures emerging every 12 to 18 months. Each new GPU architecture introduces new features intended to improve on previous generations. However, the impact of these changes on the performance of GPGPU applications may not be directly apparent; it is often unclear to developers how exactly these features will affect the performance of their code. In this paper, we propose a suite of microbenchmarks to uncover the performance of novel GPU hardware features in isolation. We target features in both the memory system and the arithmetic cores. We further ensure, by design, that our microbenchmarks capture the massively parallel nature of GPUs while providing fine-grained timing information at the level of individual compute units. Using this benchmarking suite, we study the differences between three of the most recent NVIDIA architectures: Pascal, Turing, and Ampere. We find that architectural differences can have a meaningful impact on both synthetic and more realistic applications. This impact is visible not only in outright performance, but also in the choice of execution parameters for realistic applications. We conclude that microbenchmarking, adapted to massive GPU parallelism, can expose differences between GPU generations, and we discuss how it can be adapted for future architectures.
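To illustrate the kind of measurement the abstract describes, the following is a minimal sketch, not the paper's actual suite, of a parallelism-aware microbenchmark in CUDA: a latency-bound pointer-chasing kernel timed with the per-SM cycle counter (`clock64()`) and attributed to individual compute units via the `%smid` special register. All names and parameters (`chase_kernel`, `N`, `ITERS`, the 80-block launch) are illustrative assumptions, not taken from the paper.

```cuda
// Hypothetical sketch of a per-SM timed microbenchmark; parameters are
// illustrative and not taken from the paper's benchmark suite.
#include <cstdio>
#include <cuda_runtime.h>

#define N (1 << 20)   // elements in the chase buffer (assumed size)
#define ITERS 4096    // dependent loads timed per block (assumed count)

// Read the ID of the SM (compute unit) this thread is running on.
__device__ unsigned smid() {
    unsigned id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void chase_kernel(const unsigned *buf, unsigned long long *cycles,
                             unsigned *sm_ids, unsigned *sink) {
    unsigned idx = (blockIdx.x * blockDim.x + threadIdx.x) % N;
    unsigned long long start = clock64();
    for (int i = 0; i < ITERS; ++i)
        idx = buf[idx];              // serialized, latency-bound loads
    unsigned long long stop = clock64();
    if (threadIdx.x == 0) {
        cycles[blockIdx.x] = stop - start;
        sm_ids[blockIdx.x] = smid(); // attribute the timing to one SM
    }
    sink[0] = idx;                   // keep the chase from being optimized away
}

int main() {
    unsigned *h_buf = new unsigned[N];
    for (unsigned i = 0; i < N; ++i)
        h_buf[i] = (i + 97) % N;     // 97 is coprime to N: one full cycle

    const int blocks = 80, threads = 32;
    unsigned *d_buf, *d_sm, *d_sink;
    unsigned long long *d_cyc;
    cudaMalloc(&d_buf, N * sizeof(unsigned));
    cudaMalloc(&d_cyc, blocks * sizeof(unsigned long long));
    cudaMalloc(&d_sm, blocks * sizeof(unsigned));
    cudaMalloc(&d_sink, sizeof(unsigned));
    cudaMemcpy(d_buf, h_buf, N * sizeof(unsigned), cudaMemcpyHostToDevice);

    chase_kernel<<<blocks, threads>>>(d_buf, d_cyc, d_sm, d_sink);
    cudaDeviceSynchronize();

    unsigned long long h_cyc[blocks];
    unsigned h_sm[blocks];
    cudaMemcpy(h_cyc, d_cyc, sizeof(h_cyc), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_sm, d_sm, sizeof(h_sm), cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks; ++b)
        printf("block %3d on SM %2u: %.1f cycles/load\n",
               b, h_sm[b], (double)h_cyc[b] / ITERS);

    cudaFree(d_buf); cudaFree(d_cyc); cudaFree(d_sm); cudaFree(d_sink);
    delete[] h_buf;
    return 0;
}
```

Launching many blocks concurrently and recording which SM serviced each one is what makes a benchmark like this parallelism-aware: rather than a single device-wide average, it exposes per-compute-unit latencies, which differ when blocks contend for shared memory-system resources.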

Publication
ICPE ’22: Proceedings of the 2022 ACM/SPEC International Conference on Performance Engineering