January 13, 2023
Journal Article

Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numerical Behaviors

Abstract

General Matrix Multiplication (GEMM) is a fundamental computation pattern in applications such as deep neural networks and computer graphics, and GPGPUs play an important role in accelerating it. Nvidia first introduced Tensor Cores in its Volta architecture; they have since evolved through Turing (second generation) to Ampere (third generation). Tensor Cores can provide significant speed-ups over the traditional processing units (i.e., CUDA cores). Although Tensor Cores share a similar memory hierarchy with CUDA cores, their programmability and performance differ significantly from those of CUDA cores, and many important details of Tensor Cores have not been released by the GPU vendor. In this paper, we introduce a set of microbenchmarks that expose these undocumented characteristics. Specifically, we investigate the three main sets of PTX instructions for Tensor Cores (ldmatrix, mma, and mma.sp) and highlight their differences from the CUDA WMMA APIs, which are higher-level interfaces that can exploit only a limited subset of Tensor Core features and performance. We also study the numerical behaviors of the low-precision floating-point data types (TF32, BF16, and FP16) used in Tensor Core computation. Our microbenchmark results offer a better understanding of Nvidia Tensor Cores, which can facilitate the optimization of Tensor Core software and accurate Tensor Core performance modeling.
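As context for the WMMA-versus-PTX contrast the abstract draws, the sketch below (not taken from the paper) shows the two programming levels: a warp-level GEMM tile written with the high-level nvcuda::wmma API, and the same class of operation issued directly as a PTX mma.sync instruction through inline assembly. The kernel names and the per-lane operand layout in the second kernel are illustrative assumptions; the true fragment-to-lane mapping is exactly the kind of undocumented detail the paper's microbenchmarks probe.

```cuda
#include <mma.h>
using namespace nvcuda;

// High level: one warp computes a 16x16x16 tile D = A*B + C via the WMMA API.
__global__ void wmma_tile(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);       // zero the accumulator
    wmma::load_matrix_sync(a, A, 16);     // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);       // one Tensor Core MMA
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}

// Low level: issue the PTX mma instruction directly (m16n8k16, FP16 inputs,
// FP32 accumulate; requires sm_80 / Ampere). Per the PTX ISA, each lane of
// the warp holds 4 x .b32 registers of A, 2 x .b32 of B, and 4 x .f32 of D.
__global__ void ptx_mma_tile(const unsigned *A, const unsigned *B, float *D) {
    unsigned a[4], b[2];
    float d[4] = {0.f, 0.f, 0.f, 0.f};
    int lane = threadIdx.x % 32;
    // Illustrative per-lane indexing only, NOT the real fragment mapping.
    for (int i = 0; i < 4; ++i) a[i] = A[lane * 4 + i];
    for (int i = 0; i < 2; ++i) b[i] = B[lane * 2 + i];
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]));
    for (int i = 0; i < 4; ++i) D[lane * 4 + i] = d[i];
}
```

For reference, the three low-precision formats studied differ in their bit layouts: FP16 uses 1 sign, 5 exponent, and 10 mantissa bits; BF16 uses 1/8/7; and TF32 uses 1/8/10 (19 significant bits stored in a 32-bit word), which is why their rounding and accumulation behaviors on Tensor Cores differ.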


Citation

Sun, W., A. Li, T. Geng, S. Stuijk, and H. Corporaal. 2023. "Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numerical Behaviors." IEEE Transactions on Parallel and Distributed Systems 34, no. 1: 246-261. PNNL-SA-173565. doi:10.1109/TPDS.2022.3217824