Master's thesis presentation. Yakup is advised by Ravil Dorozhinskii, David Schneller, and Prof. Dr. Michael Bader.
Yakup Koray Budanaz: Improved GPU Kernel Generation for SeisSol using Loop-over-GEMM and Sparse-Matrix Operations
SCCS Colloquium |
Matrix multiplications and tensor contractions play a pivotal role in achieving high performance in Arbitrary High-Order Derivative Discontinuous Galerkin (ADER-DG) schemes. In these schemes the computational domain is divided into cells, each described by a matrix or a tensor, and the time integration requires batched matrix multiplications or tensor contractions. The matrices or tensors per cell have tiny dimensions, such that they fit into the caches, but the number of cells, and therefore the number of matrices and tensors in a batch, can reach the millions. Existing linear algebra libraries and tensor contraction implementations fall short of high performance for these batched tensor contractions and matrix multiplications.

This work extends Gemmforge, a code generator for batched tiny general matrix multiplications (GEMM) that generates highly optimized batched GPU GEMM kernels for SeisSol and targets multiple backends. The first extension adds sparse matrix operations through sparse-by-dense and dense-by-sparse matrix multiplication, targeting matrices with static sparsity patterns whose values may or may not be known at compile time. The second adds tensor contractions through loop-over-GEMM and component-wise product approaches, targeting non-transposed, dense tensors.

The implemented dense-by-sparse kernels consistently obtain around 90% of the peak floating-point performance with respect to operational intensity, and the sparse-by-dense kernels between 80% and 90%. The component-wise product and loop-over-GEMM kernels commonly obtain between 80% and 90% of the peak performance with respect to operational intensity. Their performance may occasionally drop due to the dimensions of the involved operands and sub-tensor slices, although it can be restored with the correct optimizations. A draft for generalizing these optimization ideas is discussed as future work.
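To make the loop-over-GEMM idea concrete, the following is a minimal sketch in plain Python, not code produced by Gemmforge. It assumes a contraction of the form C[i,j,k] = sum over l of A[i,l,k] * B[l,j], with all tensors stored as flat column-major buffers. Under that assumption, each slice A[:, :, k] is a contiguous matrix, so the contraction reduces to one small GEMM per value of k. All function names and the chosen contraction are hypothetical and serve illustration only.

```python
# Loop-over-GEMM sketch for C[i,j,k] = sum_l A[i,l,k] * B[l,j].
# Hypothetical names; tensors are flat lists in column-major order.


def gemm(m, n, p, a, a_off, b, b_off, c, c_off):
    """Accumulate C (m x n) += A (m x p) * B (p x n) on flat buffers.

    All matrices are column-major. The offsets select a sub-matrix
    that lives inside a larger tensor buffer.
    """
    for col in range(n):
        for inner in range(p):
            b_val = b[b_off + inner + col * p]
            for row in range(m):
                c[c_off + row + col * m] += a[a_off + row + inner * m] * b_val


def contract_loop_over_gemm(I, J, K, L, a, b):
    """Return C[i,j,k] = sum_l A[i,l,k] * B[l,j] as a flat list."""
    c = [0.0] * (I * J * K)
    for k in range(K):
        # A[:, :, k] is a contiguous I x L matrix starting at k*I*L,
        # and C[:, :, k] is a contiguous I x J matrix starting at k*I*J.
        gemm(I, J, L, a, k * I * L, b, 0, c, k * I * J)
    return c


def contract_reference(I, J, K, L, a, b):
    """Naive contraction with explicit index arithmetic, for checking."""
    c = [0.0] * (I * J * K)
    for k in range(K):
        for j in range(J):
            for i in range(I):
                c[i + j * I + k * I * J] = sum(
                    a[i + l * I + k * I * L] * b[l + j * L] for l in range(L)
                )
    return c


if __name__ == "__main__":
    I, J, K, L = 2, 3, 2, 2
    a = [float(x) for x in range(1, I * L * K + 1)]
    b = [float(x) for x in range(1, L * J + 1)]
    assert contract_loop_over_gemm(I, J, K, L, a, b) == contract_reference(
        I, J, K, L, a, b
    )
```

In a batched GPU setting, one would expect the loop over k and the inner GEMM to be fused into a single generated kernel that runs once per cell. The sketch only shows how the contraction decomposes into matrix products on sub-tensor slices.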