Over the last year, there has been significant interest in solving many small linear algebra problems simultaneously. Library vendors such as Intel (MKL) and NVIDIA, along with researchers at institutions including the University of Manchester, the University of Tennessee, and Sandia National Laboratories, have all been attempting to perform these computations as efficiently as possible.
Over the weekend prior to the SIAM CSE17 meeting, many of those researchers (including myself) held a workshop to discuss strategies for batched BLAS (basic linear algebra subprogram) computations. Much of the discussion focused on standardising the function APIs and the memory layouts that users will interact with. The slides, and a number of research papers on the topic, are available at this page.
At the SIAM CSE17 meeting, our team at Manchester organised a minisymposium to discuss the highlights of our weekend with a wider audience. A brief summary of the four talks, along with a copy of their slides, is given below.
Talk 1: Mawussi Zounon (Univ. of Manchester)
The first talk was given by my colleague Mawussi Zounon. After introducing the basic concept behind batched BLAS, he discussed a strategy called “interleaving” that we have been working on to increase the performance of batched BLAS routines. Essentially, by changing the order of the matrix elements in memory, we can make better use of the vector units and cache memory in modern multicore architectures such as the 68-core Intel Knights Landing (KNL). This can lead to extremely large speedups (over 40x) for batched Cholesky solves. The slides can be found here.
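To make the idea concrete, here is a minimal sketch of the index mapping behind interleaving (my own illustration, not code from the talk): rather than storing each matrix contiguously, the same element of every matrix in the batch is stored contiguously, so the loop over the batch becomes the unit-stride inner loop that the vector units want.

```c
#include <stddef.h>

/* Batch of N column-major M x M matrices. All names here are
   illustrative, not taken from the talk or any library.       */

/* Standard layout: matrices stored one after another.
   Element (i,j) of matrix k lives at A[k*M*M + j*M + i].      */
size_t idx_standard(size_t i, size_t j, size_t k, size_t M)
{
    return k * M * M + j * M + i;
}

/* Interleaved layout: element (i,j) of every matrix in the
   batch is stored contiguously, so a vector unit can process
   the same element of consecutive matrices in one instruction.
   Element (i,j) of matrix k lives at A[(j*M + i)*N + k].      */
size_t idx_interleaved(size_t i, size_t j, size_t k, size_t M, size_t N)
{
    return (j * M + i) * N + k;
}

/* With the interleaved layout, the inner loop of a batched
   kernel runs over the batch index k, giving unit-stride,
   vectorisable memory access:                                 */
void batched_scale(double *A, double alpha, size_t M, size_t N)
{
    for (size_t j = 0; j < M; j++)
        for (size_t i = 0; i < M; i++)
            for (size_t k = 0; k < N; k++)   /* vectorises */
                A[(j * M + i) * N + k] *= alpha;
}
```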
Talk 2: Sarah Knepper (Intel)
In the second talk, Sarah Knepper outlined the new Intel KNL architecture, discussed MKL's proposed API for batched BLAS operations, and gave some excellent performance results for the batched DGEMM and DTRSM routines that will appear in the MKL 18 beta later this year. Whilst the MKL 18 beta will not include the interleaved memory format discussed by Mawussi, there are plans to incorporate it in future releases. Sarah's presentation can be found here.
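For reference, MKL exposes its group-based batched interface through routines such as cblas_dgemm_batch. The sketch below is my own illustration (check the MKL documentation for the exact signature in the release you use) of how a batch of identically sized multiplications is expressed as a single group.

```c
#include <stdlib.h>
#include <mkl.h>

int main(void)
{
    enum { BATCH = 1000, N = 8 };   /* 1000 small 8x8 multiplies */

    const MKL_INT m[1] = {N}, n[1] = {N}, k[1] = {N};
    const MKL_INT lda[1] = {N}, ldb[1] = {N}, ldc[1] = {N};
    const double alpha[1] = {1.0}, beta[1] = {0.0};
    const CBLAS_TRANSPOSE ta[1] = {CblasNoTrans}, tb[1] = {CblasNoTrans};
    const MKL_INT group_size[1] = {BATCH};

    /* One contiguous slab per operand; the API takes arrays of
       pointers to the individual matrices.                      */
    double *A = malloc(BATCH * N * N * sizeof *A);
    double *B = malloc(BATCH * N * N * sizeof *B);
    double *C = malloc(BATCH * N * N * sizeof *C);
    for (int i = 0; i < BATCH * N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    const double *Ap[BATCH], *Bp[BATCH];
    double *Cp[BATCH];
    for (int i = 0; i < BATCH; i++) {
        Ap[i] = A + i * N * N;
        Bp[i] = B + i * N * N;
        Cp[i] = C + i * N * N;
    }

    /* All matrices form a single "group" sharing one set of
       sizes and scaling parameters: C_i = alpha*A_i*B_i.        */
    cblas_dgemm_batch(CblasColMajor, ta, tb, m, n, k,
                      alpha, Ap, lda, Bp, ldb, beta, Cp, ldc,
                      1 /* group_count */, group_size);

    free(A); free(B); free(C);
    return 0;
}
```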
Talk 3: Chris Cecka (NVIDIA)
Next up, Chris Cecka spoke about his recent work on accelerating tensor contractions on GPUs. This is a vitally important task for accelerating deep learning models, and it also arises in solvers for high-dimensional PDEs, among other applications. By using batched (and interleaved) matrix multiplications, Chris explained how contractions over all possible permutations of the tensor indices can be computed. He also showed the excellent performance improvements to the strided batched GEMM in the latest CUDA toolkit release: the performance of batched GEMM is now essentially the same as that of a regular GEMM of equivalent size. More information on strided batched GEMM can be found here, whilst the slides for this talk are here.
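The cuBLAS routine behind these results is cublasDgemmStridedBatched (available since CUDA 8). The wrapper below is my own illustrative sketch of how a batch of n-by-n multiplications, laid out at a fixed stride in device memory, is dispatched in a single call.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Strided batched interface: each batch member lives at a fixed
   stride from the previous one, so one pointer plus a stride
   describes the whole batch. dA, dB, dC are device pointers to
   "batch" column-major n x n matrices stored back to back.      */
void small_gemms(const double *dA, const double *dB, double *dC,
                 int n, int batch)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const double alpha = 1.0, beta = 0.0;
    long long stride = (long long)n * n;  /* distance between matrices */

    /* C_i = A_i * B_i for i = 0, ..., batch-1, in one call. */
    cublasDgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              n, n, n,
                              &alpha, dA, n, stride,
                                      dB, n, stride,
                              &beta,  dC, n, stride,
                              batch);
    cublasDestroy(handle);
}
```

Because no array of pointers needs to be built or transferred, the strided form has essentially no setup overhead, which is what closes the gap with regular GEMM.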
Talk 4: Ahmad Abdelfattah (Univ. of Tennessee)
The final talk of the session was given by Ahmad Abdelfattah from the Innovative Computing Laboratory, Univ. of Tennessee. Ahmad spoke about the design of batched BLAS operations in MAGMA (a popular linear algebra library for GPUs). In particular, he gave details about how the memory layout must be adapted for GPU architectures to deal with their different memory hierarchy. Ahmad's slides are available here.
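As a rough illustration of the kind of layout adaptation involved (a sketch of common GPU practice, not MAGMA's actual code), the snippet below pads the leading dimension of each small matrix so that columns start on aligned addresses for coalesced access, and builds the array of device pointers that batched GPU kernels typically consume.

```c
#include <stdlib.h>
#include <cuda_runtime.h>

/* Allocate a batch of m x n matrices on the GPU with a padded
   leading dimension, and build the device pointer array that
   pointer-based batched kernels take. Returns the padded ldda,
   or -1 on allocation failure. Illustrative sketch only.       */
int setup_batch(int m, int n, int batch,
                double **dA_out, double ***dA_array_out)
{
    /* Round the leading dimension up to a multiple of 32 so
       that every column starts on an aligned address.          */
    int ldda = ((m + 31) / 32) * 32;

    /* One contiguous slab holding all padded matrices. */
    double *dA;
    if (cudaMalloc((void **)&dA,
                   (size_t)ldda * n * batch * sizeof(double)) != cudaSuccess)
        return -1;

    /* Build the pointer array on the host, then copy it over. */
    double **hA_array = malloc(batch * sizeof(double *));
    for (int i = 0; i < batch; i++)
        hA_array[i] = dA + (size_t)i * ldda * n;

    double **dA_array;
    cudaMalloc((void **)&dA_array, batch * sizeof(double *));
    cudaMemcpy(dA_array, hA_array, batch * sizeof(double *),
               cudaMemcpyHostToDevice);
    free(hA_array);

    *dA_out = dA;
    *dA_array_out = dA_array;
    return ldda;
}
```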
Overall, the minisymposium was a great success. There were around 60 people in the audience, and the talks sparked some healthy debate about our approaches and the performance that can be obtained. We would welcome any feedback on batched linear algebra, or on how it can be exploited in applications.