Mixed precision hardware has recently become commercially available, and more than 25% of the supercomputers in the TOP500 list now have mixed precision capabilities. Using lower precision in algorithms can be beneficial in terms of reducing both computation and communication costs.
There are many current efforts towards developing mixed precision numerical linear algebra algorithms, which can lead to speedups in real applications. Motivated by this, we aim to further the state-of-the-art in developing and analyzing mixed precision variants of iterative methods.
In iterative methods based on Krylov subspaces, the orthogonal basis is generated by Arnoldi or Lanczos methods or their variants. In long recurrence methods such as GMRES, one needs to use an explicit orthogonalization scheme such as Gram-Schmidt to orthonormalize the vectors generated.Krylov subspace methods, such as GMRES, are typically communication-bound on modern machines; the runtime is usually dominated by the cost of global synchronizations which are required for the necessary orthogonalization.
This has motivated the development of various algorithmic variants which attempt to reduce the communication cost while maintaining a stable algorithm. Recent work has focused on the development of low-synchronization block variants of Gram-Schmidt orthogonalization, which, when used within communication-avoiding (s-step) or block variants of GMRES, can reduce the number of global synchronizations to one per block.In this work, we focus on the block variant of low-synchronization classical Gram-Schmidt with reorthogonalization, which we call BCGSI+LS.
We demonstrate that the loss of orthogonality produced by this orthogonalization scheme can exceed O(u)κ(𝒳), where u is the unit roundoff and κ(𝒳) is the condition number of the matrix to be orthogonalized, and thus we can not in general expect this to result in a backward stable block GMRES implementation. We then develop a mixed precision variant of this algorithm, called BCGSI+LS-MP, which uses higher precision in certain parts of the computation.
We demonstrate experimentally that for a number of challenging test problems, our mixed precision variant successfully maintains a loss of orthogonality below O(u)κ(��). This indicates that we can achieve a backward stable block GMRES algorithm that requires only one synchronization per iteration.