Mixed-precision hardware has recently become commercially available, and more than 25% of the supercomputers in the TOP500 list now have mixed-precision capabilities. Using lower precision in algorithms can be beneficial in terms of reducing both computation and communication costs.
Many ongoing efforts aim to develop mixed-precision numerical linear algebra algorithms, which can lead to speedups in real applications. Motivated by this, we aim to further the state of the art in developing and analyzing mixed-precision variants of iterative methods.
In iterative methods based on Krylov subspaces, an orthonormal basis is generated by the Arnoldi or Lanczos process or their variants. In long-recurrence methods such as GMRES, an explicit orthogonalization scheme such as Gram-Schmidt is needed to orthonormalize the generated vectors.
One variant of Gram-Schmidt is classical Gram-Schmidt (CGS). From a performance perspective, CGS is preferable because it requires only one global synchronization per vector.
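To make the one-synchronization property concrete, here is a minimal dense NumPy sketch of CGS (the function name and interface are ours for illustration): for each new column, all projection coefficients against the previously orthonormalized columns are computed in a single block of dot products, which in a distributed setting corresponds to one global reduction per vector.

```python
import numpy as np

def cgs(A):
    """Classical Gram-Schmidt QR factorization A = Q R (illustrative sketch).

    All dot products for column j are computed at once as Q[:, :j].T @ a,
    i.e., one synchronization per vector in a parallel implementation.
    """
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        a = A[:, j]
        # One block of dot products -> one global reduction per vector.
        R[:j, j] = Q[:, :j].T @ a
        w = a - Q[:, :j] @ R[:j, j]
        R[j, j] = np.linalg.norm(w)
        Q[:, j] = w / R[j, j]
    return Q, R
```

The price for this efficiency is the well-known instability of CGS: in finite precision, the loss of orthogonality of the computed Q grows with the conditioning of A.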
Computing Krylov subspace bases and performing orthogonalization require sparse matrix-vector products and dot products in each iteration. These computations make communication the bottleneck of Krylov subspace methods in practical settings.
Communication-avoiding Krylov subspace methods (also called s-step methods) can asymptotically reduce the communication cost per iteration versus standard approaches. To avoid communication, iterations are computed in blocks of s, which involves generating a basis for a Krylov subspace followed by block orthogonalization.
In this way, the data dependencies between sparse matrix-vector multiplication and dot products in the method can be broken. To perform block orthogonalization, block Gram-Schmidt (BGS) algorithms are needed in block and s-step Krylov subspace methods.
As standalone orthogonalization schemes, block approaches also have performance benefits, as they make use of BLAS3 operations. BCGS-P is a block variant of classical Gram-Schmidt based on the Pythagorean theorem which guarantees $O(u)\kappa^2(A)$ loss of orthogonality as long as $O(u)\kappa^2(A) < 1$.
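The Pythagorean idea can be sketched as follows (a minimal dense NumPy illustration under our own naming, not the talk's exact implementation): the diagonal block of R is recovered from a Cholesky factorization of $X_k^T X_k - S^T S$, where $S = Q^T X_k$ holds the projections onto the previously orthonormalized blocks, rather than by an extra orthogonalization pass.

```python
import numpy as np

def bcgs_p(X, s):
    """BCGS-P sketch: block CGS with the Pythagorean diagonal-block trick.

    Columns are processed s at a time. For each block X_k:
      S    = Q^T X_k                    (projections onto previous blocks)
      R_kk = chol(X_k^T X_k - S^T S)^T  (Pythagorean identity)
      Q_k  = (X_k - Q S) R_kk^{-1}
    """
    m, n = X.shape
    Q = np.zeros((m, 0))
    R = np.zeros((n, n))
    for k in range(0, n, s):
        Xk = X[:, k:k + s]
        S = Q.T @ Xk
        P = Xk.T @ Xk - S.T @ S            # Gram matrix of the residual block
        Rkk = np.linalg.cholesky(P).T      # upper-triangular diagonal block
        W = Xk - Q @ S
        Qk = np.linalg.solve(Rkk.T, W.T).T # W @ inv(Rkk) via triangular solve
        R[:k, k:k + s] = S
        R[k:k + s, k:k + s] = Rkk
        Q = np.hstack([Q, Qk])
    return Q, R
```

The Cholesky factorization of $X_k^T X_k - S^T S$ is exactly where the method breaks down when $A$ is too ill conditioned: that matrix must remain numerically positive definite, which is the source of the $O(u)\kappa^2(A) < 1$ requirement.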
This constraint on the condition number of the input matrix severely limits the problems to which this algorithm applies. The constraint can potentially be loosened in two ways: (1) using a mixed-precision CholQR routine, and (2) incorporating a shift into CholQR.
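As a rough illustration of how the two strategies interact, the sketch below (our own illustrative routine, not necessarily the one analyzed in the talk) forms the Gram matrix and its Cholesky factor in double precision even when the input is stored in single, and adds a small shift of order $u\,\|X\|_2^2$ so that the factorization remains positive definite; the constant in the shift is an assumption made for illustration.

```python
import numpy as np

def mp_shifted_cholqr(X):
    """Mixed-precision shifted CholQR sketch (illustrative).

    Strategy 1: the Gram matrix G = X^T X and its Cholesky factor are
    computed in double precision, even if X is stored in single.
    Strategy 2: a shift of order u*||X||_2^2 is added to G so that the
    Cholesky factorization cannot break down for ill-conditioned X.
    The constant 11*(m*s + s*(s+1)) below is illustrative only.
    """
    m, s = X.shape
    Xd = X.astype(np.float64)               # promote to high precision
    u = np.finfo(np.float64).eps            # unit roundoff of Gram precision
    G = Xd.T @ Xd                           # Gram matrix in double
    shift = 11.0 * (m * s + s * (s + 1)) * u * np.linalg.norm(Xd, 2) ** 2
    R = np.linalg.cholesky(G + shift * np.eye(s)).T
    Q = np.linalg.solve(R.T, Xd.T).T.astype(X.dtype)  # back to working precision
    return Q, R
```

The trade-off is that the shift slightly perturbs orthogonality in exchange for robustness, while the high-precision Gram computation reduces the rounding-error term that drives the $\kappa^2(A)$ constraint.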
In this talk, both potential strategies will be explored.