Randomly Pivoted Cholesky: Near-Optimal Positive Semidefinite Low-Rank Approximation from a Small Number of Entry Evaluations
Ethan N. Epperly, Yifan Chen, Joel A. Tropp, Robert J. Webber
Abstract
This talk describes randomly pivoted Cholesky (RPCholesky), a randomized algorithm for computing a low-rank approximation to a Hermitian positive semidefinite (psd) matrix. To compute a rank-\(k\) approximation to an \(N\times N\) matrix, RPCholesky performs a \(k\)-step partial Cholesky decomposition in which each pivot index is sampled with probability proportional to the corresponding diagonal entry of the current residual matrix (i.e., the Schur complement). The algorithm requires \(\order (k^2N)\) operations and reads only \((k+1)N\) entries of the input matrix.
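To make the procedure concrete, here is a minimal NumPy sketch of the \(k\)-step loop just described; the function name rpcholesky and the dense-matrix interface are illustrative assumptions, not the authors' reference implementation. Only the diagonal of \(\mat {A}\) and one column per step are read, matching the \((k+1)N\) entry count.
\begin{verbatim}
import numpy as np

def rpcholesky(A, k, rng=None):
    """Sketch of randomly pivoted Cholesky (illustrative, not reference code).

    Returns an N x k factor F with A_hat = F @ F.conj().T, plus the pivot list S.
    Only diag(A) and one column of A per step are read: (k+1)N entries total.
    """
    rng = np.random.default_rng(rng)
    N = A.shape[0]
    F = np.zeros((N, k), dtype=A.dtype)
    d = np.real(np.diag(A)).astype(float)          # residual (Schur complement) diagonal
    S = []
    for i in range(k):
        p = rng.choice(N, p=d / d.sum())           # pivot ~ residual diagonal
        S.append(p)
        col = A[:, p] - F[:, :i] @ F[p, :i].conj() # column p of the residual matrix
        F[:, i] = col / np.sqrt(col[p].real)
        d = np.maximum(d - np.abs(F[:, i]) ** 2, 0.0)  # update residual diagonal
    return F, S
\end{verbatim}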
The RPCholesky method has an interesting history. The existence of the method was briefly noted in a 2017 paper of Musco and Woodruff [9], and it is algebraically related to an earlier “randomly pivoted QR” algorithm of Deshpande, Rademacher, Vempala, and Wang (2006) [3]. Our paper [2], originally released in 2022, reintroduced the algorithm, described its connection to Cholesky decomposition, evaluated the method numerically, and provided new theoretical results.
Surprisingly, this simple algorithm is guaranteed to produce a near-optimal low-rank approximation. The output of RPCholesky, like that of any other partial Cholesky decomposition, is a low-rank approximation of the form
\begin{equation*} \mat {\hat {A}} = \mat {A}(:,\set {S}) \mat {A}(\set {S},\set {S})^\dagger \mat {A}(\set {S},:), \end{equation*}
where \(\set {S}\) denotes the set of pivots selected by the algorithm and \({}^\dagger \) denotes the Moore–Penrose pseudoinverse. This type of low-rank approximation is known as a (column) Nyström approximation and is used widely to accelerate kernel machine learning methods. It is known [7] that a set \(\set {S}\) of \(k \ge r/\varepsilon \) columns is needed to produce a Nyström approximation \(\mat {\hat {A}}\) within a \(1+\varepsilon \) multiplicative factor of the best rank-\(r\) approximation \(\lowrank {\mat {A}}_r\), i.e.,
\begin{equation*} \norm {\smash {\mat {A} - \mat {\hat {A}}}}_* \le (1+\varepsilon ) \norm {\mat {A} - \lowrank {\mat {A}}_r}_*. \end{equation*}
Here, \(\norm {\cdot }_*\) denotes the trace norm. In [2], we showed that RPCholesky achieves the guarantee:
\begin{equation*} \mathbb {E} \left [\norm {\smash {\mat {A} - \mat {\hat {A}}}}_*\right ] \le (1+\varepsilon ) \norm {\mat {A} - \lowrank {\mat {A}}_r}_* \quad \text {when } k \ge \frac {r}{\varepsilon } + r \log \left ( \frac {1}{\varepsilon \eta } \right ). \end{equation*}
Here, \(\mat {\hat {A}}\) is the approximation produced by \(k\) steps of RPCholesky and \(\eta = \norm {\mat {A} - \lowrank {\mat {A}}_r}_*/\norm {\mat {A}}_*\) denotes the relative error of the best rank-\(r\) approximation. In expectation, RPCholesky achieves the optimal scaling \(k = r/\varepsilon \) up to an additive term that is logarithmic in the relative error \(\eta \).
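As a quick sanity check of the Nyström identity and the trace-norm comparison above, the following snippet reuses the rpcholesky sketch from earlier on a synthetic psd test matrix (both the test matrix and the parameter choices are assumptions for illustration), forms \(\mat {A}(:,\set {S}) \mat {A}(\set {S},\set {S})^\dagger \mat {A}(\set {S},:)\) explicitly, and compares trace-norm errors against the best rank-\(r\) approximation.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N, k, r = 300, 30, 10
G = rng.standard_normal((N, N))
A = G @ G.T / N                          # synthetic psd test matrix (illustration only)

F, S = rpcholesky(A, k, rng=rng)         # sketch defined earlier
A_nys = A[:, S] @ np.linalg.pinv(A[np.ix_(S, S)]) @ A[S, :]
print(np.linalg.norm(F @ F.T - A_nys))   # ~0 up to rounding: partial Cholesky = column Nystrom

eigs = np.sort(np.linalg.eigvalsh(A))    # eigenvalues, ascending
best_r_err = eigs[:-r].sum()             # trace-norm error of the best rank-r approximation
rpc_err = np.trace(A - F @ F.T)          # trace-norm error of the psd residual A - A_hat
print(rpc_err / best_r_err)              # ratio of RPCholesky error to the optimal rank-r error
\end{verbatim}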
RPCholesky has proven effective at accelerating kernel machine learning methods. Given a data set \(\vec {x}_1,\ldots ,\vec {x}_N\), kernel methods perform machine learning tasks such as regression and clustering by manipulating a psd kernel matrix \(\mat {A} = (\kappa (\vec {x}_i,\vec {x}_j))_{1\le i,j\le N}\), where \(\kappa \) is a given positive definite kernel function. When implemented directly, kernel methods require \(\order (N^3)\) operations and \(\order (N^2)\) storage. By replacing \(\mat {A}\) with a low-rank approximation \(\mat {\hat {A}}\) (say, of rank \(k = \order (1)\)), the storage and runtime costs of these methods are reduced to \(\order (N)\). This talk will present numerical experiments from [2], which show that an RPCholesky-accelerated clustering method can be \(9\times \) to \(14\times \) more accurate than accelerated clustering methods based on other low-rank approximation techniques. Subsequent papers have applied RPCholesky to accelerate the learning of committor functions in biochemistry [1], to build preconditioners for conjugate gradient [4], to perform quadrature in reproducing kernel Hilbert spaces [5], and to compress data sets [8].
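To illustrate the \(\order (N)\) cost claim, here is a hedged sketch (not the preconditioning scheme of [4], and with an illustrative function name and interface) of how a rank-\(k\) factor \(\mat {F}\) with \(\mat {\hat {A}} = \mat {F}\mat {F}^*\) lets a kernel ridge regression system be solved in \(\order (Nk^2)\) time and \(\order (Nk)\) storage via the Woodbury identity.
\begin{verbatim}
import numpy as np

def ridge_solve_low_rank(F, y, lam):
    """Solve (F @ F.T + lam * I) x = y in O(N k^2) time and O(N k) memory.

    F is an N x k low-rank factor (e.g., from the RPCholesky sketch above);
    the Woodbury identity reduces the N x N solve to a k x k solve.
    """
    k = F.shape[1]
    z = np.linalg.solve(lam * np.eye(k) + F.T @ F, F.T @ y)  # small k x k system
    return (y - F @ z) / lam
\end{verbatim}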
While the standard version of RPCholesky is already fast, it is slower than it could be because it processes the columns of the input matrix one at a time. A blocked version of the method is faster, but it can produce approximations of lower accuracy. This talk will conclude by discussing the recently introduced accelerated RPCholesky method [6], which simulates the behavior of the original RPCholesky method using a combination of rejection sampling and block-wise computations. The accelerated RPCholesky method can be up to \(40\times \) faster than the original method while producing the same random output (in exact arithmetic).
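The full blocked algorithm of [6] is beyond the scope of this abstract, but the rejection-sampling ingredient can be sketched as follows. This is a simplified illustration under the sole assumption that the residual diagonal only decreases during the algorithm, so a stale diagonal entrywise dominates the current one and can serve as a proposal distribution.
\begin{verbatim}
import numpy as np

def sample_pivot_by_rejection(d_new, d_old, rng):
    """Draw one pivot with probability proportional to d_new using stale proposals.

    Proposals come from the outdated distribution d_old (e.g., the residual
    diagonal at the start of a block); since d_new <= d_old entrywise,
    accepting proposal i with probability d_new[i] / d_old[i] yields an exact
    sample from d_new, so block-wise work can reproduce the sequential
    pivot distribution.
    """
    N = len(d_old)
    while True:
        i = rng.choice(N, p=d_old / d_old.sum())   # cheap proposal from the stale diagonal
        if rng.random() < d_new[i] / d_old[i]:     # accept with the ratio probability
            return int(i)
\end{verbatim}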
References
[1] David Aristoff, Mats Johnson, Gideon Simpson, and Robert J. Webber. The fast committor machine: Interpretable prediction with kernels. The Journal of Chemical Physics, 161(8):084113, 2024.
[2] Yifan Chen, Ethan N. Epperly, Joel A. Tropp, and Robert J. Webber. Randomly pivoted Cholesky: Practical approximation of a kernel matrix with few entry evaluations. Communications on Pure and Applied Mathematics, accepted, 2024.
[3] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. In Proceedings of the 2006 Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1117–1126, 2006.
[4] Mateo Díaz, Ethan N. Epperly, Zachary Frangella, Joel A. Tropp, and Robert J. Webber. Robust, randomized preconditioning for kernel ridge regression. arXiv preprint arXiv:2304.12465, 2024.
[5] Ethan N. Epperly and Elvira Moreno. Kernel quadrature with randomly pivoted Cholesky. In Advances in Neural Information Processing Systems 36, 2023.
[6] Ethan N. Epperly, Joel A. Tropp, and Robert J. Webber. Embrace rejection: Kernel matrix approximation by accelerated randomly pivoted Cholesky. arXiv preprint arXiv:2410.03969, 2024.
[7] Venkatesan Guruswami and Ali Kemal Sinop. Optimal column-based low-rank matrix reconstruction. In Proceedings of the 2012 Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1207–1214, 2012.
[8] Lingxiao Li, Raaz Dwivedi, and Lester Mackey. Debiased distribution compression. arXiv preprint arXiv:2404.12290, 2024.
[9] Cameron Musco and David P. Woodruff. Sublinear time low-rank approximation of positive semidefinite matrices. In 2017 IEEE Annual Symposium on Foundations of Computer Science, pages 672–683, 2017.