Preconditioning without a preconditioner: faster ridge-regression and Gaussian sampling with randomized block Krylov methods
Tyler Chen, Caroline Huber, Ethan Lin, Hajar Zaid
Abstract
One of the most important tasks in numerical linear algebra is solving the linear system
\begin{equation} \label{eqn:regularized_ls} \vec{A} \vec{x} = \vec{b}, \end{equation}
where \(\vec {A}\in \mathbb {R}^{d\times d}\) is symmetric positive definite with eigenvalues \(\lambda _1\geq \lambda _2 \geq \cdots \geq \lambda _d > 0\). Krylov subspace methods (KSMs) such as the conjugate gradient method are among the most powerful methods for this problem and are guaranteed to converge extremely rapidly if the system is well-conditioned; i.e. if \(\lambda _1\approx \lambda _d\). For ill-conditioned systems, preconditioning can greatly accelerate the convergence of KSMs. When \(\vec {A}\) has a rapidly decaying spectrum, a technique called Nyström preconditioning has proven effective [1].
Consider the Nyström approximation
\begin{equation} \label{eqn:nystromBKI_def} \vec{A}\langle \vec{K}_s \rangle := (\vec{A}\vec{K}_s)(\vec{K}_s^\T \vec{A}\vec{K}_s)^\dagger (\vec{K}_s^\T \vec{A}), \end{equation}
where \(\vec{\Omega}\in \mathbb{R}^{d\times (r+2)}\) is a matrix of independent standard normal random variables and \(\vec{K}_s := [\vec{\Omega}~\vec{A}\vec{\Omega}~\cdots ~\vec{A}^{s-1}\vec{\Omega}] \in \mathbb{R}^{d\times s(r+2)}\). It can be guaranteed that if \(s=O(\log(d))\), then with high probability, \(\vec{A}\langle \vec{K}_s\rangle\) approximates \(\vec{A}\) with spectral-norm error comparable to that of the best rank-\(r\) approximation to \(\vec{A}\); i.e., \(\|\vec{A} - \vec{A}\langle \vec{K}_s\rangle\| = O(\lambda_{r+1})\) [3].
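To make the construction concrete, here is a minimal NumPy sketch (not the authors' code) that draws \(\vec{\Omega}\), forms the block Krylov matrix \(\vec{K}_s\), and returns the eigendecomposition of \(\vec{A}\langle \vec{K}_s\rangle\); the function name, the dense \(d \times d\) intermediate, and the explicit pseudoinverse are illustration-only choices that a practical implementation would avoid.

import numpy as np

def nystrom_block_krylov(A, r, s, rng=None):
    """Eigendecomposition of the Nystrom approximation A<K_s> from (2).

    A : (d, d) symmetric positive definite ndarray.
    r : target rank parameter; the random block has r + 2 columns.
    s : number of Krylov blocks; s = O(log d) suffices with high probability.
    Returns (U, D) with A<K_s> = U @ np.diag(D) @ U.T, eigenvalues decreasing.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = A.shape[0]
    Omega = rng.standard_normal((d, r + 2))

    # Build K_s = [Omega, A Omega, ..., A^{s-1} Omega] and orthonormalize.
    # A<K_s> depends only on range(K_s), so (assuming full column rank) the
    # QR step changes nothing mathematically but aids numerical stability.
    blocks, Z = [], Omega
    for _ in range(s):
        blocks.append(Z)
        Z = A @ Z
    K, _ = np.linalg.qr(np.hstack(blocks))

    # A<K_s> = (A K)(K^T A K)^+ (K^T A); for this small sketch we simply form
    # it as a dense d x d matrix and diagonalize (avoid this at scale).
    AK = A @ K
    Y = AK @ np.linalg.pinv(K.T @ AK) @ AK.T
    D, U = np.linalg.eigh(Y)
    return U[:, ::-1], D[::-1]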
Define a preconditioner
\begin{equation} \label{eqn:precond_def} \vec{P} := \frac{1}{\theta} \vec{U}\vec{D}\vec{U}^\T + (\vec{I} - \vec{U}\vec{U}^\T), \end{equation}
where \(\vec{U}\vec{D}\vec{U}^\T\) is the eigendecomposition of \(\vec{A}\langle \vec{K}_s\rangle\) and \(\theta > 0\) is a shift parameter. Following the approach of [1], we show that if \(\theta \in [\lambda_d,\lambda_{r+1}]\) and \(s = O(\log(d))\), then, with high probability,
\begin{equation} \kappa(\vec{P}^{-1/2}\vec{A}\vec{P}^{-1/2}) = O(\lambda_{r+1} / \lambda_d). \end{equation}
As a result, preconditioned-CG with the preconditioner (3) converges at a rate depending on \(\sqrt {\lambda _{r+1}/\lambda _d}\) [2]. If \(\vec {A}\) has just \(r\) large eigenvalues, the convergence of preconditioned-CG will be extremely rapid.
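To illustrate how the preconditioner might be applied, the following sketch (continuing the NumPy example above, and again only a sketch) applies \(\vec{P}^{-1}\) to a vector given the eigenpairs of \(\vec{A}\langle \vec{K}_s\rangle\); the truncation tolerance tol is an assumption added here so that only numerically nonzero eigenvalues are kept and \(\vec{P}\) remains invertible.

def apply_inv_preconditioner(U, D, theta, x, tol=1e-10):
    """Apply P^{-1} x for P = (1/theta) U diag(D) U^T + (I - U U^T), as in (3).

    U, D  : eigenvectors/eigenvalues of the Nystrom approximation A<K_s>;
            only eigenvalues above `tol` are kept so that P is invertible.
    theta : shift parameter, ideally somewhere in [lambda_d, lambda_{r+1}].
    """
    keep = D > tol
    Uk, Dk = U[:, keep], D[keep]
    c = Uk.T @ x
    # On range(Uk), P scales eigenvector i by D[i]/theta, so P^{-1} scales it
    # by theta/D[i]; on the orthogonal complement, P acts as the identity.
    return Uk @ (theta / Dk * c) + (x - Uk @ c)

In a preconditioned-CG routine (for instance, via the M argument of scipy.sparse.linalg.cg, which expects an operator approximating \(\vec{A}^{-1}\)), this map would be wrapped as a linear operator.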
One downside to Nyström preconditioning is the need to choose hyperparameters such as \(\theta\) and \(s\). Our observation is that, after \(t\) iterations, block-CG with starting block \([\vec{b}~\vec{\Omega}]\) has error at most that of Nyström preconditioned CG after \(t-s-1\) iterations. Thus, block-CG enjoys the effects of (Nyström) preconditioning without the need to construct a preconditioner or choose parameters.1 This allows us to prove the following convergence guarantee.2
1 We are assuming iterations, not matrix-vector products, are the dominant cost.
2 This bound is reminiscent of the “killing off the top eigenvalues” bounds for CG. However, instead of a burn-in period of \(r\) iterations, we require a burn-in period of \(O(\log (d))\) iterations (independent of \(r\)).
Theorem 1. Fix a value \(r\geq 0\) and let \(\vec {b}_2, \ldots , \vec {b}_{r+2}\) be independent standard Gaussian vectors. Then after \(t\) iterations the block-CG iterate \(\bcg _t\) corresponding to a starting block \([\vec {b}~\vec {b}_2~\cdots ~\vec {b}_{r+2}]\) satisfies, with probability at least \(99/100\),
\[ \frac {\| \vec {A}^{-1}\vec {b} - \bcg _t \|_{\vec {A}}}{\| \vec {A}^{-1}\vec {b} \|_{\vec {A}}} \leq 2 \exp \left ( - \frac {t-(3+\log (d)/2)}{3\sqrt {\lambda _{r+1}/\lambda _d}}\right ). \]
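For concreteness, here is a bare-bones block-CG sketch in the setting of Theorem 1, continuing the NumPy examples above. It is a generic textbook variant rather than the authors' implementation: a robust code would handle rank-deficient blocks (deflation) and possibly re-orthogonalize, both of which are omitted here.

def block_cg(A, b, r, t, rng=None):
    """Block CG on A x = b with starting block [b, b_2, ..., b_{r+2}].

    A minimal sketch of the iteration referenced in Theorem 1: b is augmented
    with r + 1 independent Gaussian columns, and the first column of the final
    block iterate approximates A^{-1} b.  No breakdown handling is included.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = len(b)
    B = np.column_stack([b, rng.standard_normal((d, r + 1))])

    X = np.zeros_like(B)     # block iterate
    R = B.copy()             # block residual
    P = R.copy()             # block search directions
    for _ in range(t):
        AP = A @ P
        G = P.T @ AP
        alpha = np.linalg.solve(G, P.T @ R)    # block step sizes
        X = X + P @ alpha
        R = R - AP @ alpha
        beta = -np.linalg.solve(G, AP.T @ R)   # keep directions A-conjugate
        P = R + P @ beta
    return X[:, 0]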
More generally, for any \(\mu \geq 0\), block-CG (and Nyström preconditioned CG) can be used to solve the regularized linear system
\begin{equation} \label{eqn:regularized_ls2} (\vec{A}+\mu \vec{I}) \vec{x} = \vec{b}. \end{equation}
Systems of the form (5) arise in a variety of settings, but we are particularly motivated by two critical tasks in machine learning and data science: solving ridge-regression problems and sampling Gaussian vectors. By adapting our block-CG bound (Theorem 1), we obtain state-of-the-art convergence guarantees for existing Lanczos-based methods for these tasks.
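Because block-CG accesses \(\vec{A}\) only through matrix products, handling (5) amounts to running the same iteration with the shifted operator. A hypothetical usage of the block_cg sketch above (the random test matrix, \(\mu\), block size, and iteration count are arbitrary choices for illustration):

# Hypothetical usage: solve (A + mu I) x = b with the block_cg sketch above.
rng = np.random.default_rng(0)
d, mu = 1000, 1e-2
M = rng.standard_normal((d, d))
A = M @ M.T / d                                  # a random SPD test matrix
b = rng.standard_normal(d)

x_hat = block_cg(A + mu * np.eye(d), b, r=10, t=25, rng=rng)
print(np.linalg.norm((A + mu * np.eye(d)) @ x_hat - b))   # residual norm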
References
[1] Z. Frangella, J. Tropp, and M. Udell (2023). Randomized Nyström Preconditioning. SIAM Journal on Matrix Analysis and Applications, 44(2), 718–752.
[2] A. Greenbaum (1997). Iterative Methods for Solving Linear Systems. Society for Industrial and Applied Mathematics.
[3] J. Tropp and R. Webber (2023). Randomized Algorithms for Low-Rank Matrix Approximation: Design, Analysis, and Applications. arXiv preprint arXiv:2306.12418.