\(\newcommand {\R }{\mathbb {R}}\) \(\newcommand {\Ab }{\mathbf {A}}\) \(\newcommand {\bv }[1]{\mathbf {#1}}\) \(\newcommand {\vb }{\bv {v}}\) \(\newcommand {\E }{\mathbb {E}}\) \(\newcommand {\norm }[1]{\|#1\|}\) \(\DeclareMathOperator {\poly }{poly}\) \(\DeclareMathOperator {\tr }{tr}\)

Improved Spectral Density Estimation via Explicit and Implicit Deflation\(^{1}\)

Rajarshi Bhattacharjee, Rajesh Jayaram, Cameron Musco, Christopher Musco, Archan Ray

Abstract

We study algorithms for approximating the spectral density of a symmetric matrix \(\Ab \in \R ^{n \times n}\) that is accessed through matrix-vector products. By combining an existing Chebyshev polynomial moment matching method with a deflation step that approximately projects off the largest magnitude eigendirections of \(\Ab \) before estimating the spectral density, we give an \(\epsilon \sigma _\ell (\Ab )\) error approximation in the Wasserstein-\(1\) metric using \(O(\ell \log n+1/\epsilon )\) matrix-vector products, where \(\sigma _\ell (\Ab )\) is the \(\ell ^{th}\) largest singular value of \(\Ab \). When \(\Ab \) exhibits fast singular value decay, this can be much stronger than prior work, which gives error \(\epsilon \sigma _1(\Ab )\) using \(O(1/\epsilon )\) matrix-vector products. We also show that our bound is nearly tight: \(\Omega (\ell +1/\epsilon )\) matrix-vector products are required to achieve error \(\epsilon \sigma _\ell (\Ab )\).

We further show that the popular Stochastic Lanczos Quadrature (SLQ) method matches the above bound, even though SLQ itself is parameter-free and performs no explicit deflation. This explains the strong practical performance of SLQ, and motivates a simple variant that achieves an even tighter error bound. Our error bound for SLQ leverages an analysis that views it as an implicit polynomial moment matching method, along with recent results on low-rank approximation with single-vector Krylov methods. We use these results to show that SLQ can perform implicit deflation as part of moment matching.

\(^{1}\) Based on a paper to appear at SODA 2025. A preprint is available at [BJM\(^{+}\)24].

1 Introduction

Given a symmetric matrix \(\Ab \in \R ^{n \times n}\) with eigenvalues \(\lambda _1(\Ab ), \ldots , \lambda _n(\Ab )\), the spectral density of \(\Ab \) is defined as:

\begin{align*} s_{\Ab }(x)=\frac {1}{n}\sum _{i=1}^n \delta (x-\lambda _i(\Ab )), \end{align*} where \(\delta (\cdot )\) is the Dirac delta function. The spectral density \(s_{\bv A}\) can be computed directly by performing a full eigendecomposition of \(\bv A\) in \(O(n^\omega )\) time, where \(\omega \approx 2.37\) is the exponent of fast matrix multiplication. However, when \(\Ab \) is very large, or when \(\bv A\) can only be accessed through a small number of queries, we instead seek to approximate \(s_{\Ab }\) by some \(\tilde s_{\bv A}\) that is close to \(s_{\bv A}\) in some metric; this task is known as spectral density estimation (SDE). SDE is applied throughout the sciences [Ski89, SR94, STBB17, SRS20], network science [FDBV01, EG17, DBB19], machine learning and deep learning in particular [RL18, PSG18, MM19, GKX19], numerical linear algebra [DNPS16, LXES19], and beyond. In this work, we focus on the Wasserstein-\(1\) (i.e., earth mover’s) distance, \(W_1(s_{\bv A}, \tilde s_{\bv A})\), which has been studied in a number of recent works giving formal approximation guarantees for SDE [CTU21, BKM22, CTU22]. Moreover, \(\Ab \) will be accessed only through matrix-vector queries of the form \(\Ab \vb \) for a chosen query vector \(\vb \). Most state-of-the-art matrix-vector query algorithms for SDE are based on Krylov subspace methods, which fall into two general classes.
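To make the objective concrete, the following small numpy/scipy sketch (our own illustration, not code from the paper) computes an exact spectral density and measures the Wasserstein-\(1\) error of a discrete approximation to it:

```python
# Illustration only (not from the paper): the spectral density of a symmetric
# matrix is the uniform distribution over its eigenvalues, so for discrete
# approximations the Wasserstein-1 error can be computed directly with scipy.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
n = 500
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                      # symmetric test matrix

eigs = np.linalg.eigvalsh(A)           # s_A puts mass 1/n on each eigenvalue

# A deliberately crude approximation: perturb each eigenvalue slightly.
approx = eigs + 0.01 * rng.standard_normal(n)

# W_1 between the two discrete distributions (equal weights 1/n).
print(f"W1 error: {wasserstein_distance(eigs, approx):.4f}")
```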

Moment Matching. The first class of methods approximates \(s_{\bv A}\) by approximating its polynomial moments. That is, for a low-degree polynomial \(p\), \(\E _{s_{\bv A}} [p(x)] = \frac {1}{n} \sum _{i=1}^n p(\lambda _i(\bv A)) = \frac {1}{n}\tr (p(\bv A))\). We can employ stochastic trace estimation methods like Hutchinson’s method [Gir87, Hut90] to approximate this trace using just a small number of matrix-vector products with \(p(\bv A)\), and in turn with \(\bv A\): if \(p\) has degree \(k\), a single matrix-vector product with \(p(\bv A)\) can be performed using \(k\) matrix-vector products with \(\bv A\). After approximating the moments for a set of low-degree polynomials (e.g., the first \(k\) monomials, or the first \(k\) Chebyshev polynomials), we can let \(\tilde s_{\bv A}\) be a distribution that matches these moments as closely as possible, and thus should closely match \(s_{\bv A}\). Moment matching methods include the popular Kernel Polynomial Method (KPM) [SR94, Wan94, WWAF06] and its variants [CPB10, LSY16, BKM22, Che23]. Braverman et al. [BKM22] analyze a Chebyshev moment matching method, which can be thought of as a simple variant of KPM, showing that it can compute \(\tilde s_{\bv A}\) satisfying \(W_1(s_{\bv A},\tilde s_{\bv A}) \le \epsilon \cdot \norm {\bv A}_2\) with probability \(\ge 1-\delta \) using just \(O(b/\epsilon )\) matrix-vector products, where \(b = \max (1, \frac {1}{n \epsilon ^2} \log ^2 \frac {1}{\epsilon \delta } \log ^2 \frac {1}{\epsilon })\). Note that \(b = 1\) in the common case when \(\epsilon = \tilde \Omega (1/\sqrt {n})\). Here \(\norm {\bv A}_2\) denotes the spectral norm of \(\bv A\), i.e., its largest eigenvalue magnitude. They prove a similar guarantee for KPM itself, but with a worse dependence on \(\epsilon \).
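As a concrete, hedged illustration of this idea, the following numpy sketch estimates the normalized Chebyshev moments \(\frac {1}{n}\tr (T_k(\bv A))\) via Hutchinson’s estimator and the three-term Chebyshev recurrence. The function name and interface are our own, and we assume \(\bv A\) has been scaled so its eigenvalues lie in \([-1,1]\), as Chebyshev-based methods require:

```python
import numpy as np

def chebyshev_moment_estimates(matvec, n, k_max, num_vecs=10, seed=0):
    """Estimate (1/n) * tr(T_k(A)) for k = 0..k_max using only matrix-vector
    products with A (via `matvec`): k_max products per query vector."""
    rng = np.random.default_rng(seed)
    moments = np.zeros(k_max + 1)
    for _ in range(num_vecs):
        v = rng.choice([-1.0, 1.0], size=n)   # Rademacher query vector
        t_prev, t_curr = v, matvec(v)          # T_0(A)v and T_1(A)v
        moments[0] += v @ t_prev
        if k_max >= 1:
            moments[1] += v @ t_curr
        for k in range(2, k_max + 1):
            # Chebyshev recurrence: T_k(A)v = 2 A T_{k-1}(A)v - T_{k-2}(A)v
            t_prev, t_curr = t_curr, 2 * matvec(t_curr) - t_prev
            moments[k] += v @ t_curr
    return moments / (n * num_vecs)
```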

Lanczos-Based Methods. This class of methods computes a small number of approximate eigenvalues of \(\bv A\) using the Lanczos method, and lets \(\tilde s_{\bv A}\) be a distribution supported on these approximate eigenvalues, with appropriately chosen probability mass placed at each. The canonical method of this form is Stochastic Lanczos Quadrature (SLQ) [CTU21, GM09]. Many other variants have also been studied; some place probability mass not just at the approximate eigenvalues, but on Gaussian or other simple distributions centered at these eigenvalues [LG82, BRP92, LSY16, HHK72]. Chen et al. [CTU21, CTU22] prove that the Lanczos-based SLQ method gives essentially the same approximation bound as [BKM22]: error \(\epsilon \cdot \norm {\bv A}_2\) using \(O(1/\epsilon )\) matrix-vector products when \(\epsilon = \tilde \Omega (1/\sqrt {n})\)\(^{2}\).

\(^{2}\) The notations \(\tilde O\) and \(\tilde \Omega \) indicate that logarithmic factors are suppressed.
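For reference, here is a minimal single-run SLQ sketch in numpy. It is our own rendering; practical implementations differ in details such as reorthogonalization strategy and averaging over multiple runs. After \(m\) Lanczos iterations, the output density places mass \((e_1^T y_i)^2\) at each eigenvalue \(\theta _i\) of the tridiagonal matrix \(T\), where \(y_i\) is the corresponding eigenvector:

```python
import numpy as np

def slq_density(matvec, n, m, seed=0):
    """One run of SLQ: m Lanczos iterations from a random unit vector."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    Q = np.zeros((n, m))
    alpha, beta = np.zeros(m), np.zeros(max(m - 1, 0))
    Q[:, 0] = q
    for j in range(m):
        w = matvec(Q[:, j])
        alpha[j] = Q[:, j] @ w
        # Full reorthogonalization for numerical stability.
        w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            if beta[j] < 1e-12:          # invariant subspace found; stop early
                m, alpha, beta = j + 1, alpha[:j + 1], beta[:j]
                break
            Q[:, j + 1] = w / beta[j]
    T = np.diag(alpha) + np.diag(beta[:m - 1], 1) + np.diag(beta[:m - 1], -1)
    ritz_vals, ritz_vecs = np.linalg.eigh(T)
    weights = ritz_vecs[0, :] ** 2       # quadrature weights; they sum to 1
    return ritz_vals, weights            # mass weights[i] at ritz_vals[i]
```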

2 Our Results

Our main contribution is to show that both moment matching and Lanczos-based methods for SDE can achieve improved bounds on \(W_1(s_{\bv A},\tilde s_{\bv A})\) that depend on \(\sigma _{l+1}(\bv A)\), the \((l+1)^{st}\) largest singular value of \(\bv A\), for some parameter \(l\), instead of on \(\norm {\bv A}_2\). For matrices that exhibit spectral decay, and thus have \(\sigma _{l+1}(\bv A) \ll \sigma _1(\bv A) = \norm {\bv A}_2\), our bounds can be much stronger than the bound \(W_1(s_{\bv A},\tilde s_{\bv A}) \le \epsilon \cdot \norm {\bv A}_2\) achieved in prior work, which roughly corresponds to estimating each eigenvalue to average error \(\epsilon \cdot \norm {\bv A}_2\). We also provide a lower bound showing that our bounds are near-optimal, up to logarithmic factors.

2.1 Improved SDE via Moment Matching with Explicit Deflation

Our first contribution is a modification of the moment matching method of [BKM22] that first ‘deflates’ off any eigenvalue of \(\bv {A}\) with magnitude significantly larger than \(\sigma _{l+1}(\bv {A})\) before estimating the spectral density. Eigenvalue deflation is widely applied throughout numerical linear algebra, to problems like linear system solving [BEPW98, FV01, GOSS16, FTU23], trace estimation [GSO17, Lin17, MMMW21], norm estimation [MNS\(^{+}\)18], and beyond [CS97]. Specifically, our method uses a block Krylov subspace method to first compute highly accurate approximations to the \(p\) largest magnitude eigenvalues of \(\bv {A}\), for some \(p \le l\), along with an orthonormal matrix \(\bv Z \in \R ^{n \times p}\) whose columns approximate the corresponding eigenvectors. It then uses moment matching to estimate the spectral density of \(\bv {A}\) projected away from these approximate eigendirections, i.e., of \((\bv I - \bv Z \bv Z^T)\bv {A} (\bv I-\bv Z \bv Z^T)\), achieving error \(\epsilon \sigma _{l+1}(\bv {A})\) since this matrix has spectral norm bounded by \(O(\sigma _{p+1}(\bv {A})) = O(\sigma _{l+1}(\bv {A}))\) when \(\bv Z\) is sufficiently accurate. Finally, it modifies this approximate density to account for the probability mass at the top \(p\) eigenvalues. While block Krylov methods are well understood for related tasks like eigenvalue and eigenvector computation [Par98, Tro18], low-rank approximation [HMT11, MM15], singular value approximation [MM15, MNS\(^{+}\)18, BN23], and linear system solving [LSY98, Saa03], our work requires a careful analysis of eigenvalue/eigenvector approximation with these methods that may be of independent interest. Overall, the above approach gives the following result (a code sketch of the deflate-then-estimate pipeline follows Theorem 1):

  • Theorem 1 (SDE with Explicit Deflation). For any \(\epsilon \in (0,1)\), \(l \in [n]\), and any constants \(c_1,c_2 > 0\), Algorithm 1 of [BJM\(^{+}\)24] performs \(O\left (l\log n+\frac {b}{\epsilon } \right )\) matrix-vector products with \(\Ab \) where \(b=\max \left (1,\frac {1}{n\epsilon ^2}\log ^2 \frac {n}{\epsilon }\log ^2 \frac {1}{\epsilon } \right )\) and computes a probability density function \(\tilde s_{\bv {A}}\) such that, with probability at least \(1-\frac {1}{n^{c_1}}\),

    \[W_1(s_{\Ab },\tilde s_{\bv {A}}) \leq \epsilon \cdot \sigma _{l+1}(\Ab ) +\frac {\|\Ab \|_2}{n^{c_2}}.\]

The additive error \(\frac {\|\Ab \|_2}{n^{c_2}}\) can be thought of as negligible: it is comparable, e.g., to the round-off error incurred when directly computing \(s_{\bv {A}}\) using a full eigendecomposition in finite-precision arithmetic [BGVKS22].
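To make the approach concrete, here is a hedged sketch of the deflate-then-estimate pipeline. Algorithm 1 of [BJM\(^{+}\)24] is more careful (e.g., in how it chooses \(p\) and certifies the accuracy of \(\bv Z\)); below, scipy’s `eigsh` stands in for the paper’s block Krylov method, and `estimate_density` is a hypothetical stand-in for any moment matching SDE routine that returns a discrete node/weight approximation:

```python
# Sketch only: deflate the top-p eigendirections, run SDE on the remainder,
# then move the deflated mass back to the computed top eigenvalues.
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def deflated_sde(A, p, estimate_density):
    n = A.shape[0]
    # Step 1: approximate the top-p (largest magnitude) eigenpairs of A.
    top_vals, Z = eigsh(A, k=p, which='LM')
    # Step 2: matvec access to the deflated matrix (I - ZZ^T) A (I - ZZ^T),
    # whose spectral norm is O(sigma_{p+1}(A)) when Z is accurate.
    def defl_mv(v):
        v = v - Z @ (Z.T @ v)
        v = A @ v
        return v - Z @ (Z.T @ v)
    nodes, weights = estimate_density(LinearOperator((n, n), matvec=defl_mv))
    weights = np.asarray(weights, dtype=float).copy()
    # Step 3: deflation replaced the top-p eigenvalues with (near-)zeros, so
    # remove p/n of the estimated mass nearest zero...
    to_remove = p / n
    for i in np.argsort(np.abs(nodes)):
        take = min(weights[i], to_remove)
        weights[i] -= take
        to_remove -= take
        if to_remove <= 0:
            break
    # ...and place mass 1/n at each computed top eigenvalue instead.
    nodes = np.concatenate([nodes, top_vals])
    weights = np.concatenate([weights, np.full(p, 1.0 / n)])
    return nodes, weights
```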

We further show that our algorithm is optimal amongst all matrix-vector query algorithms, up to logarithmic factors and the negligible additive error term. Our proof leverages an existing lower bound for distinguishing Wishart matrices of different ranks, previously used to give matrix-vector query lower bounds for the closely related problem of eigenvalue estimation [SW23]. Formally:

  • Theorem 2 (SDE Lower Bound). Any (possibly randomized) algorithm that given symmetric \(\bv {A} \in \R ^{n \times n}\) outputs \(\tilde s_{\bv {A}}\) such that, with probability at least \(1/2\), \(W_1(s_\bv {A},\tilde s_{\bv {A}}) \le \epsilon \sigma _{l+1}(\Ab )\) for \(\epsilon \in (0,1)\) and \(l \in [n]\) must make \(\Omega \left (l +\frac {1}{\epsilon } \right )\) (possibly adaptively chosen) matrix-vector queries to \(\Ab \).

2.2 Implicit Deflation Bounds for Stochastic Lanczos Quadrature

Our second contribution is to show that the popular Stochastic Lanczos Quadrature (SLQ) method for SDE [LSY16, CTU21] nearly matches the improved error bound of Theorem 1 for any choice of \(l\), even though SLQ is ‘parameter-free’ and performs no explicit deflation step. This result helps to justify the strong practical performance of SLQ and the observed ‘spectrum adaptive’ nature of this method as compared to standard moment matching-based methods like KPM [CTU21].

A key idea used to achieve this bound is to view SLQ as an implicit moment matching method as in [CTU21, CTU22], and to analyze it similarly to KPM and other explicit moment matching methods. We combine this analysis approach with recent work on low-rank approximation with single-vector (i.e., non-block) Krylov methods [MMM24] to show that SLQ can perform ‘implicit deflation’ as part of moment matching to achieve the improved error bound. Formally, we have:

  • Theorem 3 (SDE with SLQ). Let \(l \in [n]\), and \(\epsilon , \delta \in (0,1)\). Let \(g_{\min }=\min _{i \in [l]} \frac {\sigma _i(\Ab )-\sigma _{i+1}(\Ab )}{\sigma _i(\Ab )}\) and \(\kappa =\frac {\|\Ab \|_2}{\sigma _{l+1}(\bv {A})}\). SLQ run for \(m = O(l\log \frac {1}{g_{\min }}+\frac {1}{\epsilon }\log \frac {n \cdot \kappa }{\delta })\) iterations performs \(m\) matrix-vector products with \(\bv {A}\) and outputs a probability density function \(\tilde s_{\bv {A}}\) such that, with probability at least \(1-\delta \),

    \[W_1(s_{\Ab },\tilde s_{\bv {A}}) \leq \tilde {O}\left (\epsilon \cdot \sigma _{l+1}(\bv {A}) + \frac {\sigma _{l+1}(\Ab )}{\sqrt {n}}+\frac {l}{n}\|\Ab \|_2 \right ).\]

Theorem 3 essentially matches our result for moment matching with explicit deflation (Theorem 1), up to some small caveats discussed below. First, the number of matrix-vector products has a logarithmic dependence on the minimum gap \(g_{\min }\) amongst the top \(l\) singular values, as well as on the condition number \(\kappa =\frac {\| \Ab \|_2}{\sigma _{l+1}(\bv {A})}\). The dependence on the minimum gap is inherent, as non-block Krylov methods like SLQ require a dependence on \(g_{\min }\) in order to perform deflation/low-rank approximation [MMM24]. We note that, in practice, \(g_{\min }\) is generally not too small. Also, by adding a random perturbation to \(\bv {A}\) with spectral norm bounded by \(\frac {\norm {\bv {A}}_2}{\poly (n)}\), one can ensure that both \(g_{\min } \ge \frac {1}{\poly (n)}\) and \(\kappa \leq \poly (n)\) with high probability, and thus replace the \(O(l \log \frac {1}{g_{\min }})\) term with \(O(l \log n)\) and the \(O(\frac {\log (n \kappa )}{\epsilon })\) term with \(O(\frac {\log n}{\epsilon })\), matching Theorem 1. See, e.g., [MMM24].
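This perturbation preprocessing admits a one-function sketch (our own rendering; the exponent `c` and the Gaussian model are illustrative choices, and in a strictly matvec-limited setting the spectral norms would be estimated, e.g., by the power method, rather than computed exactly):

```python
import numpy as np

def randomly_perturb(A, c=3, seed=0):
    """Return A + E, with E random symmetric and ||E||_2 = ||A||_2 / n^c.
    By Weyl's inequality this shifts each eigenvalue, and hence W1(s_A, s_{A+E}),
    by at most ||E||_2, while w.h.p. enforcing 1/poly(n)-size gaps."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n, n))
    E = (G + G.T) / 2
    E *= np.linalg.norm(A, 2) / (n**c * np.linalg.norm(E, 2))
    return A + E
```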

Second, Theorem 3 has an additional error term of size \(\tilde O(\sigma _{l+1}(\bv {A})/\sqrt {n})\). This term is lower order whenever \(\epsilon = \tilde \Omega (1/\sqrt {n})\). Further, we believe that this term, along with the dependence on \(g_{\min }\), can be removed by using a variant of SLQ that is popular in practice, in which the densities output by multiple independent runs of the method are averaged together to produce \(\tilde s_{\bv {A}}\).

Finally, Theorem 3 has an additional error term of size \(\tilde O(\norm {\bv {A}}_2 \cdot l/n)\). In the natural case when we run for \(m \ll n\) iterations, and thus \(l \ll n\), this term is small. However, it cannot be avoided: even for a matrix of rank at most \(l\) with well-separated eigenvalues, while the Lanczos method will converge to near-exact approximations of these eigenvalues (with error bounded by \(\frac {\norm {\bv {A}}_2}{n^c}\)), the probability distribution output by SLQ will not place mass exactly \(1/n\) at these approximate eigenvalues, and thus will not achieve SDE error \(O(\frac {\norm {\bv {A}}_2}{n^c})\).

This limitation motivates us to introduce a simple variant of SLQ, which we call variance-reduced SLQ, that places mass exactly \(1/n\) at any eigenvalue computed by Lanczos that has converged to sufficiently small error (a minimal code rendering follows Theorem 4). This variant gives the following stronger error bound:

  • Theorem 4 (SDE with Variance Reduced SLQ). Let \(l \in [n]\), and \(\epsilon ,\delta \in (0,1)\). Let \(g_{\min }=\min _{i \in [l]} \frac {\sigma _i(\Ab )-\sigma _{i+1}(\Ab )}{\sigma _i(\Ab )}\) and \(\kappa =\frac {\|\Ab \|_2}{\sigma _{l+1}(\bv {A})}\). Algorithm 5 of [BJM\(^{+}\)24] run for \(m = O(l\log \frac {1}{g_{\min }}+\frac {1}{\epsilon }\log \frac {n \cdot \kappa }{\delta })\) iterations performs \(m\) matrix-vector products with \(\bv {A}\) and outputs a probability density function \(\tilde s_{\bv {A}}\) such that, with probability at least \(1-\delta \), for some fixed constant \(c>0\),

    \[W_1(s_{\Ab },\tilde s_{\bv {A}}) \leq \tilde {O}\left (\epsilon \cdot \sigma _{l+1}(\bv {A}) +\frac {\sigma _{l+1}(\Ab )}{\sqrt {n}}+ \frac {l}{n}\sigma _{l+1}(\Ab ) \right )+\frac {\| \Ab \|_2}{n^c}.\]
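The core modification admits a short sketch. Algorithm 5 of [BJM\(^{+}\)24] is more involved; below is only our own minimal rendering of the ‘snap converged Ritz weights to \(1/n\)’ idea, applied to the output of a Lanczos run (e.g., the SLQ sketch above, extended to also return the eigenvectors of \(T\) and the final residual norm `beta_last`). The convergence test uses the standard Lanczos residual bound:

```python
import numpy as np

def variance_reduce(ritz_vals, weights, ritz_vecs, beta_last, n, tol):
    """Snap the weight of every converged Ritz value to exactly 1/n."""
    weights = weights.copy()
    # Standard bound: the Ritz pair (theta_i, Q y_i) has residual norm at most
    # beta_last * |last entry of y_i|; small values certify convergence.
    residuals = beta_last * np.abs(ritz_vecs[-1, :])
    converged = residuals <= tol
    weights[converged] = 1.0 / n
    # Rescale the remaining weights so the total mass is again exactly 1.
    rest = weights[~converged].sum()
    if rest > 0:
        weights[~converged] *= (1.0 - converged.sum() / n) / rest
    return ritz_vals, weights
```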

3 Future Work

Our work suggests a number of directions that can be pursued in the future.

Lanczos-Based Matrix Function Approximation. Variants of SLQ and Lanczos have been used to obtain algorithms for estimating traces of matrix functions of \(\Ab \), \(\tr (f(\Ab ))\), for general functions \(f\) [UCS17, CTU22, CH23]. The Lanczos method itself can approximate certain matrix functions, such as rational functions, very accurately [ACG\(^{+}\)24]. Our deflation-based analysis, particularly that of variance-reduced SLQ, could be used to give improved spectrum-adaptive bounds for all of these methods.

Numerical stability. The Lanczos algorithm is known to suffer from numerical stability issues when implemented in finite-precision arithmetic [Che24]. A more careful analysis of how our algorithms perform in finite precision would be interesting. We note, however, that our experiments (Section 6 of [BJM\(^{+}\)24]) show that our algorithms perform well in practice.

References