
Spectral problems through the lens of optimization:
new ideas and improved algorithms?

Bart Vandereycken

Abstract

Thanks to influential works like [8, 1], many classical problems in numerical linear algebra (NLA) can be formulated as optimization problems on smooth manifolds. The link with optimization on manifolds allows us to approach these problems from the world of numerical optimization. The archetypical example is the symmetric eigenvalue problem (EVP): the dominant \(k\)-dimensional eigenspace of \(A\) corresponds to the global minimizer of the partial trace function

\begin{equation} \label {eq:min_f_over_Gr} f(X) = -\textrm {Trace}(X^TAX), \end{equation}

where \(X \in \mathbb {R}^{n \times k}\) is an orthonormal matrix (that is, \(X^T X = I_k\)). Since the partial trace is invariant under orthogonal transformations on the right (that is, \(X \leadsto XQ\) with orthogonal \(Q\)), this problem is naturally stated on \(\textrm {Gr}(n,k)\), the Grassmann manifold of \(k\)-dimensional subspaces in \(\mathbb {R}^n\). Minimizing \(f\) by the Riemannian steepest descent method is, in specific cases, equivalent to the power method.
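For concreteness, here is a small worked sketch of that last statement (standard material, not taken verbatim from the cited works). With the metric inherited from the embedding \(\mathbb {R}^{n \times k}\), the Riemannian gradient of \(f\) at \(\mathrm {span}(X)\) is the Euclidean gradient \(-2AX\) projected onto the horizontal space at \(X\),

\[ \operatorname {grad} f(X) = -2\, (I_n - XX^T) A X, \]

so a steepest descent step with step size \(\eta \), followed by re-orthonormalization, reads \(X_+ = \mathrm {qf}\bigl (X + 2 \eta (I_n - XX^T) A X\bigr )\), where \(\mathrm {qf}\) returns the Q-factor of a thin QR decomposition. For \(k = 1\) and the particular step size \(\eta = 1/(2\, x^T A x)\) (assuming \(x^T A x \neq 0\)), the update becomes \(x_+ \propto Ax\), that is, exactly one step of the power method.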

It is well known that the steepest descent method converges exponentially fast, both in distance to the optimizer and in function value, when the objective function is locally strongly convex. For spectral problems in NLA, this means that a nonzero spectral gap is required to ensure uniqueness of the optimal subspace, and that the initial estimate has to be sufficiently close to that subspace. Unfortunately, the latter condition is usually very stringent. For a symmetric matrix \(A\) with eigenvalues \(\lambda _1 \geq \cdots \geq \lambda _n\), for example, we have shown in [5] that the function \(f\) in (1) is geodesically convex on

\[ N = \left \{ \mathrm {span}(X) \in \textrm {Gr}(n,k) \colon \sin ^2 (\theta _k) \leq \frac {\lambda _k - \lambda _{k+1}}{\lambda _1 + \lambda _k} \right \}. \]

Here, \(\theta _k\) is the \(k\)th principal angle between \(\mathrm {span}(X)\) and the dominant eigenspace \(\mathrm {span}(V)\). Writing \(\delta = \lambda _k - \lambda _{k+1}\) for the spectral gap, this is an improvement over more direct estimates that require \(\theta _k = O(\delta )\), but the condition \(\theta _k = O(\sqrt {\delta })\) still forces the initial subspace to be quite close to the optimum.
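As a reminder of standard notation (not specific to [5]), the principal angles \(0 \leq \theta _1 \leq \cdots \leq \theta _k \leq \pi /2\) between \(\mathrm {span}(X)\) and \(\mathrm {span}(V)\), with \(X\) and \(V\) orthonormal, are obtained from the singular values of \(V^T X\):

\[ \cos \theta _i = \sigma _i (V^T X), \qquad i = 1, \dots , k, \]

so \(\theta _k\) measures the least aligned direction between the current subspace and the dominant eigenspace.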

Fortunately, classical (geodesic) convexity is not needed to have gradient descent converge exponentially fast. In the Euclidean case, an old result by [11] proves that the Polyak–Łojasiewicz (PL) condition,

\begin{equation} \label {eq:PL} \exists \mu >0 \quad \text {s.t.} \quad \| \nabla f(x) \|^2 \geq 2 \mu (f(x)-f^*), \quad \forall x\in \mathbb {R}^n, \end{equation}

is sufficient to guarantee fast (exponential) convergence in function value. The PL condition with constant \(\mu \) is weaker than \(\mu \)-strong convexity.
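The argument behind this classical result is short; here is a sketch under the standard additional assumption that \(\nabla f\) is \(L\)-Lipschitz. One gradient step \(x_+ = x - \frac {1}{L} \nabla f(x)\) then satisfies

\[ f(x_+) \leq f(x) - \frac {1}{2L} \| \nabla f(x) \|^2 \leq f(x) - \frac {\mu }{L} \bigl ( f(x) - f^* \bigr ), \]

and hence \(f(x_+) - f^* \leq (1 - \mu /L) (f(x) - f^*)\): the error in function value contracts at every iteration, without any convexity assumption.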

More recently, an even weaker notion of strong convexity, which relates to convergence with respect to the distance to the optimum, has been studied [7, 10, 5]. The property is called weak-quasi-strong-convexity (WQSC) and is defined in the Euclidean case as follows:

\[ \exists a > 0, \mu >0 \quad \text {s.t.} \quad f(x)-f^* \leq \frac {1}{a} \langle \nabla f(x) , x-x_p \rangle - \frac {\mu }{2} \| x-x_p\|^2, \quad \forall x \in \mathbb {R}^n, \]

with \(x_p\) the projection of \(x\) onto the set of minimizers of \(f\).
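To see how WQSC yields fast convergence in distance, here is a sketch, again under the additional assumption that \(\nabla f\) is \(L\)-Lipschitz (so that \(\| \nabla f(x) \|^2 \leq 2L (f(x) - f^*)\)). For a gradient step \(x_+ = x - \eta \nabla f(x)\) with \(0 < \eta \leq a/L\),

\[ \begin{aligned} \| x_+ - x_p \|^2 &= \| x - x_p \|^2 - 2 \eta \langle \nabla f(x), x - x_p \rangle + \eta ^2 \| \nabla f(x) \|^2 \\ &\leq (1 - a \mu \eta ) \| x - x_p \|^2 - 2 \eta (a - L \eta ) \bigl ( f(x) - f^* \bigr ) \leq (1 - a \mu \eta ) \| x - x_p \|^2, \end{aligned} \]

and since the distance from \(x_+\) to the set of minimizers is at most \(\| x_+ - x_p \|\), the squared distance to the optimum contracts by the factor \(1 - a \mu \eta \) at every step.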

We have shown in [5, 2] that the manifold version of the WQSC property applies to the following spectral problems:

Once WQSC is shown to hold, it can be used to analyse accelerated versions of gradient descent [7, 6]. For the symmetric EVP, the Riemannian conjugate gradient method from [4] also leads to practical improvements when compared to other accelerated gradient methods, like the LOBPCG method of [9].

Would it be possible to relax these generalized convexity properties even more? In other words, if gradient descent converges exponentially fast when started from any point of a set around the optimum, which property must \(f\) satisfy? As shown in [3], the objective needs to be WQSC when convergence is measured in distance to the optimum. Recently, we have also shown that only the PL condition is required for convergence in function value. Hence, PL and WQSC are in some sense necessary and sufficient for a fast gradient method.

An added bonus of the optimization viewpoint is that gapless problems (where the spectral gap vanishes) can be treated and analysed fairly easily. The convergence of gradient descent is then no longer exponential but only algebraic.

This talk will present a general overview of these properties and highlight algorithmic and analytical applications from NLA. The contents are based on joint work with Pierre-Antoine Absil, Foivos Alimisis, and Yousef Saad.

References