
Spectral problems through the lens of optimization:
new ideas and improved algorithms?

Bart Vandereycken

Abstract

Thanks to influential works like [8, 1], many classical problems in numerical linear algebra (NLA) can be formulated as optimization problems on smooth manifolds. The link with optimization on manifolds allows us to approach these problems from the world of numerical optimization. The archetypical example is the symmetric eigenvalue problem (EVP): the dominant \(k\)-dimensional eigenspace of \(A\) corresponds to the global minimizer of the partial trace function

\begin{equation} \label {eq:min_f_over_Gr} f(X) = -\textrm {Trace}(X^TAX), \end{equation}

where \(X \in \mathbb {R}^{n \times k}\) is an orthonormal matrix (that is, \(X^T X = I_k\)). Since the partial trace is invariant under orthogonal transformations on the right (that is, \(X \leadsto XQ\) with orthogonal \(Q\)), this problem is naturally stated on \(\textrm {Gr}(n,k)\), the Grassmann manifold of \(k\)-dimensional subspaces in \(\mathbb {R}^n\). Minimizing \(f\) by the Riemannian steepest descent method is, in specific cases, equivalent to the power method.
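For concreteness, here is a small worked sketch of that last statement (standard material, not taken verbatim from the cited works). With the metric inherited from the embedding \(\mathbb {R}^{n \times k}\), the Riemannian gradient of \(f\) at \(\mathrm {span}(X)\) is the Euclidean gradient \(-2AX\) projected onto the horizontal space at \(X\),

\[ \operatorname {grad} f(X) = -2\, (I_n - XX^T) A X, \]

so a steepest descent step with step size \(\eta \), followed by re-orthonormalization, reads \(X_+ = \mathrm {qf}\bigl (X + 2 \eta (I_n - XX^T) A X\bigr )\), where \(\mathrm {qf}\) returns the Q-factor of a thin QR decomposition. For \(k = 1\) and the particular step size \(\eta = 1/(2\, x^T A x)\) (assuming \(x^T A x \neq 0\)), the update becomes \(x_+ \propto Ax\), that is, exactly one step of the power method.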

It is well known that the steepest descent method converges exponentially fast, both in distance to the optimizer and in function value, when the objective function is locally strongly convex. For spectral problems in NLA, this means that a nonzero spectral gap is required to ensure uniqueness of the optimal subspace, and that the initial estimate has to be sufficiently close to that subspace. Unfortunately, the latter condition is usually very stringent. For a symmetric matrix \(A\) with eigenvalues \(\lambda _1 \geq \cdots \geq \lambda _n\), for example, we have shown in [5] that the function \(f\) in (1) is geodesically convex on

\[ N = \left \{ \mathrm {span}(X) \in \textrm {Gr}(n,k) \colon \sin ^2 (\theta _k) \leq \frac {\lambda _k - \lambda _{k+1}}{\lambda _1 + \lambda _k} \right \}. \]

Here, \(\theta _k\) is the \(k\)th principal angle between \(\mathrm {span}(X)\) and the dominant eigenspace \(\mathrm {span}(V)\). Writing \(\delta = \lambda _k - \lambda _{k+1}\) for the spectral gap, this is an improvement over more direct estimates that require \(\theta _k = O(\delta )\), but the condition \(\theta _k = O(\sqrt {\delta })\) still forces the initial subspace to be quite close to the optimum.
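As a reminder of standard notation (not specific to [5]), the principal angles \(0 \leq \theta _1 \leq \cdots \leq \theta _k \leq \pi /2\) between \(\mathrm {span}(X)\) and \(\mathrm {span}(V)\), with \(X\) and \(V\) orthonormal, are obtained from the singular values of \(V^T X\):

\[ \cos \theta _i = \sigma _i (V^T X), \qquad i = 1, \dots , k, \]

so \(\theta _k\) measures the least aligned direction between the current subspace and the dominant eigenspace.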

Fortunately, classical (geodesic) convexity is not needed to have gradient descent converge exponentially fast. In the Euclidean case, an old result by [11] proves that the Polyak–Łojasiewicz (PL) condition,

\begin{equation} \label {eq:PL} \exists \mu >0 \quad \text {s.t.} \quad \| \nabla f(x) \|^2 \geq 2 \mu (f(x)-f^*), \quad \forall x\in \mathbb {R}^n, \end{equation}

is sufficient to guarantee fast (exponential) convergence in function value. The PL condition with constant \(\mu \) is weaker than \(\mu \)-strong convexity.
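The argument behind this classical result is short; here is a sketch under the standard additional assumption that \(\nabla f\) is \(L\)-Lipschitz. One gradient step \(x_+ = x - \frac {1}{L} \nabla f(x)\) then satisfies

\[ f(x_+) \leq f(x) - \frac {1}{2L} \| \nabla f(x) \|^2 \leq f(x) - \frac {\mu }{L} \bigl ( f(x) - f^* \bigr ), \]

and hence \(f(x_+) - f^* \leq (1 - \mu /L) (f(x) - f^*)\): the error in function value contracts at every iteration, without any convexity assumption.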

More recently, an even weaker notion of strong convexity, which relates to convergence with respect to the distance to the optimum, has been studied [7, 10, 5]. The property is called weak-quasi-strong-convexity (WQSC) and is defined in the Euclidean case as follows:

\[ \exists a > 0, \mu >0 \quad \text {s.t.} \quad f(x)-f^* \leq \frac {1}{a} \langle \nabla f(x) , x-x_p \rangle - \frac {\mu }{2} \| x-x_p\|^2, \quad \forall x \in \mathbb {R}^n, \]

with \(x_p\) the projection of \(x\) onto the set of minimizers of \(f\).
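To see how WQSC yields fast convergence in distance, here is a sketch, again under the additional assumption that \(\nabla f\) is \(L\)-Lipschitz (so that \(\| \nabla f(x) \|^2 \leq 2L (f(x) - f^*)\)). For a gradient step \(x_+ = x - \eta \nabla f(x)\) with \(0 < \eta \leq a/L\),

\[ \begin{aligned} \| x_+ - x_p \|^2 &= \| x - x_p \|^2 - 2 \eta \langle \nabla f(x), x - x_p \rangle + \eta ^2 \| \nabla f(x) \|^2 \\ &\leq (1 - a \mu \eta ) \| x - x_p \|^2 - 2 \eta (a - L \eta ) \bigl ( f(x) - f^* \bigr ) \leq (1 - a \mu \eta ) \| x - x_p \|^2, \end{aligned} \]

and since the distance from \(x_+\) to the set of minimizers is at most \(\| x_+ - x_p \|\), the squared distance to the optimum contracts by the factor \(1 - a \mu \eta \) at every step.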

We have shown in [5, 2] that the manifold version of the WQSC property applies to the following spectral problems:

Once WQSC is shown to hold, it can be used to analyse accelerated versions of gradient descent [7, 6]. For the symmetric EVP, the Riemannian conjugate gradient method from [4] also leads to practical improvements when compared to other accelerated gradient methods, like the LOBPCG method of [9].

Would it be possible to relax these generalized convexity properties even more? In other words, if gradient descent converges exponentially fast when started from any point of a set around the optimum, which property must \(f\) satisfy? As shown in [3], the objective needs to be WQSC when convergence is measured in distance to the optimum. Recently, we have also shown that only the PL condition is required for convergence in function value. Hence, PL and WQSC are in some sense necessary and sufficient for a fast gradient method.

An added bonus of the optimization viewpoint is that gapless problems (where the spectral gap vanishes) can be treated and analysed fairly easily. The convergence of gradient descent is then no longer exponential but only algebraic.

This talk will present a general overview of these properties and highlight algorithmic and analytical applications from NLA. The contents are based on joint work with Pierre-Antoine Absil, Foivos Alimisis, and Yousef Saad.

References