
Matrix Analysis and Fast Solvers for Neural Network Computations

Jianlin Xia

Abstract

Neural networks provide a powerful tool for machine learning and other data science techniques. They can also serve as new tools for mathematical and numerical tasks such as function approximation and the solution of PDEs. Although there have been significant developments in neural network methods, the analysis of the relevant matrices and the design of fast and stable matrix computation techniques for them are typically overlooked.

In fact, neural network methods provide highly interesting new opportunities to perform matrix analysis and design matrix algorithms that can benefit modern data analysis and machine learning. Examples of scenarios where large and challenging matrices arise include the following.

In this talk, we aim to bridge the gap between some neural network methods and fast and reliable matrix computations. We present rigorous analysis for some of these matrices and show two aspects.

In particular, consider the approximation of a function \(u:\Omega (\subset \mathbb {R}^d)\to \mathbb {R}\) by

\[ v=\sum _{i=1}^{n}c_{i}\sigma ({\mathbf {w}}_{i}^T\mathbf {x}+b_i),\quad \mathbf {x}\in \Omega , \]

where \(\mathbf {w}_i\)’s are weight vectors, \(b_i\)’s are biases, \(c_i\)’s are scalar coefficients, and \(\sigma (t)=\max \{t,0\}\) is the ReLU function. Let \(W=(\mathbf {w}_1,\ldots ,\mathbf {w}_n)\), \(\mathbf {b}=(b_1,\ldots ,b_n)^T\), and \(\mathbf {c}=(c_1,\ldots ,c_n)\). The least-squares approximation of \(u\) by \(v\) solves the following optimization problem:

\[ \min _{W,\mathbf {b},\mathbf {c}} \mathcal {J} \quad \text {with}\quad \mathcal {J} := \left \langle v-u,v-u\right \rangle = \frac {1}{2} \int _{\Omega } (v-u)^2 \, d\mathbf {x}. \]
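As a concrete illustration (not part of the talk), the following is a minimal sketch of evaluating such a shallow ReLU model and estimating \(\mathcal {J}\) by Monte Carlo quadrature. The choice \(\Omega =[0,1]^2\), the toy target \(u\), and the function names (`relu`, `model`, `loss_mc`) are assumptions made purely for illustration.

```python
import numpy as np

def relu(t):
    """ReLU activation sigma(t) = max(t, 0)."""
    return np.maximum(t, 0.0)

def model(X, W, b, c):
    """Evaluate v(x) = sum_i c_i * sigma(w_i^T x + b_i) at the rows of X.

    X: (m, d) sample points, W: (d, n) weight vectors as columns,
    b: (n,) biases, c: (n,) linear coefficients.
    """
    return relu(X @ W + b) @ c

def loss_mc(X, W, b, c, u_vals, vol=1.0):
    """Monte Carlo estimate of J = (1/2) * int_Omega (v - u)^2 dx."""
    r = model(X, W, b, c) - u_vals
    return 0.5 * vol * np.mean(r**2)

# Illustrative usage on Omega = [0, 1]^2 with a toy target u (both assumptions).
rng = np.random.default_rng(0)
d, n, m = 2, 20, 5000
W = rng.standard_normal((d, n))
b = rng.standard_normal(n)
c = rng.standard_normal(n)
X = rng.random((m, d))                       # uniform samples in the unit square
u_vals = np.sin(np.pi * X[:, 0]) * X[:, 1]   # toy target function u(x)
print(loss_mc(X, W, b, c, u_vals))
```

In a training loop, \(\mathbf {c}\), \(W\), and \(\mathbf {b}\) would then be updated to decrease this objective; the block-wise view of the parameters described next makes that update structure explicit.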

By viewing \(\mathbf {c}\) as a set of linear parameters and \(\{W, \mathbf {b}\}\) as nonlinear parameters, we can look at the gradient of \(\mathcal {J}\) with respect to one of the two sets of parameters with the other set fixed [1]. Setting the gradients to \(0\) leads to a linear system for the linear parameters and a nonlinear system for the nonlinear parameters. The former system has a mass matrix \(\tilde {A}\) as the coefficient matrix. The solution of the latter system with Newton or Gauss-Newton methods leads to linear systems involving a Hessian or Gauss-Newton matrix \(\tilde {H}\). In a highly simplified setting, \(\tilde {A}\) and \(\tilde {H}\) may be related to the following matrix forms, respectively:

\begin{align*} A&=(A_{ij})_{n\times n},\quad A_{ij}=\left \langle \sigma ({\mathbf {w}}_{i}^T\mathbf {x}+b_i),\sigma ({\mathbf {w}}_{j}^T\mathbf {x}+b_j)\right \rangle ,\\ H&=(H_{ij})_{n\times n},\quad H_{ij}=\left \langle h({\mathbf {w}}_{i}^T\mathbf {x}+b_i),h({\mathbf {w}}_{j}^T\mathbf {x}+b_j)\right \rangle , \end{align*} where \(h(t)=\sigma '(t)=\begin {cases} 1, & t>0, \\ 0, & t<0. \end {cases}\)
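To make the connection concrete, here is a small sketch, under the same illustrative assumptions as above (uniform samples on \([0,1]^2\), a toy target \(u\)), of how \(A\) and \(H\) might be assembled by quadrature and how the linear system for \(\mathbf {c}\) arising from \(\nabla _{\mathbf {c}}\mathcal {J}=0\) could be solved. This is not the talk's implementation, only a plain dense sketch.

```python
import numpy as np

# Same illustrative setup as in the previous sketch: Omega = [0, 1]^2 (assumed),
# random weights/biases, and a toy target u.
rng = np.random.default_rng(0)
d, n, m = 2, 20, 5000
W = rng.standard_normal((d, n))
b = rng.standard_normal(n)
X = rng.random((m, d))
u_vals = np.sin(np.pi * X[:, 0]) * X[:, 1]
vol = 1.0                                  # volume of Omega

Z = X @ W + b                              # Z[k, i] = w_i^T x_k + b_i
Phi = np.maximum(Z, 0.0)                   # Phi[k, i] = sigma(w_i^T x_k + b_i)
Psi = (Z > 0.0).astype(float)              # Psi[k, i] = h(w_i^T x_k + b_i)

# Monte Carlo quadrature approximations of the Gram matrices:
#   A_ij ~ <sigma(w_i^T x + b_i), sigma(w_j^T x + b_j)>,
#   H_ij ~ <h(w_i^T x + b_i),     h(w_j^T x + b_j)>.
A = (vol / m) * Phi.T @ Phi
H = (vol / m) * Psi.T @ Psi

# With W and b fixed, setting the gradient of J with respect to c to 0 gives
# the linear system A c = f, where f_i = <sigma(w_i^T x + b_i), u>.
f = (vol / m) * Phi.T @ u_vals
c = np.linalg.lstsq(A, f, rcond=None)[0]   # A is typically dense and can be ill-conditioned
```

Even in this toy setting, \(A\) and \(H\) are dense and can become quite ill-conditioned as \(n\) grows, which is one reason the structured matrix methods mentioned below are attractive here.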

Some interesting matrix analysis may be performed for \(A\) and \(H\). For example, we can show the following aspects.

These problems thus give a nice opportunity to apply structured matrix methods. This also shows how advanced matrix analysis may benefit modern neural network methods. The talk includes joint work with Z. Cai, T. Ding, M. Liu, and X. Liu.

References