
Efficient sample average approximation techniques for hyperparameter estimation in Bayesian inverse problems

Julianne Chung, Malena Sabaté Landman, Scot M. Miller, Arvind K. Saibaba

Abstract

Inverse problems arise in many important applications, where the aim is to estimate some unknown inverse parameters from given observations. For large-scale problems, where the number of unknowns can be large (e.g., when reconstructing high-resolution or dynamic images) or where the observational datasets are huge, estimating the inverse parameters can be a computationally challenging task. Although there have been significant advancements in solving inverse problems, many of these approaches rely on a pre-determined, carefully tuned set of hyperparameters (e.g., defining the noise and prior models) that must be estimated from the data. The need to estimate these hyperparameters further exacerbates the problem, often requiring repeated solves for many combinations of hyperparameters. In this work, we propose a sample average approximation (SAA) method that couples a Monte Carlo estimator with a preconditioned Lanczos method for the efficient estimation of hyperparameters in Bayesian inverse problems.

We are interested in linear inverse problems that involve recovering the parameters \(\mathbf {s} \in \mathbb {R}^n\) from measurements \(\mathbf {d}\in \mathbb {R}^m\) that have been corrupted by additive Gaussian measurement noise \(\boldsymbol {\eta } \in \mathbb {R}^{m}\). The problem takes the form

\[ \mathbf {d} = \mathbf {A} \mathbf {s} + \boldsymbol {\eta }, \qquad \boldsymbol {\eta } \sim \mathcal {N}(\mathbf {0}, \mathbf {R}(\boldsymbol {\theta })), \]

where \(\mathbf {A} \in \mathbb {R}^{m \times n}\) represents the forward map and \(\boldsymbol {\theta } \in \mathbb {R}_+^{K}\) represents the (nonnegative) hyperparameters. In the hierarchical Bayes approach, we treat \(\boldsymbol {\theta }\) as a random variable, which we endow with prior density \(\pi _{\rm hyp}(\boldsymbol {\theta })\). We assume that the noise covariance matrix is given by a mapping \(\mathbf {R} : \mathbb {R}_+^{K} \to \mathbb {R}^{m \times m}\), where \(\mathbf {R}(\cdot )\) is symmetric and positive definite (SPD) and has an inverse and square root that are computationally easy to obtain for any input (e.g., a diagonal matrix or a scalar multiple of the identity). We assume that the prior distribution for the parameters \(\mathbf {s}\) is also Gaussian, of the form \(\mathcal {N} (\boldsymbol {\mu }(\boldsymbol {\theta }), \mathbf {Q}(\boldsymbol {\theta }))\), where \(\boldsymbol {\mu }: \mathbb {R}_+^{K} \to \mathbb {R}^{n}\) and \(\mathbf {Q} : \mathbb {R}_+^{K} \to \mathbb {R}^{n \times n}\), with \(\mathbf {Q}(\cdot )\) assumed to be SPD.
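For concreteness, the following is a minimal synthetic sketch of this data model. All specific choices here (problem sizes, the value of \(\boldsymbol {\theta }\), a random matrix standing in for \(\mathbf {A}\), a scaled identity for \(\mathbf {R}(\boldsymbol {\theta })\), a squared-exponential kernel for \(\mathbf {Q}(\boldsymbol {\theta })\), and \(\boldsymbol {\mu }(\boldsymbol {\theta }) = \mathbf {0}\)) are illustrative assumptions, not the setup used in this work.

```python
# A minimal synthetic example of the data model d = A s + eta.  All choices
# below (sizes, theta, the forward map, the prior kernel) are illustrative
# assumptions only.
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 150
theta = np.array([1e-2, 2.0])   # theta_1: noise variance, theta_2: prior variance

# Forward map A: a random matrix stands in for, e.g., a discretized blur.
A = rng.standard_normal((m, n)) / np.sqrt(n)

# Noise covariance R(theta) = theta_1 * I (diagonal, so inverse/sqrt are trivial).
R = theta[0] * np.eye(m)

# Prior covariance Q(theta): theta_2 times a squared-exponential kernel on [0, 1].
x = np.linspace(0, 1, n)
Q = theta[1] * np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.1) ** 2)

# Draw s ~ N(0, Q) (mu = 0 assumed) and noisy data d = A s + eta, eta ~ N(0, R).
L = np.linalg.cholesky(Q + 1e-10 * np.eye(n))   # small jitter for stability
s = L @ rng.standard_normal(n)
d = A @ s + np.sqrt(theta[0]) * rng.standard_normal(m)
```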

With the above assumptions, we obtain the marginal posterior density,

\begin{equation} \label {eq:marginal} \pi (\boldsymbol {\theta } \, | \, \mathbf {d}) \propto \pi _{\rm hyp}(\boldsymbol {\theta }) \det (\boldsymbol {\Psi }(\boldsymbol {\theta }))^{-1/2} \exp \left ( - \frac {1}{2} \| \mathbf {A} \boldsymbol {\mu }(\boldsymbol {\theta }) - \mathbf {d} \|^{2}_{\boldsymbol {\Psi }(\boldsymbol {\theta })^{-1}} \right ), \end{equation}

where \(\boldsymbol {\Psi }(\boldsymbol {\theta }) = \mathbf {A}\mathbf {Q}(\boldsymbol {\theta })\mathbf {A}^\top + \mathbf {R}(\boldsymbol {\theta }).\) One goal would be to draw samples (e.g., using Markov Chain Monte Carlo) from (1) and use the samples to quantify the uncertainty in the hyperparameters. However, this may be prohibitive for large-scale problems because evaluating the density function (or its logarithm) requires computing the determinant of, and performing multiple solves with, the matrix \(\boldsymbol {\Psi }(\boldsymbol {\theta })\), which can be expensive. To compound matters, hundreds of samples are required to get accurate statistics, which can involve several hundred thousand density function evaluations.
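For small problems, the negative logarithm of (1) can be evaluated directly with dense linear algebra, which makes the cost structure explicit: one Cholesky factorization of \(\boldsymbol {\Psi }(\boldsymbol {\theta })\) provides both the log-determinant and the solve. The sketch below continues the synthetic example above and assumes, purely for brevity, a flat hyperprior and \(\boldsymbol {\mu }(\boldsymbol {\theta }) = \mathbf {0}\); it is a dense reference evaluation, not a scalable method.

```python
# Dense "reference" evaluation of the negative log marginal posterior (up to a
# constant), feasible only for small m.  Assumes a flat hyperprior and mu = 0.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def neg_log_marginal_posterior(theta, A, d, x):
    m = A.shape[0]
    R = theta[0] * np.eye(m)
    Q = theta[1] * np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.1) ** 2)
    Psi = A @ Q @ A.T + R                       # m x m, SPD
    c, low = cho_factor(Psi)
    logdet = 2.0 * np.sum(np.log(np.diag(c)))   # log det(Psi) from the Cholesky factor
    r = -d                                      # residual A*mu(theta) - d with mu = 0
    quad = r @ cho_solve((c, low), r)           # ||A mu - d||^2 in the Psi^{-1} norm
    return 0.5 * (logdet + quad)

print(neg_log_marginal_posterior(theta, A, d, x))
```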

Instead, we follow an empirical Bayes approach and focus on computing the maximum a posteriori (MAP) estimate, that is, the point estimate that maximizes the marginal posterior distribution or, equivalently, minimizes the negative log of the marginal posterior. Hyperparameter estimation thus reduces to solving the optimization problem:

\begin{equation} \label {eq:fullopt} \min _{\boldsymbol {\theta } \in \mathbb {R}^K_+} \mathcal {F}(\boldsymbol {\theta }) \equiv -\log \pi _{\rm hyp}(\boldsymbol {\theta }) + \frac {1}{2} {\rm logdet}(\boldsymbol {\Psi }(\boldsymbol {\theta })) + \frac 12 \|\mathbf {A} \boldsymbol {\mu }(\boldsymbol {\theta }) - \mathbf {d}\|_{\mathbf {\Psi }(\boldsymbol {\theta })^{-1}}^2. \end{equation}

Notice that solving (2) is a computationally intensive task since it involves computing log determinants. To address this challenge, we consider an SAA method for computing the MAP estimate of the marginal posterior distribution that combines a sample average approximation of the objective function with a preconditioned Lanczos method to compute efficient approximations of the function and gradient evaluations. The novel contributions of this work are as follows.
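To make the main computational kernel concrete, the sketch below shows one common way to combine a Monte Carlo (Hutchinson) trace estimator with Lanczos quadrature to approximate \({\rm logdet}(\boldsymbol {\Psi }(\boldsymbol {\theta })) = {\rm trace}(\log \boldsymbol {\Psi }(\boldsymbol {\theta }))\) using only matrix-vector products with \(\boldsymbol {\Psi }(\boldsymbol {\theta })\). It omits preconditioning and the gradient estimates, and the numbers of probe vectors and Lanczos steps are illustrative choices, not the settings used in our method.

```python
# A minimal sketch of a Hutchinson + Lanczos (stochastic Lanczos quadrature)
# estimator for log det(Psi), using only matrix-vector products with Psi.
# No preconditioning; probe counts and step counts are illustrative only.
import numpy as np

def lanczos_logdet(matvec, m, n_probes=20, n_steps=30, rng=None):
    """Estimate log det of an SPD operator on R^m given by matvec."""
    rng = np.random.default_rng() if rng is None else rng
    est = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=m)      # Rademacher probe vector
        beta = np.linalg.norm(z)
        q_prev, q = np.zeros(m), z / beta
        alphas, betas = [], []
        for _ in range(n_steps):                 # Lanczos three-term recurrence
            w = matvec(q) - (betas[-1] * q_prev if betas else 0.0)
            alpha = q @ w
            w -= alpha * q
            alphas.append(alpha)
            b = np.linalg.norm(w)
            if b < 1e-12:
                break
            betas.append(b)
            q_prev, q = q, w / b
        T = np.diag(alphas) + np.diag(betas[: len(alphas) - 1], 1) \
                            + np.diag(betas[: len(alphas) - 1], -1)
        evals, evecs = np.linalg.eigh(T)
        # Gauss quadrature: z^T log(Psi) z ~ ||z||^2 * sum_i (e_1^T u_i)^2 log(lambda_i)
        est += beta**2 * np.sum(evecs[0, :] ** 2 * np.log(evals))
    return est / n_probes

# Usage, checked against the dense log-determinant from the example above:
Psi = A @ Q @ A.T + R
approx = lanczos_logdet(lambda v: Psi @ v, m)
exact = 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(Psi))))
print(approx, exact)
```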

Related works. The methods we describe here share certain techniques with the existing literature. The problem of optimizing for hyperparameters is closely related to maximum likelihood parameter estimation in Gaussian processes (we may think of it as setting the forward operator to the identity matrix). The literature on this topic is vast, but we mention a few key references that are relevant to our approach. In [3], the authors propose a matrix-free approach to estimate the hyperparameters and also use an SAA for optimization. In [2], the authors propose a reformulation of the problem that avoids inverting the (prior) covariance matrix. Approaches based on hierarchical matrices are considered in [8, 10, 1]. Preconditioned Lanczos methods for estimating the log-determinant and its gradient are considered in [6, 7]. However, the main difference is that the Gaussian process methods do not involve forward operators. This raises two issues: first, we have to account for the problem structure, which is different from that of Gaussian processes, and second, we have to account for the computational cost of the forward operator (and its adjoint), which may be comparable to or greater than the cost of applying the covariance matrices.

On the inverse problem side, there have been relatively few works on computing the hyperparameters by optimization. Several works (e.g., [4]) instead use sampling methods (e.g., Markov Chain Monte Carlo), but these methods are extremely expensive since they require several thousand evaluations of the likelihood to achieve accurate uncertainty estimates. In [9], we developed efficient methods for hyperparameter estimation based on low-rank approximations using the generalized Golub-Kahan iterative method. A brief review of other techniques is also given in the same paper.

References