
Parallel Incomplete Factorization Preconditioners

Erik G. Boman1, Marc A. Tunnell2

Abstract

Incomplete factorizations are popular preconditioners and are well known to be effective for a wide range of problems. Additionally, these preconditioners can be used as a “black box” and do not rely on any a priori knowledge of the problem. However, traditional algorithms for computing these incomplete factorizations are based on Gaussian elimination and do not parallelize well. Recently, a more parallel incomplete factorization algorithm was proposed by Chow and Patel [4], where the factors are computed iteratively. Here we propose a new iterative approach that is based on alternating triangular solves of \(L\) and \(U\). We develop two versions: ATS-ILU for a static sparsity pattern, and ATS-ILUT for a dynamic pattern (using thresholding). We show that this new method is similar to the fine-grained iterative ILU method of Chow and Patel but has the added advantage that it allows greater reuse of memory and is fully deterministic in parallel, meaning the results do not depend on scheduling. We evaluate the new method on several test matrices from the SuiteSparse collection and show that it is competitive with current ILU methods. When short setup time is important, it is typically better than other methods.

1 Sandia National Labs, egboman@sandia.gov

2 Purdue University, mtunnell@purdue.edu

1 Introduction

Preconditioning is well known to be essential for improving the speed of convergence of Krylov methods such as Conjugate Gradient (CG) and Generalized Minimal Residual (GMRES) [8]. Incomplete Lower-Upper (ILU) factorizations are a popular class of preconditioners as they can be used as a “black box” on a wide range of problems. There are two main types of ILU factorizations, level-based ILU(k) [3, 6, 7] and threshold-based ILUT [9]. However, these methods are inherently sequential and do not parallelize well.

There has been interest in the parallelization of these more classical interpretations of ILU, largely through graph partitioning schemes. These graph partition-based methods, such as [5], offer a promising approach to parallelizing classical ILU methods. By decomposing the graph corresponding to the matrix and determining variables that can be eliminated in parallel, these methods aim to distribute the computational load more evenly across processors. While these strategies have shown effectiveness for certain types of problems [3], their implementation can be highly complex. Additionally, their performance can be problem-dependent, requiring consideration of the underlying graph structure when choosing a parallelization strategy.

More recently, there has been progress on computing the ILU factors iteratively, potentially giving up some of the robustness of the classical methods in exchange for better parallel properties [4]. Iterative ILU methods, such as those introduced by Chow and Patel [4], offer significant advantages in terms of scalability on modern parallel architectures. For the remainder of this paper, we refer to the method introduced by Chow and Patel as ParILU and its thresholded counterpart as ParILUT [1, 4]. These methods approximate the ILU factors through a series of iterative updates, which can be more easily distributed across multiple processors or offloaded to accelerators.

By breaking down each iterative update into smaller approximate subproblems and solving them independently, different parts of the factorization can be computed in parallel without the need for complex graph-partitioning algorithms. This approach allows for the use of iterative ILU methods on a wide range of problems, including those with complex or irregular graph structures that may preclude high levels of parallelism in the graph-partitioned classical ILU methods.

Furthermore, iterative ILU methods are adaptable to various hardware accelerators such as graphics processing units (GPUs) [2], which are increasingly important for high-performance computing. By leveraging the parallel processing capability of these accelerators, iterative ILU methods can significantly reduce the real-world time required to compute the ILU factors for large-scale problems, thereby speeding up the overall solution process.

In this paper, we propose a new class of iteratively-computed ILU preconditioners, which we call Alternating triangular Solves ILU (ATS-ILU). This method builds upon the strengths of existing iterative ILU approaches while leveraging improved memory reuse and determinism in parallel. We provide an analysis of the method and evaluate its performance compared to the state of the art on a variety of test matrices. We show that our method is competitive with current ILU methods and has the potential to be a powerful tool for solving large-scale problems on modern parallel architectures.

2 Alternating Triangular Solves Method

In this section, we introduce our new method for computing ILU factors, ATS-ILU. The method is based on alternating iterative updates to the \(L\) and \(U\) factors of the matrix \(A\): as before, we iteratively update the factors \(L\) and \(U\) until convergence, but the updates are performed in an alternating manner. This general process is a common approach to solving bilinear systems and is outlined in Algorithm 1.

Algorithm 1: Alternating ILU

  • 1: \(U^{(0)} \gets \texttt {triu}(A)\)

  • 2: \(k \gets 0\)

  • 3: while not converged do

  • 4: Solve \(L^{(k)} U^{(k)} \approx A\) for \(L^{(k)}\)

  • 5: Solve \(L^{(k)} U^{(k+1)} \approx A\) for \(U^{(k+1)}\)

  • 6: Check convergence

  • 7: \(k \gets k + 1\)

  • 8: end while

One way to perform this procedure would be to carry out a full triangular solve with the entirety of \(U^{(k)}\), taking the rows of \(A\) as the right-hand sides, to solve for \(L^{(k)}\), and similarly to solve for \(U^{(k+1)}\). This entire process can largely be performed in parallel, as each row of \(L\) and each column of \(U\) can be solved independently. Despite the potential for high levels of parallelism, it is still extremely computationally expensive and likely suffers from significant fill-in during intermediate steps. The computational cost can be reduced by solving only approximately.
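Before any pattern is imposed, this unrestricted step can be sketched in dense form. The code below is a minimal illustration under assumptions of our own (dense storage; the function and type names are not from the paper): it solves \(L U^{(k)} = A\) for \(L\) one row at a time, each row being an independent substitution against \(U^{(k)}\).

```cpp
#include <vector>
#include <cstddef>

// Minimal dense sketch of one unrestricted alternating step (illustration
// only; names and dense storage are our own). Solves L * U = A for L, one
// row at a time: row i satisfies L[i][j] * U[j][j] = A[i][j] minus the
// already-computed terms, so each row is an independent left-to-right
// substitution and all rows can be processed in parallel.
using Mat = std::vector<std::vector<double>>;

void solve_L_given_U(const Mat& A, const Mat& U, Mat& L) {
    const std::size_t n = A.size();
    for (std::size_t i = 0; i < n; ++i) {      // rows are independent
        for (std::size_t j = 0; j < n; ++j) {  // left-to-right substitution
            double s = A[i][j];
            for (std::size_t k = 0; k < j; ++k)
                s -= L[i][k] * U[k][j];
            L[i][j] = s / U[j][j];             // assumes nonzero diagonal of U
        }
    }
}
```

The analogous solve for the next \(U\) proceeds column by column against \(L\). Note that the \(L\) computed this way is in general dense rather than lower triangular, which illustrates both the parallelism across rows and the fill-in problem.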

Additionally, the algorithm as stated above does not guarantee that \(L\) and \(U\) remain lower and upper triangular, respectively. One method to address this issue would be to solve for \(L\) only in the lower triangular part of \(A\) and for \(U\) only in the upper triangular part of \(A\). This would ensure that the factors remain lower and upper triangular, respectively, but would still leave the problem of significant levels of fill-in. Instead, we suggest a more practical approach where we impose a sparsity pattern on \(L\) and \(U\), namely \(\mathcal {L}\) and \(\mathcal {U}\), respectively. This sparsity pattern can be chosen to be the same as the sparsity pattern of \(A\), which is the choice we make in this paper.

In order to get around the issue of fill-in, we propose a method where we approximately solve for \(L\) and \(U\) along their given sparsity patterns, which we discuss next.

2.1 Alternating Triangular Solves ILU Algorithm

The ATS-ILU algorithm is based on the idea of approximately solving for \(L\) and \(U\) in an alternating fashion, restricted to their given sparsity patterns. Again, the rows of \(L\) and the columns of \(U\) can be solved independently, allowing for a high level of parallelism. The algorithm is shown in Algorithm 2. We present the algorithm for a general pattern \(\mathcal {S}\), but in practice this will correspond to the pattern of \(A^k\) for some small power \(k\).

Algorithm 2: ATS-ILU

  • 1: Input: Sparse matrix \(A\), sparsity pattern \(\mathcal {S}\), starting factors \(L\) and \(U\)

  • 2: while not converged do

  • 3: for \(i \in \{1, 2, \dots , n\}\) do

  • 4: \(\text {idx} \gets \{j \in \mathbb {N} \mid (i,j) \in \mathcal {S},~ j \leq i\}\)

  • 5: \(\ell _{i, \text {idx}} \gets a_{i, \text {idx}}\left (U_{\text {idx},\text {idx}}\right )^{-1}\)

  • 6: end for

  • 7: for \(j \in \{1, 2, \dots , n\}\) do

  • 8: \(\text {idx} \gets \{i \in \mathbb {N} \mid (i,j) \in \mathcal {S},~ i \geq j\}\)

  • 9: \(u_{\text {idx}, j} \gets \left (L_{\text {idx},\text {idx}}\right )^{-1} a_{\text {idx},j}\)

  • 10: end for

  • 11: end while

In this algorithm, we solve for each row of \(L\) and each column of \(U\) independently. Recall that the notation \(\boldsymbol {a}_{i, \text {idx}}\) refers to the \(i\)th row of \(\boldsymbol {A}\) restricted to the indices in \(\text {idx}\), and similarly for \(U_{\text {idx},\text {idx}}\) and \(L_{\text {idx},\text {idx}}\). These submatrices can be viewed as the (dense) non-contiguous submatrices of \(L\) and \(U\) that correspond to the sparsity pattern \(\mathcal {S}\) along the given row or column.
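As a concrete illustration of the row update (line 5 of Algorithm 2), the sketch below gathers the restricted entries and performs the substitution. It uses dense storage for brevity, whereas the actual implementation operates on sparse data structures, so the names and layout here are our own simplification rather than the Kokkos implementation.

```cpp
#include <vector>
#include <cstddef>

// Sketch of the Algorithm 2 row update for a single row i (dense storage
// for brevity; illustrative names, not the paper's implementation).
// Given idx = {j : (i,j) in S, j <= i} in ascending order, this solves
//   l_{i,idx} * U_{idx,idx} = a_{i,idx}
// by left-to-right substitution over the pattern; U_{idx,idx} is upper
// triangular because idx is ascending. Each row i is independent, which
// is where the parallelism comes from.
using Mat = std::vector<std::vector<double>>;

void ats_row_solve(const Mat& A, const Mat& U,
                   const std::vector<std::size_t>& idx,
                   std::size_t i, Mat& L) {
    for (std::size_t q = 0; q < idx.size(); ++q) {
        const std::size_t j = idx[q];
        double s = A[i][j];
        for (std::size_t p = 0; p < q; ++p)   // entries of row i already solved
            s -= L[i][idx[p]] * U[idx[p]][j];
        L[i][j] = s / U[j][j];                // assumes nonzero diagonal of U
    }
}
```

The column update for \(U\) (line 9) is the mirror image: a top-to-bottom substitution against \(L_{\text {idx},\text {idx}}\) for each column \(j\) independently.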

Our method can be extended to do thresholding to maintain a certain fill level. We call this extension ATS-ILUT, and defer the details to the full paper.

3 Results

We implemented the ATS-ILUT algorithm in C++ with Kokkos for parallel performance portability. We show some preliminary results in Table 3 (PDF abstract only).

4 Conclusions

We have developed a new parallel iterative ILU algorithm ATS-ILU and a thresholded version ATS-ILUT. Experiments show it performs similarly to the ParILU(T) method, but it often provides a better quality preconditioner after just one or two steps (updates) of the setup. This is an advantage if setup time is important, e.g., when solving a sequence of linear systems. Also, it is naturally deterministic (though an asynchronous version is also possible).

Acknowledgments

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525.

References