General Methods for Sparsity Structure Description and Cost Estimation

Grace Dinh, James Demmel, Zhiru Zhang

Abstract

Sparse tensor operations (in particular matrix multiplications and, more generally, tensor contractions) can be used to represent many problems in diverse fields such as genomics, machine learning, network analysis, and electronic design automation. Casting a domain-specific problem as a sparse linear algebra operation allows domain experts to leverage existing optimized software libraries and hardware. However, the cost (in terms of flops, data movement/accesses, or memory footprint) of a sparse operation can vary significantly depending on the sparsity structure of its input, making the development of general high-performance tools for sparse linear algebra challenging. Estimating and bounding these costs is important for many applications: coming up with a performance objective for optimizing a software or hardware implementation, developing a notion of “peak performance” to compare a benchmark against, determining how much space to allocate for scratch space or for the output of an operation, load balancing, and many more.

Cost estimation is straightforward for dense linear algebra, as the exact set of arithmetic instructions is always the same and known ahead of time. However, in the sparse case, this is not possible unless the exact sparsity structure (i.e. the location of every nonzero) of the inputs is known. As a result, previous cost modeling approaches, e.g. [12], tend to either require that users provide a specific input matrix (precluding their use for developing and evaluating general tools) or provide results restricted to specific sparsity structures (e.g. uniformly distributed sparsity, block sparsity, band matrices). For input matrices that do not neatly fit into one of these predetermined categories, however, significant case-by-case work is required on the part of users to develop statistical models that both describe their matrices and provide good cost estimates and bounds.

This abstract sketches out an approach to generalize and automate the construction of such sparsity models, and to build cost estimates and bounds that take them into account, building on techniques from database theory and information theory. In Section 1, we present a way to describe sparsity structure using matrix statistics. We then describe how to use these statistics to bound and estimate costs in Section 2, and to optimize storage formats for sparse matrices in Section 3.

1 Characterizing Sparsity Structure

Our goal in this section is to describe a framework for matrix statistics - quantities describing the sparsity structure of a matrix that (a) are well-defined for any sparse matrix, regardless of structure, and (b) can be used effectively to predict the performance of a tensor operation. The best-known matrix statistic is the number of nonzeros (\(\text {nnz}\)) of a matrix. However, nnz alone is clearly insufficient for the cost estimation problem. Consider, for instance, the square matrix \(A\) whose first column is nonzero and whose other columns are all zero. Although the inputs to \(A^{T}A\) and \(AA^{T}\) have identical nnz, the two products differ drastically in output memory footprint (and therefore data movement). As a result, accurate performance modeling requires additional statistics that describe a matrix’s sparsity structure in more detail.
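
To make this example concrete, the following sketch (using scipy.sparse purely for illustration; the matrix size is an arbitrary choice, not from the text) constructs such a matrix and compares the output sizes of the two products.

```python
# Illustrative sketch: nnz alone cannot predict cost. A and A^T have the same nnz,
# but A^T A has a single nonzero while A A^T is completely dense.
import numpy as np
import scipy.sparse as sp

n = 1000
# A: every entry of the first column is nonzero; all other columns are zero.
A = sp.csr_matrix((np.ones(n), (np.arange(n), np.zeros(n, dtype=int))), shape=(n, n))

AtA = (A.T @ A).tocsr()
AAt = (A @ A.T).tocsr()

print(A.nnz, AtA.nnz, AAt.nnz)   # 1000, 1, 1000000
```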

One way to do so is to count the number of nonzeros in each row and column, which we refer to as the row and column counts, as in [6]. These statistics require significantly more space to store than nnz (as they are vectors with length equal to the number of rows and columns, respectively, of the matrix), but provide significantly more information: taking the dot product of the column counts of \(A\) and the row counts of \(B\) gives the exact number of flops required to compute \(A\times B\). Furthermore, row and column counts can be summarized by taking \(L^{p}\) norms for a few small \(p\). These norms provide a compact, easily generalizable way to represent how “skewed” the sparsity structure of a matrix is (e.g. how heavy-tailed the distribution of connections is in a social network graph), and can also be used to derive bounds on the cost of a matrix multiplication, as we will briefly discuss in Section 2 (see [1] for a discussion of these bounds from a database point of view).
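
As a quick illustration of the flop-count identity, here is a sketch using scipy.sparse; the matrix sizes and densities below are arbitrary assumptions for demonstration only.

```python
# Sketch: the dot product of A's column counts and B's row counts equals the exact
# number of scalar multiplications performed when forming A @ B.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
A = sp.random(500, 300, density=0.02, format="csc", random_state=rng)
B = sp.random(300, 400, density=0.02, format="csr", random_state=rng)

col_counts_A = np.diff(A.indptr)            # nonzeros per column of A (CSC layout)
row_counts_B = np.diff(B.indptr)            # nonzeros per row of B (CSR layout)
flops = int(col_counts_A @ row_counts_B)    # exact multiplication count for A @ B
print(flops)

# L^p norms of the count vectors compactly summarize how "skewed" the structure is.
for p in (1, 2, 4):
    print(p, np.linalg.norm(col_counts_A, ord=p))
```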

However, row and column counts alone are insufficient to describe many forms of commonly seen sparsity patterns, e.g. band and block-sparse matrices. To represent these patterns, we will extend the notion of indices to functions of the row and column indices. For a concrete example, consider a tridiagonal matrix \(A\) indexed by \((i,j)\). The locations of its nonzeros take on only three distinct values of \(i-j\); as a result, “number of distinct values of \(i-j\)” is a useful statistic that allows us to encapsulate tridiagonal matrices (and band matrices in general).
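
This statistic can be read directly off a stored matrix; a minimal sketch (scipy.sparse, illustrative only):

```python
# Sketch: compute the "number of distinct values of i - j" statistic for a tridiagonal matrix.
import numpy as np
import scipy.sparse as sp

n = 8
A = sp.diags([np.ones(n - 1), np.ones(n), np.ones(n - 1)], offsets=[-1, 0, 1]).tocoo()
print(np.unique(A.row - A.col))        # [-1  0  1] -> the statistic has value 3
```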

To formalize and generalize, let us view a sparse matrix \(A\) indexed by \((i,j)\) as a set consisting of the locations of its nonzeros: \(\{(i,j):A_{ij}\ne 0\}\). Let \(e_{1},e_{2},\ldots :\mathbb {Z}^{2}\rightarrow \mathbb {Z}\) be functions (such as \(i-j\) in the above example), which we will refer to as index maps. Define the following two operations:

  • Definition 1. The projection operation \(\pi _{e_{k}}\) projects its input onto dimension \(e_{k}\) - that is, \(\pi _{e_{k}}(A)\) has a nonzero at location \(l\) if there exists some nonzero value in \(A\) at location \((i,j)\) such that \(e_{k}(i,j)=l\).

  • Definition 2. The selection operation \(\sigma _{e_{k}=\eta }\) returns the subset of nonzero locations \((i,j)\) in \(A\) such that \(e_{k}(i,j)=\eta \). When no value for \(\eta \) is given, the selection operator \(\sigma _{e_{k}}\) will be used to represent the list \((\sigma _{e_{k}=\eta }:\eta \in \text {Im}(e_{k}))\).

Let \(\circ \) represent function composition. If the output of \(g\) is a vector, let \(f\mathbin {\vec {\circ }}g\) denote the vector obtained by applying \(f\) to every element of the output of \(g\). Then many natural matrix statistics can be represented by choosing appropriate index maps \(e_{k}\):

  • nnz: the cardinality \(\left |\cdot \right |\) of the set of nonzero locations itself.

  • Row and column counts: \(\left |\cdot \right |\mathbin {\vec {\circ }}\sigma _{i}\) and \(\left |\cdot \right |\mathbin {\vec {\circ }}\sigma _{j}\), respectively.

  • The number of distinct values of \(i-j\) (three for a tridiagonal matrix, \(w\) for a band matrix of bandwidth \(w\)): \(\left |\cdot \right |\circ \pi _{i-j}\).
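
A minimal sketch of the two operators and a few of the statistics built from them, on the set-of-nonzeros representation (the helper names below are ours, chosen only for illustration):

```python
# Sketch of the projection and selection operators on a matrix represented as the set of
# its nonzero locations, together with the statistics they induce.
def projection(nonzeros, index_map):
    """pi_e(A): the image of the nonzero locations under the index map e."""
    return {index_map(i, j) for (i, j) in nonzeros}

def selection(nonzeros, index_map, eta):
    """sigma_{e=eta}(A): the nonzero locations (i, j) with e(i, j) == eta."""
    return {(i, j) for (i, j) in nonzeros if index_map(i, j) == eta}

# 5 x 5 tridiagonal example.
n = 5
A = {(i, j) for i in range(n) for j in range(n) if abs(i - j) <= 1}

nnz = len(A)                                                           # |A| = 13
row_counts = [len(selection(A, lambda i, j: i, r)) for r in range(n)]  # |.| applied to sigma_i
band_stat = len(projection(A, lambda i, j: i - j))                     # |pi_{i-j}(A)| = 3
print(nnz, row_counts, band_stat)                                      # 13 [2, 3, 3, 3, 2] 3
```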

Furthermore, appropriately chosen index maps can be used to characterize matrices with sparsity structures that do not align with “standard” patterns. For example, the Tuma11 matrix could be decomposed into several components, each of which would have a very small value for \(\left |\cdot \right |\circ \pi _{\alpha i-j}\) (for some constant \(\alpha \)). Preliminary experiments show that computer vision methods such as Hough transforms [7], as well as modern machine learning methods such as symbolic regression [9], can be used to extract, from many real-world matrices, descriptive index maps from which useful bounds can be derived; we leave further experimentation to future work.
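
The sketch below is a much-simplified stand-in for that idea (not the authors' pipeline): it scores a small grid of candidate slopes \(\alpha \) by the resulting value of \(\left |\cdot \right |\circ \pi _{\alpha i-j}\) and keeps the best one. The test matrix and slope grid are made up for illustration.

```python
# Simplified slope search: pick alpha minimizing the number of distinct values of
# round(alpha * i) - j over the nonzero locations of A.
import numpy as np
import scipy.sparse as sp

def best_slope(A_coo, candidate_slopes):
    scores = {}
    for alpha in candidate_slopes:
        diffs = np.unique(np.round(alpha * A_coo.row).astype(int) - A_coo.col)
        scores[alpha] = len(diffs)
    return min(scores, key=scores.get), scores

# Example: a "stretched" band matrix with nonzeros near j = 2*i.
n = 200
rows = np.repeat(np.arange(n), 3)
cols = np.clip(2 * rows + np.tile([-1, 0, 1], n), 0, 2 * n - 1)
A = sp.coo_matrix((np.ones_like(rows, dtype=float), (rows, cols)), shape=(n, 2 * n))

alpha, scores = best_slope(A, candidate_slopes=[0.5, 1.0, 2.0, 3.0])
print(alpha, scores[alpha])   # alpha = 2.0 gives only 3 distinct values of 2*i - j
```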

2 Bounds from Matrix Statistics

This section describes approaches to deriving cost bounds from the matrix statistics described in Section 1. While we focus on matrix multiplication here, our approach can generalize to most “nested loop” style programs acting on sparse data; we leave such generalization to future work. As in the previous section, we will view a sparse matrix as a set whose elements are its nonzero indices. Then a sparse matrix multiplication \(A\times B\), where \(A\) is indexed by \((i,j)\) and \(B\) by \((j,k)\), can be viewed as the set of nontrivial arithmetic instructions - that is, \(\{(i,j,k):A_{ij}\ne 0,B_{jk}\ne 0\}\), which we denote \(T\). Note that this matrix multiplication tensor can be viewed as the database join \(A(i,j)\wedge B(j,k)\). Several cost functions follow immediately from this representation:

  • The number of flops is the cardinality \(\left |T\right |\).

  • The number of nonzeros of the output \(C\) is the number of distinct \((i,k)\) pairs appearing in \(T\).

  • The data movement required on a machine with fast memory (cache) of size \(M\), and hence the arithmetic intensity of the computation.
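
A minimal sketch of the first two quantities, computed directly from the set (join) representation; the tiny example matrices are arbitrary.

```python
# Sketch: build T = {(i, j, k) : A_ij != 0, B_jk != 0} as a join of the nonzero sets of
# A and B, then read off the flop count and the output nnz.
from collections import defaultdict

A = {(0, 0), (0, 1), (1, 1), (2, 1)}          # nonzero locations of A, indexed by (i, j)
B = {(0, 0), (1, 0), (1, 2)}                  # nonzero locations of B, indexed by (j, k)

B_by_j = defaultdict(list)
for (j, k) in B:
    B_by_j[j].append(k)

T = {(i, j, k) for (i, j) in A for k in B_by_j[j]}
flops = len(T)                                 # number of nontrivial multiply-adds
out_nnz = len({(i, k) for (i, j, k) in T})     # nnz of the output C
print(flops, out_nnz)                          # 7 6
```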

One approach we can take to bounding these quantities is to transform the indices of the nested loops in such a way that the resulting loop nest, when treated as a dense operation, produces useful bounds. For instance, suppose we wish to multiply two band matrices \(A\) and \(B\), which have bandwidths \(w_{1}\) and \(w_{2}\), respectively:

\begin{align*} & \text {for }i,j,k\in [0,N)^{3}\\ & \quad C(i,k)+=A(i,j)\times B(j,k) \end{align*} As the two matrices are banded, we know that \((\left |\cdot \right |\circ \pi _{i-j})(A)=w_{1}\) and \((\left |\cdot \right |\circ \pi _{k-j})(B)=w_{2}\). As a result, if we let \(e_{1}=i-j\) and \(e_{2}=k-j\), we can rewrite this loop nest as:

\begin{align*} & \text {for }e_{1}\in [0,w_{1}),e_{2}\in [0,w_{2}),j\in [0,N)\\ & \quad \quad C(e_{1}+j,e_{2}+j)+=A(e_{1}+j,j)\times B(j,e_{2}+j) \end{align*} which provides an upper bound of \(w_{1}w_{2}N\) on the number of flops. Furthermore, applying Brascamp-Lieb inequalities [5, 10, 4] to the rewritten loop nest gives an upper bound on arithmetic intensity (on a system with cache size \(M\)) of \(\sqrt {M}/2\).
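
A quick sanity check of the \(w_{1}w_{2}N\) flop bound, counting the nontrivial multiply-adds for one particular pair of band matrices (the sizes and widths below are arbitrary):

```python
# Sketch: count nontrivial multiply-adds for banded A (width w1) times banded B (width w2)
# using the set-of-nonzeros view from Section 1, and compare against w1 * w2 * N.
N, w1, w2 = 64, 3, 5

A = {(i, j) for i in range(N) for j in range(N) if 0 <= i - j < w1}
B = {(j, k) for j in range(N) for k in range(N) if 0 <= k - j < w2}

flops = sum(1 for (i, j) in A for (j2, k) in B if j2 == j)
print(flops, "<=", w1 * w2 * N)    # 926 <= 960 (boundary rows have slightly fewer terms)
```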

Unfortunately, this method is not easily generalized: we were able to transform indices \(i\) and \(k\) into new indices that could easily be bounded using the given matrix statistics because \(A\) and \(B\) shared band structure; this would not have been possible had they not. To address this problem, we adapt information-theoretic techniques previously used for database cardinality estimation [3, 2]. Specifically, given any probability distribution over the set of arithmetic instructions \(T\) in the sparse matrix multiplication, let \(h\) denote the Shannon entropies of its marginal distributions (e.g. \(h(ij)\) denotes the entropy of the marginal distribution over \(i\),\(j\)). Clearly, \(h(ijk)\) is upper bounded by \(\lg \left |T\right |\), the log of the number of flops of the matrix multiplication. Furthermore, notice that for an instruction \((i,j,k)\) to be in \(T\), \(A_{ij}\) and \(B_{jk}\) must both be nonzero; as a result, the entropies \(h(ij)\) and \(h(jk)\) are upper bounded by \(\lg \text {nnz}(A)\) and \(\lg \text {nnz}(B)\) respectively. These inequalities can be combined with those inherent to entropy (nonnegativity, submodularity, and subadditivity) to produce bounds on cost.

For example, it can be shown that for any distribution on \(i,j,k\):

\[ 3h(ijk)\le h(ij)+h(jk)+h(ik)+h(j|i)+h(k|j)+h(i|k) \]

Letting the distribution be the uniform distribution over \(T\) sets the left side of the above inequality to \(3\lg \left (\#\text {flops}\right )\), while \(h(ij)\), \(h(jk)\), and \(h(ik)\) are upper bounded by \(\lg \text {nnz}(A)\), \(\lg \text {nnz}(B)\), and \(\lg \text {nnz}(C)\) respectively. Furthermore, \(h(j|i)\) is upper bounded by the log of the maximum number of nonzero elements in any row of \(A\) (similarly for the remaining terms), giving an inequality that ties together computation cost, output size, and memory footprint. In this framework, all of the cost functions above can be described: the number of flops and the output size are immediately derivable from entropic inequalities, and arithmetic intensity can be found by adding the constraints that the entropies \(h(ij)\) and \(h(jk)\) are upper bounded by \(\lg M\). In order to incorporate matrix statistics based on arbitrary index maps (e.g. \(e_{1}=i-j\)), we can add additional constraints: specifically that \(h(e_{1}|ij)=h(i|je_{1})=h(j|ie_{1})=0\). This allows for the automated construction of new lower bounds for, say, the cost of multiplying a band matrix by a block-sparse one, based on statistics such as the number of dense blocks sharing an index with a given band.
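
This bound-generation procedure can be automated as a small linear program over entropy variables. The sketch below is our illustration (not the authors' implementation, and the statistic values are made-up placeholders): it maximizes \(h(ijk)\) subject to the Shannon inequalities plus the statistic constraints discussed above, yielding an upper bound on the number of flops, using scipy.optimize.linprog.

```python
# Sketch: upper-bound lg(#flops) = max h(ijk) via a linear program over entropies h(S),
# S a subset of {i, j, k}, subject to Shannon (polymatroid) inequalities plus constraints
# derived from matrix statistics.
from itertools import combinations
from math import log2
from scipy.optimize import linprog

indices = ("i", "j", "k")

def subsets(s):
    for r in range(len(s) + 1):
        yield from combinations(s, r)

var = {frozenset(S): pos for pos, S in enumerate(subsets(indices))}   # 8 variables
n = len(var)
A_ub, b_ub = [], []

def leq(lhs, rhs, const=0.0):
    """Add the constraint sum_{S in lhs} h(S) <= sum_{S in rhs} h(S) + const."""
    row = [0.0] * n
    for S in lhs:
        row[var[frozenset(S)]] += 1.0
    for S in rhs:
        row[var[frozenset(S)]] -= 1.0
    A_ub.append(row)
    b_ub.append(const)

leq([()], [])  # h(empty) <= 0; with the nonnegativity bound below, h(empty) = 0

# Shannon inequalities: monotonicity h(S) <= h(T) for S a subset of T, and
# submodularity h(S union T) + h(S intersect T) <= h(S) + h(T).
for S in subsets(indices):
    for T in subsets(indices):
        fS, fT = frozenset(S), frozenset(T)
        if fS < fT:
            leq([S], [T])
        if fS != fT:
            leq([tuple(fS | fT), tuple(fS & fT)], [S, T])

# Constraints from (hypothetical, made-up) matrix statistics.
nnz_A, nnz_B = 10_000, 20_000      # nnz(A), nnz(B)
max_row_A, max_row_B = 50, 80      # max nonzeros in any row of A / of B

leq([("i", "j")], [], log2(nnz_A))            # h(ij) <= lg nnz(A)
leq([("j", "k")], [], log2(nnz_B))            # h(jk) <= lg nnz(B)
leq([("i", "j")], [("i",)], log2(max_row_A))  # h(j|i) <= lg(max row count of A)
leq([("j", "k")], [("j",)], log2(max_row_B))  # h(k|j) <= lg(max row count of B)

# Maximize h(ijk); linprog minimizes, so negate the objective.
c = [0.0] * n
c[var[frozenset(indices)]] = -1.0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
print("flops <=", 2 ** -res.fun)   # about 800000 = nnz(A) * max_row_B for these numbers
```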

3 Matrix Format Optimizations

We also wish to find efficient ways to store sparse matrices. Consider, for example, a band matrix with a small bandwidth. Standard sparse matrix formats, such as CSR, would require significantly more storage for metadata (row pointers and column indices) than a similar format indexed by \(i-j\) and \(j\) [11]. Furthermore, the order in which the indices are stored can substantially affect size and performance as well - just as \((i,j)\) (CSR) and \((j,i)\) (CSC) are significantly different formats, so are \((i-j,j)\) and \((j,i-j)\).
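
A back-of-the-envelope sketch of the metadata gap (the matrix size and bandwidth below are arbitrary, and boundary effects are ignored):

```python
# Rough comparison of metadata footprints (in index words) for an N x N band matrix of
# bandwidth w: CSR stores one column index per nonzero plus N + 1 row pointers, while a
# DIA-like format indexed by (i - j, j) needs only one offset per stored diagonal.
N, w = 100_000, 5
nnz = w * N                        # ignoring the shorter diagonals at the boundary

csr_metadata = nnz + (N + 1)       # column indices + row pointers
dia_metadata = w                   # diagonal offsets
print(csr_metadata, dia_metadata)  # 600001 vs 5
```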

The choice of data structures and layouts directly impacts performance. For instance, to use Gustavson’s algorithm efficiently, the operand tensors should ideally be stored in row-major formats. We will describe how entropic bounds (specifically, the chain bound) can suggest optimal orderings and data structures for sparse matrix storage formats.

However, performance is often heavily affected by the underlying hardware architecture. For parallel processing systems such as GPUs, maintaining workload balance often outweighs achieving a high compression ratio when selecting a format. As a result, formats with zero padding, such as ELLPACK, are commonly preferred over those that store only nonzero elements. Blocked formats, while introducing additional memory access and metadata overhead on architectures with a unified memory model, are well-suited to many-core architectures with banked memory. Work is ongoing to extend our cost models to account for hardware-specific performance factors.

References