SIAM J. COMPUT. Vol. 24, No. 2, pp. 227-234, April 1995
© 1995 Society for Industrial and Applied Mathematics
SPARSE APPROXIMATE SOLUTIONS TO LINEAR SYSTEMS*

B. K. NATARAJAN

Abstract. The following problem is considered: given a matrix A in R^{m×n} (m rows and n columns), a vector b in R^m, and ε > 0, compute a vector x satisfying ||Ax − b||_2 ≤ ε if such exists, such that x has the fewest number of nonzero entries over all such vectors. It is shown that the problem is NP-hard, but that the well-known greedy heuristic is good in that it computes a solution with at most ⌈18 Opt(ε/2) ||Ā^+||_2^2 ln(||b||_2/ε)⌉ nonzero entries, where Opt(ε/2) is the optimum number of nonzero entries at error ε/2, Ā is the matrix obtained by normalizing each column of A with respect to the L_2 norm, and Ā^+ is its pseudoinverse.

Key words. sparse solutions, linear systems

AMS subject classification. 68Q25
1. Introduction. The problem of computing a sparse approximate solution to a linear system is a fundamental problem in matrix computation. For matrices over the reals, the problem (or a variant thereof) has been studied under the name "subset selection" in statistical modeling by Golub and Van Loan (1983). For binary matrices, the problem has been studied as "minimum weight solution" in error-correcting coding by Gallager (1968). The related "sparse null space" problem is of interest in nonlinear optimization; see Coleman and Pothen (1986). Also, the "minimum set cover" problem can be viewed as computing a sparse solution to Ax ≥ b, where all the entries are binary and the inequality is entrywise; see Garey and Johnson (1979), Johnson (1974).

Our specific motivation for studying the problem is as follows. A widely used technique for the interpolation of irregularly spaced samples in higher dimensions is that of radial basis interpolation; see Hardy (1990). In brief, assign each of the given sample points a coefficient. The value of the interpolant at a new point is the weighted sum of these coefficients, where the weights are the distances of the respective samples from the new point. The coefficients themselves are the solution to the linear system of equations obtained by evaluating the interpolant at each of the given points and equating it to the respective sample value; Micchelli (1986) has shown that this linear system is always nonsingular.

In solving the above linear system, two issues are of interest. (1) The cost of evaluating the interpolant at a general point grows with the number of nonzero coefficients in the solution. Hence, it is desirable to have as few nonzeros as possible, while tolerating some error in the interpolation. (2) By the principle of Occam's Razor, the interpolant with the fewest nonzero coefficients is the most likely to approximate the underlying function that generated the samples.
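The radial basis construction just described is easy to set up concretely. The following is a minimal sketch, not from the paper: the sample data are hypothetical, the weights are plain Euclidean distances as in the description above, and NumPy is assumed.

```python
import numpy as np

# Hypothetical irregularly spaced samples in R^2 and their values.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.5, 1.5]])
vals = np.array([1.0, 2.0, 0.5, 3.0])

# The interpolant is f(x) = sum_j c_j * ||x - p_j||_2.  Evaluating f at each
# sample point and equating it to the sample value yields D c = vals, where
# D[i, j] = ||p_i - p_j||_2.  By Micchelli (1986) this distance matrix is
# nonsingular for distinct points.
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
c = np.linalg.solve(D, vals)

# The interpolant reproduces the data at the sample points.
f = lambda x: np.sum(c * np.linalg.norm(x - pts, axis=1))
assert np.allclose([f(p) for p in pts], vals)
```

Every nonzero coefficient in c contributes one distance evaluation per query point, which is the cost concern raised in point (1) above.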
Indeed, in the presence of noise in the data, it is desirable to pick a sparse but approximate interpolant. Recently, learning-theoretic justifications of Occam's Razor in this setting have been established; see Natarajan (1993). In light of the above, we are interested in obtaining a sparse approximate solution to the linear system specifying the coefficients of the interpolant. Similar arguments can also be made with respect to other interpolation methods, such as polynomial or trigonometric interpolation.

*Received by the editors November 2, 1992; accepted for publication October 29, 1993. Hewlett-Packard Laboratories, 1501 Page Mill Road, Palo Alto, California 94304.

Previously, it was known that the problem is NP-hard if the solution is required to be over the integers; see Garey and Johnson (1979). For matrices over the reals, the heuristic of Golub, Klema, and Stewart (1976) is well known, as discussed in Golub and Van Loan (1983). In this paper, we show that the problem remains NP-hard over the reals. We then show that the obvious and well-known greedy heuristic (Golub and Van Loan (1983)) is provably good in that
the number of nonzeros reported is at most ⌈18 Opt(ε/2) ||Ā^+||_2^2 ln(||b||_2/ε)⌉, where Opt(ε/2) is the optimum number of nonzero entries at error ε/2, Ā is the matrix obtained by normalizing each column of A with respect to the L_2 norm, and Ā^+ is its pseudoinverse. The algorithm is essentially a marriage between the greedy set cover algorithm of Johnson (1974) and the QR algorithm for least-squares solutions of linear systems of Golub and Van Loan (1983).

It is important to note that minimizing the number of nonzeros is in competition with the stability of the algorithm, i.e., the norm of the computed solution. The heuristic of Golub, Klema, and Stewart (1976) offers stability, but not provably good sparseness, while our algorithm does the converse. An open problem is to obtain an algorithm that allows an explicit trade-off. The algorithm consumes O(mn) time per entry reported. Since the entries of the solution are reported in order of their "significance," the proposed algorithm is a good candidate for the progressive solution of dense linear systems. Computational experiments on using the greedy algorithm for radial basis interpolation are discussed in Carlson and Natarajan (1994).
2. Hardness. Let SAS refer to the sparse approximate solution problem stated below.

Problem. Given a matrix A in R^{m×n} (m rows and n columns), a vector b in R^m, and ε > 0, compute a vector x satisfying ||Ax − b||_2 ≤ ε if such exists, such that x has the fewest number of nonzero entries over all such vectors.

We first establish that SAS is NP-hard on the machine model of the infinite-precision RAM; see Preparata and Shamos (1985), for instance.

THEOREM 1. SAS is NP-hard.

Proof. The proof is by reduction from the problem of "exact cover by 3-sets," as in the proof of hardness for the problem of "minimum weight solutions to linear systems"; see Garey and Johnson (1979, pp. 221 and 246).

Exact Cover by 3-Sets. Instance. A set S, and a collection C of 3-element subsets of S. Question. Does C contain an exact cover for S, i.e., a subcollection C′ of C such that every element of S occurs exactly once in C′?

Let us use X3C to refer to the problem of exact cover by 3-sets, and show how to transform an instance of X3C to an instance of SAS. Given is an instance of X3C: S = {s_1, s_2, …, s_m} and C = {c_1, c_2, …, c_n}. Without loss of generality we can assume that m is a multiple of 3, since otherwise there is trivially no exact cover. Let b be the vector (1, 1, …, 1) of m ones. The matrix A will have n column vectors, one for each set in C. Specifically, for each set c in C, the corresponding column vector will have entries (z_1, z_2, …, z_m), where z_i = 1 if s_i ∈ c and z_i = 0 otherwise. Pick any ε < 1, say ε = 1/2.

We show that the constructed instance of SAS has a solution with m/3 or fewer nonzero entries if and only if the given instance of X3C has a solution. If the X3C instance has a solution, then consider the vector x = (x_1, x_2, …, x_n), where x_i = 1 if the ith set in C is included in the solution of the X3C instance, and x_i = 0 otherwise. Then Ax = b exactly, and hence the SAS instance has a solution with exactly m/3 nonzero entries. Conversely, assume the SAS instance has a solution x with at most m/3 nonzero entries. Since ||Ax − b||_2 ≤ ε < 1, each entry of Ax must lie strictly between 0 and 2 and, in particular, must be nonzero. Since each column of A has only three nonzero entries, x must have at least m/3 nonzero entries. Thus x has exactly m/3 nonzero entries. Now consider the subcollection C′ consisting of those sets c_i such that the ith entry in x is nonzero. It is clear that C′ is an exact cover for S. □
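The reduction in the proof can be exercised on a small instance. The sketch below is illustrative only: the X3C instance is hypothetical, and the sparsest support is found by brute force (which is exponential, as the theorem leads one to expect).

```python
import numpy as np
from itertools import combinations

# Hypothetical X3C instance: S = {0,...,5}, C a collection of 3-element subsets.
m = 6
C = [{0, 1, 2}, {1, 2, 3}, {3, 4, 5}, {0, 4, 5}]

# SAS instance as in the proof of Theorem 1: one column per set, with ones in
# the rows of its three elements; b is the all-ones vector; any eps < 1 works.
A = np.zeros((m, len(C)))
for j, c in enumerate(C):
    A[sorted(c), j] = 1.0
b = np.ones(m)
eps = 0.5

def sparsest_support():
    """Brute force: smallest support size k admitting ||Ax - b||_2 <= eps."""
    for k in range(1, len(C) + 1):
        for supp in combinations(range(len(C)), k):
            sub = A[:, supp]
            x, *_ = np.linalg.lstsq(sub, b, rcond=None)
            if np.linalg.norm(sub @ x - b) <= eps:
                return k
    return None

# The SAS instance has an m/3-sparse solution iff C contains an exact cover.
exact_cover_exists = any(
    set().union(*(C[j] for j in supp)) == set(range(m))
    for supp in combinations(range(len(C)), m // 3)
)
assert (sparsest_support() == m // 3) == exact_cover_exists
```

Here {0, 1, 2} and {3, 4, 5} form an exact cover, so the sparsest solution has exactly m/3 = 2 nonzeros.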
3. The algorithm. The algorithm we consider is a merging of the greedy set cover algorithm of Johnson (1974) and the QR algorithm for the least-squares problem of Golub and Van Loan (1983). In essence, it is a QR algorithm where the column pivot is chosen greedily with respect to the right-hand side b of the linear system. Such a modification is well known; see
Golub and Van Loan (1983, p. 168, Exercise 6.4.8). In the interest of simplicity, we will not present the algorithm as a modification of the QR algorithm, but as a two-phase algorithm with a subset selection phase followed by an explicit solution phase.

In words, Algorithm Greedy below takes as input A, b, and ε. The algorithm first normalizes each column a_i of A to obtain Ā. Then, at each iteration of the selection phase, the algorithm greedily picks that column of Ā that is closest in angle to the vector b. Then, b and the remaining column vectors of Ā are projected onto the subspace orthogonal to the chosen column. This procedure is repeated until ||b||_2 ≤ ε. In the solution phase, the algorithm solves the linear system Bx = b^(0) − b^(r), where B is the matrix consisting of those columns of A that were chosen in the selection phase, b^(0) is b at the start of the selection phase, and b^(r) is b at the end of the selection phase.
Algorithm Greedy
input: matrix A, column vector b, ε > 0.
Subset Selection Phase:
    Normalize each column of A to obtain Ā.
    r ← 0; Γ ← ∅; b^(0) ← b; Ā^(0) ← Ā;
    while ||b^(r)||_2 > ε do
        if (Ā^(r))^T b^(r) = 0 then
            no solution exists;
        else
            choose k ∈ {1, 2, …, n} − Γ such that column ā_k^(r) of Ā^(r) is
            closest in angle to b^(r), i.e., |(ā_k^(r))^T b^(r)| is maximum;
            project b^(r) onto the subspace orthogonal to ā_k^(r), i.e.,
                b^(r+1) ← b^(r) − ((ā_k^(r))^T b^(r)) ā_k^(r);
            Γ ← Γ ∪ {k};
            for j ∈ {1, 2, …, n} − Γ do
                project ā_j^(r) onto the subspace orthogonal to ā_k^(r), i.e.,
                    ā_j^(r+1) ← ā_j^(r) − ((ā_k^(r))^T ā_j^(r)) ā_k^(r);
                normalize ā_j^(r+1), i.e.,
                    ā_j^(r+1) ← ā_j^(r+1) / ||ā_j^(r+1)||_2;
            end
            r ← r + 1;
        end
    end
Solution Phase:
    Solve Bx = b^(0) − b^(r), where B is the matrix of columns of A with indices in Γ.
end
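A direct, unoptimized rendering of Algorithm Greedy follows. It is a sketch only: NumPy is assumed, floating point stands in for exact arithmetic, and the function name greedy_sparse and the use of lstsq for the solution phase are ours, not the paper's. The O(mn)-time-per-entry bookkeeping discussed in the text is not attempted.

```python
import numpy as np

def greedy_sparse(A, b, eps):
    """Sketch of Algorithm Greedy: greedily select columns of the
    column-normalized matrix, project them out, then solve for the
    coefficients on the selected columns of A."""
    m, n = A.shape
    Abar = A / np.linalg.norm(A, axis=0)          # normalize each column
    Ar = Abar.copy()
    br = b.astype(float).copy()                   # b^(r)
    b0 = br.copy()                                # b^(0)
    chosen = []                                   # the index set Gamma
    while np.linalg.norm(br) > eps:
        corr = Ar.T @ br
        corr[chosen] = 0.0                        # consider unchosen columns only
        if np.allclose(corr, 0.0):
            return None                           # no solution exists
        k = int(np.argmax(np.abs(corr)))          # column closest in angle to b^(r)
        br = br - (Ar[:, k] @ br) * Ar[:, k]      # project b off the chosen column
        ak = Ar[:, k].copy()
        chosen.append(k)
        for j in range(n):
            if j in chosen:
                continue
            Ar[:, j] -= (ak @ Ar[:, j]) * ak      # project column off a_k^(r)
            nrm = np.linalg.norm(Ar[:, j])
            if nrm > 0:
                Ar[:, j] /= nrm                   # renormalize
    # Solution phase: solve B x = b^(0) - b^(r) on the selected columns of A.
    B = A[:, chosen]
    xs, *_ = np.linalg.lstsq(B, b0 - br, rcond=None)
    x = np.zeros(n)
    x[chosen] = xs
    return x
```

For example, with A the identity and b = (3, 0, 0), the sketch selects the single matching column and returns the 1-sparse solution; on a random system whose right-hand side lies in the column space, it reduces ||b^(r)||_2 below eps and the returned x satisfies ||Ax − b||_2 ≤ eps.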
THEOREM 2. The number of vectors selected by Algorithm Greedy, and hence the number of nonzero entries in the computed solution, is at most

(1)    ⌈18 Opt(ε/2) ||Ā^+||_2^2 ln(||b||_2/ε)⌉,

where Opt(ε/2) denotes the fewest number of nonzero entries over all solutions that satisfy ||Ax − b||_2 ≤ ε/2.
Our proof of Theorem 2 is a simplified and generalized variant of that of the greedy set cover algorithm given by Johnson (1974). For each value of r, at the start of the corresponding iteration of the algorithm, let u^(r) be a vector with the minimum number of nonzero entries such that ||Ā^(r) u^(r) − b^(r)||_2 ≤ ε/2. Also, we define N^(r) as the number of nonzero entries in u^(r), and q^(r) = Ā^(r) u^(r). Let t be the number of iterations the algorithm runs in the selection phase. It is clear that the number of nonzero entries produced by the solution phase of the algorithm is at most t. Define

(2)    p = 4 max_{0 ≤ r < t} N^(r) ||u^(r)||_2^2 / ||b^(r)||_2^2.
Our proof proceeds in a sequence of lemmas. In the first lemma, we bound t in terms of p. Lemmas 2 and 3 estimate the right-hand side of the above definition of p. Combining the three lemmas, we will obtain Theorem 2. For z ∈ R, let ⌈z⌉ denote the smallest integer greater than or equal to z.

LEMMA 1. The number of iterations t of the algorithm is at most

(3)    ⌈2p ln(||b^(0)||_2 / ε)⌉.
Proof. Since u^(r) has only N^(r) nonzero entries, we will show that at least one of the columns of Ā^(r) must be correspondingly close to b^(r). Specifically, we establish a lower bound on ||(Ā^(r))^T b^(r)||_∞. Now,

(4)    b^(r) = Σ_{i=1}^{n} u_i^(r) ā_i^(r) + ε/2,

where ε/2 stands for the generic error vector of L_2 norm at most ε/2. Hence

(5)    (b^(r))^T b^(r) = Σ_{i=1}^{n} u_i^(r) (b^(r))^T ā_i^(r) + (b^(r))^T (ε/2)

(6)                   ≤ (max_i |(ā_i^(r))^T b^(r)|) Σ_{i=1}^{n} |u_i^(r)| + ||b^(r)||_2 ε/2.

Now,

(7)    (b^(r))^T b^(r) = ||b^(r)||_2^2,

(8)    max_i |(ā_i^(r))^T b^(r)| = ||(Ā^(r))^T b^(r)||_∞,

and

(9)    (b^(r))^T (ε/2) ≤ ||b^(r)||_2 ε/2.

Also, since u^(r) has N^(r) nonzero entries,

(10)    Σ_{i=1}^{n} |u_i^(r)| = ||u^(r)||_1 ≤ √(N^(r)) ||u^(r)||_2.

Hence, we can write (6) as

(11)    ||b^(r)||_2^2 ≤ ||(Ā^(r))^T b^(r)||_∞ √(N^(r)) ||u^(r)||_2 + (1/2) ||b^(r)||_2 ε.
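Inequality (10) is the Cauchy-Schwarz relation between the L_1 and L_2 norms restricted to the support; a quick numerical sanity check (random sparse vector, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(50)
u[rng.permutation(50)[:40]] = 0.0     # force at least 40 zeros: N <= 10
N = np.count_nonzero(u)
# Cauchy-Schwarz on the support: ||u||_1 <= sqrt(N) * ||u||_2.
assert np.linalg.norm(u, 1) <= np.sqrt(N) * np.linalg.norm(u, 2) + 1e-12
```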
Since the algorithm iterates only while ||b^(r)||_2 > ε, the last term of (11) is at most (1/2)||b^(r)||_2^2. Rearranging, we get

(12)    ||(Ā^(r))^T b^(r)||_∞ ≥ ||b^(r)||_2^2 / (2 √(N^(r)) ||u^(r)||_2).

Using the above lower bound on ||(Ā^(r))^T b^(r)||_∞, we now estimate the rate at which ||b^(r)||_2 diminishes at each iteration of the algorithm. Now,

(13)    b^(r+1) = b^(r) − ((ā_k^(r))^T b^(r)) ā_k^(r),

where ā_k^(r) is the column chosen by the algorithm at the rth iteration. Since |(ā_k^(r))^T b^(r)| is a maximum, |(ā_k^(r))^T b^(r)| = ||(Ā^(r))^T b^(r)||_∞, and we can rewrite the above as

(14)    b^(r+1) = b^(r) − sign((ā_k^(r))^T b^(r)) ||(Ā^(r))^T b^(r)||_∞ ā_k^(r),

where sign() is the function that returns the sign of its argument. Since b^(r+1) is orthogonal to ā_k^(r),

(15)    ||b^(r+1)||_2^2 = ||b^(r)||_2^2 − ||(Ā^(r))^T b^(r)||_∞^2.

Rearranging, we get

(16)    ||b^(r+1)||_2^2 = (1 − ||(Ā^(r))^T b^(r)||_∞^2 / ||b^(r)||_2^2) ||b^(r)||_2^2.

Substituting (12) in (16), we get

(17)    ||b^(r+1)||_2^2 ≤ (1 − ||b^(r)||_2^2 / (4 N^(r) ||u^(r)||_2^2)) ||b^(r)||_2^2.

Using (2) in (17), we get

(18)    ||b^(r+1)||_2^2 ≤ (1 − 1/p) ||b^(r)||_2^2.

Hence,

(19)    ||b^(t)||_2^2 ≤ (1 − 1/p)^t ||b^(0)||_2^2.

The algorithm halts when

(20)    ||b^(t)||_2^2 ≤ ε^2.

For this to happen, it suffices that t satisfy (1 − 1/p)^t ||b^(0)||_2^2 ≤ ε^2. Rearranging and taking logarithms, we get

(21)    t ln(1/(1 − 1/p)) ≥ 2 ln(||b^(0)||_2 / ε).

Incorporating in (21) the following inequality, obtained from a Taylor expansion of the natural logarithm function,

(22)    ln(1 − 1/p) ≤ −1/p,
we get that it suffices if
>_ p In
(23)
(II b()ll)
2pln
2
(1 1 _ _)
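The per-iteration decrease asserted in (13)-(15), namely that one greedy projection step removes exactly ||(Ā^(r))^T b^(r)||_∞^2 from ||b^(r)||_2^2, can be confirmed numerically; a minimal NumPy sketch on a random instance (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
Abar = A / np.linalg.norm(A, axis=0)   # unit columns, as in the algorithm
b = rng.standard_normal(6)

corr = Abar.T @ b
k = int(np.argmax(np.abs(corr)))       # greedily chosen column
b_next = b - corr[k] * Abar[:, k]      # one projection step, eq. (13)

# b_next is orthogonal to the chosen column, so by Pythagoras, eq. (15):
# ||b^(r+1)||^2 = ||b^(r)||^2 - ||Abar^T b||_inf^2.
assert np.isclose(Abar[:, k] @ b_next, 0.0)
assert np.isclose(
    np.linalg.norm(b_next) ** 2,
    np.linalg.norm(b) ** 2 - np.linalg.norm(corr, np.inf) ** 2,
)
```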
This completes the proof of the lemma. □

We will now establish an upper bound on p, by first bounding ||u^(r)||_2 and then incorporating a bound on N^(r).

LEMMA 2. For 0 ≤ r < t, ||u^(r)||_2 ≤ (3/2) ||Ā^+||_2 ||b^(r)||_2.

Proof. Let σ = {i : u_i^(r) ≠ 0} be the set of indices of the nonzero entries in u^(r), and let τ = {k_1, k_2, …, k_r} be the indices of the first r columns picked by the algorithm. For a matrix A and a set of indices μ, we use μ(A) to denote the set of column vectors {a_i : i ∈ μ}.

We first show that σ(Ā^(0)) ∪ τ(Ā^(0)) is a linearly independent set. Assume the contrary, i.e., some linear combination of vectors from σ(Ā^(0)) ∪ τ(Ā^(0)) sums to zero. Not all of the vectors in this combination can be from τ(Ā^(0)), since τ(Ā^(0)) is a linearly independent set by the definition of the algorithm. So it must be that at least one of the vectors in this combination is from σ(Ā^(0)). It follows that this vector from σ(Ā^(0)) can be expressed as a linear combination of the other vectors from σ(Ā^(0)) ∪ τ(Ā^(0)). Since σ(Ā^(r)) is the projection of σ(Ā^(0)) on the subspace orthogonal to τ(Ā^(0)), the above supposition implies that some vector in σ(Ā^(r)) can be expressed as a linear combination of the other vectors in σ(Ā^(r)). But this contradicts the assumption that u^(r) has a minimum number of nonzero entries. Hence σ(Ā^(0)) ∪ τ(Ā^(0)) must be linearly independent.

We can now establish a bound on ||u^(r)||_2. By definition,

(24)    q^(r) = Σ_{i∈σ} u_i^(r) ā_i^(r).

For i ∈ σ, we can write

(25)    ā_i^(r−1) = λ_i^(r−1) ā_i^(r) + v_i^(r−1),

where λ_i^(r−1) is a scalar and v_i^(r−1) is a vector in the span of τ(Ā^(r−1)). Using this recurrence, we can express ā_i^(r−1) in terms of ā_i^(r−2), and so on down to ā_i^(0). Since ā_i^(r) is orthogonal to v_i^(r−1) and ā_i^(r−1) is a unit vector, (λ_i^(r−1))^2 + ||v_i^(r−1)||_2^2 = 1. Using this, we can combine the recurrences (25) to obtain a single expression of the form

(26)    ā_i^(0) = √(1 − ||v_i||_2^2) ā_i^(r) + v_i,

where v_i is a vector in the span of τ(Ā^(0)). Substituting (26) in (24), we get

(27)    q^(r) = Σ_{i∈σ} (u_i^(r) / √(1 − ||v_i||_2^2)) (ā_i^(0) − v_i).

Noting that each v_i is a vector in the span of τ(Ā^(0)), and hence can be expressed as a linear combination of the columns of τ(Ā^(0)), we can rewrite (27) as

(28)    q^(r) = Σ_{i∈σ} (u_i^(r) / √(1 − ||v_i||_2^2)) ā_i^(0) + Σ_{i∈τ} β_i ā_i^(0)
for some β_i in R. Let Z be the matrix with columns σ(Ā^(0)) ∪ τ(Ā^(0)), and Z^+ its pseudoinverse. Also, let w = Z^+ q^(r). As established above, the columns of Z are linearly independent, and hence the entries of w are uniquely determined as in (28) above. Specifically, for i ∈ σ,

(29)    w_i = u_i^(r) / √(1 − ||v_i||_2^2),

and for i ∈ τ, w_i = β_i. Since √(1 − ||v_i||_2^2) ≤ 1, we have that for i ∈ σ, |u_i^(r)| ≤ |w_i|; for i ∈ τ, u_i^(r) = 0, and hence |u_i^(r)| ≤ |w_i|. We therefore have

(30)    ||u^(r)||_2 ≤ ||w||_2 ≤ ||Z^+||_2 ||q^(r)||_2.

Noting that q^(r) = b^(r) + ε/2, and that since r < t, ||b^(r)||_2 > ε, we can simplify the above as

(31)    ||u^(r)||_2 ≤ ||Z^+||_2 ||b^(r) + ε/2||_2 ≤ (3/2) ||Z^+||_2 ||b^(r)||_2.
Since the columns of Z are a linearly independent subset of the columns of Ā^(0) = Ā, the smallest nonzero singular value of Z is greater than or equal to the smallest nonzero singular value of Ā; see Golub and Van Loan (1983, p. 286). Recall that Ā^+ is the pseudoinverse of Ā. Since ||Z^+||_2 and ||Ā^+||_2 are the reciprocals of the smallest nonzero singular values of Z and Ā, respectively, ||Z^+||_2 ≤ ||Ā^+||_2, and the proof of the lemma is complete. □

LEMMA 3. The number of entries in the sparse solution is nonincreasing as the algorithm iterates, i.e., N^(r+1) ≤ N^(r) ≤ N^(0).

Proof. As before, let σ = {i : u_i^(r) ≠ 0} be the set of indices of the nonzero entries in u^(r). By definition, |σ| = N^(r). Suppose that the algorithm selects a vector from σ at the rth iteration. Then surely u^(r+1) has one fewer nonzero entry than u^(r), namely the entry corresponding to the selected vector. If the vector selected by the algorithm is not from σ, then σ remains the set of nonzero entries of u^(r+1). In either case N^(r+1) ≤ N^(r), and the lemma follows. □

Noting that N^(0) = Opt(ε/2), we can use Lemma 3 to rewrite (2) as

(32)    p ≤ 4 max_{0 ≤ r < t} N^(0) ||u^(r)||_2^2 / ||b^(r)||_2^2.

Using Lemma 2 to substitute for ||u^(r)||_2 in (32) gives us

(33)    p ≤ 9 N^(0) ||Ā^+||_2^2 = 9 Opt(ε/2) ||Ā^+||_2^2.

Substituting (33) in the statement of Lemma 1, we get that the number of nonzeros in the computed solution is at most

(34)    ⌈18 Opt(ε/2) ||Ā^+||_2^2 ln(||b||_2 / ε)⌉.

This concludes the proof of Theorem 2.
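The last step of the proof rests on two facts about pseudoinverses: the spectral norm of a pseudoinverse is the reciprocal of the smallest nonzero singular value, and deleting columns cannot decrease that singular value. Both are easy to confirm numerically; a minimal sketch on a random full-column-rank example (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 5))        # full column rank almost surely
sigma = np.linalg.svd(A, compute_uv=False)

# ||A+||_2 equals 1 / sigma_min for a full-column-rank matrix.
assert np.isclose(np.linalg.norm(np.linalg.pinv(A), 2), 1.0 / sigma.min())

# Dropping columns cannot decrease the smallest singular value (the
# interlacing fact cited from Golub and Van Loan), so ||Z+||_2 <= ||A+||_2.
Z = A[:, [0, 2, 4]]
assert (np.linalg.norm(np.linalg.pinv(Z), 2)
        <= np.linalg.norm(np.linalg.pinv(A), 2) + 1e-12)
```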
Acknowledgments. Thanks to C. Bischof, R. Carlson, T. Coleman, J. Gilbert, D. S. Johnson, and J. Ruppert for the discussions, to A. Frieze for helping to clear up an error in an earlier version of the paper, and to the anonymous referees for their comments.
REFERENCES

R. E. CARLSON AND B. K. NATARAJAN (1994), Sparse approximate multiquadric interpolation, Comput. Math. Appl., Vol. 27 (6), pp. 99-108.
T. COLEMAN AND A. POTHEN (1986), The null space problem I. Complexity, SIAM J. Alg. Disc. Meth., Vol. 7, pp. 527-537.
R. G. GALLAGER (1968), Information Theory and Reliable Communication, John Wiley, New York, NY.
M. R. GAREY AND D. S. JOHNSON (1979), Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, New York.
G. H. GOLUB, V. KLEMA, AND G. W. STEWART (1976), Rank Degeneracy and Least Squares Problems, Tech. Report TR-456, Dept. of Computer Science, University of Maryland, College Park, MD.
G. H. GOLUB AND C. F. VAN LOAN (1983), Matrix Computations, Johns Hopkins University Press, Baltimore, MD.
R. L. HARDY (1990), Theory and applications of the multiquadric-biharmonic method, Comput. Math. Appl., Vol. 19, pp. 163-208.
D. S. JOHNSON (1974), Approximation algorithms for combinatorial problems, J. Comput. System Sci., Vol. 9, pp. 256-278.
C. A. MICCHELLI (1986), Interpolation of scattered data: distance matrices and conditionally positive definite functions, Constr. Approx., Vol. 2, pp. 11-22.
B. K. NATARAJAN (1993), Occam's razor for functions, Proceedings of the ACM Symposium on Computational Learning Theory, Santa Cruz, CA.
F. P. PREPARATA AND M. I. SHAMOS (1985), Computational Geometry: An Introduction, Springer-Verlag, New York.