An Introduction to Sparsity Analysis

A comprehensive guide for analyzing sparse systems with applications in machine learning and computer vision

Photo by Thomas Rohlfs on Unsplash

This article aims to delve into the concept of sparsity, a fundamental idea that revolutionized the multimedia industry in the 21st century. We will explore various methods for analyzing systems with sparse representations and discuss specific applications in machine learning and computer vision, where these methods have had a significant impact.

We will begin by understanding how data can be represented in matrix form. Then, we will explore some important definitions and concepts, such as singular value decomposition, which will help us appreciate the intricacies of the algorithms behind these applications.

Keywords: Singular Value Decomposition (SVD), Compressed Sensing, Regression, Recommendation Systems, Econometrics.

Solving systems of linear equations

The general form of a matrix equation is:

$$Ax = b$$

Here $A$ is the system or data matrix, $x$ is the solution vector, and $b$ is the output. If we expand each term, we obtain:

$$\begin{bmatrix} |&|&\ldots&|\\ a_1&a_2&\ldots&a_n\\ |&|&\ldots&| \end{bmatrix} \begin{bmatrix} x_1\\ \vdots\\ x_n \end{bmatrix} = \begin{bmatrix} b_1\\ \vdots\\ b_n \end{bmatrix}$$

It is important to note that the output $b$ can also be represented as a linear combination of the column vectors of $A$, weighted by the elements of the vector $x$, as shown below:

$$\begin{bmatrix} |\\ a_1\\ | \end{bmatrix}x_1 + \begin{bmatrix} |\\ a_2\\ | \end{bmatrix}x_2 + \ldots+\begin{bmatrix} |\\ a_n\\ | \end{bmatrix}x_n = b$$

This representation provides a more intuitive understanding of matrix-vector multiplication.
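To make this concrete, here is a tiny NumPy check (using an arbitrary 3×2 example of our own) that the product $Ax$ equals the weighted sum of the columns of $A$:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
x = np.array([0.5, -1.0])

# Matrix-vector product versus the explicit column combination a_1*x_1 + a_2*x_2
print(np.allclose(A @ x, A[:, 0] * x[0] + A[:, 1] * x[1]))  # True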

If the equation $Ax=b$ has no solution, which is often the case in real-world problems, we can find the best approximate solution by minimizing the squared norm of the residual $b-Ax$. This is known as the least squares method:

$$\min_x ||b - Ax||_2^2$$

More specifically, the objective is to minimize the sum of squared errors between $Ax$ and the output vector $b$:

$$\min_x \sum_{i=1}^m\Big(b_i - \sum_{j=1}^na_{ij}x_j\Big)^2$$

A closed-form solution for this model, when $A^TA$ is invertible, is provided by the normal equations:

$$x^* = (A^TA)^{-1}A^Tb$$
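As a quick sanity check, here is a minimal NumPy sketch (the tall random matrix is just a placeholder) comparing the normal-equation solution with the built-in least-squares solver:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))                    # 100 observations, 5 features
b = rng.normal(size=100)

x_star = np.linalg.solve(A.T @ A, A.T @ b)       # (A^T A)^{-1} A^T b via the normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # reference least-squares solution
print(np.allclose(x_star, x_lstsq))              # True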

However, if $A$ is under-determined, i.e., it contains more unknowns than equations, this closed-form solution does not exist. For an under-determined system, the number of columns exceeds the number of rows ($m < n$), indicating more features than observations or samples. Such systems have infinitely many solutions, which can lead to overfitting and compromise the model's generalization capability.

To address such problems, we restrict the solution space by incorporating prior knowledge about the data through regularization.

Singular Value Decomposition (SVD)

Before delving into regularization, let’s look at Singular Value Decomposition (SVD), a fundamental concept in linear algebra. SVD asserts that any matrix $X$ can be decomposed (or factorized) into a product of three matrices $U$, $\Sigma$, and $V^T$.

$$X = U\Sigma V^T = \begin{bmatrix} |&|&\ldots&|\\ u_1&u_2&\ldots&u_m\\ |&|&\ldots&| \end{bmatrix} \begin{bmatrix} \sigma_1&&&0 \\ &\sigma_2&& \\ &&\ddots& \\ 0&&&\sigma_n \\ &&0& \end{bmatrix} \begin{bmatrix} |&|&\ldots&|\\ v_1&v_2&\ldots&v_n\\ |&|&\ldots&| \end{bmatrix}^T$$

Here, both $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix.

We refer to the values $\sigma_1,\sigma_2,\ldots, \sigma_n$ as singular values, the vectors $u_1,u_2,\ldots,u_n$ as left singular vectors, and the vectors $v_1,v_2,\ldots,v_n$ as right singular vectors.

The singular values are non-negative and arranged in descending order, i.e. $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_n$, indicating the relative importance of each term in the expansion

$$X = u_1\sigma_1v_1^T + u_2\sigma_2v_2^T + \ldots$$

Each term in the above sum has rank $1$, since it depends only on one column vector and one row vector (recall that for any matrix the row rank equals the column rank). Consequently, the matrix $X$ can be approximated using the first $k$ terms, giving the closest rank-$k$ representation of $X$.

For instance, the best rank-one approximation of the matrix $X$ is given by $u_1\sigma_1v_1^T$.

$$\begin{bmatrix} |\\ u_1\\ | \end{bmatrix} \sigma_1 \begin{bmatrix} -&v_1&- \end{bmatrix}$$

This concept of decomposing the matrix into a sum of rank-$1$ matrices can be employed to solve the rank minimization problem, formulated as

$$\min_{\text{rank}(\hat{A})=r}||A-\hat{A}||_F$$

where $||\cdot||_F$ is called the Frobenius norm.

Essentially, we are trying to find the optimal rank-$r$ approximation of matrix $A$, denoted by $\hat{A}$, by minimizing the Frobenius norm of the difference between $A$ and $\hat{A}$.

The Frobenius norm of a matrix is the square root of the sum of its squared entries:

$$||A||_F = \sqrt{\sum_i\sum_jA[i,j]^2}$$

The solution to this problem is inherently sparse in its singular values, and it can be obtained using singular value decomposition by keeping only the first $r$ singular values and setting the rest to zero.
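A minimal NumPy sketch of this truncation (the matrix and the target rank $r$ are arbitrary placeholders):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))
r = 3

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_hat = (U[:, :r] * s[:r]) @ Vt[:r, :]     # sum of the first r rank-1 terms
print(np.linalg.norm(X - X_hat, 'fro'))    # Frobenius error of the rank-r approximation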

Several variants of singular value decomposition exist, including principal component analysis (PCA), Robust PCA, linear discriminant analysis (LDA), and more.

Note: Interestingly, the right singular vectors correspond to the eigenvectors of the matrix $X^TX$, while the left singular vectors correspond to the eigenvectors of $XX^T$. The eigenvalues of the symmetric positive semidefinite matrix $X^TX$ are the squares of the corresponding singular values $\sigma_1,\sigma_2, \ldots,\sigma_n$ of $X$.

Regularization

As mentioned earlier, regularization allows us to restrict the solution space by incorporating prior knowledge into the problem, such as sparsity or smallness of the solution vector $x$.

By constraining the solutions of $Ax=b$ to be sparse, the system becomes more robust, which prevents the model from overfitting.

Regularization can be achieved using norms. A norm is a function defined on a vector space that satisfies absolute homogeneity and the triangle inequality, and that equals zero only when the input vector is zero.

When the $l_2$ norm is employed for regularization, the problem of finding the optimal solution is known as Ridge regression. If the model is instead regularized using the $l_1$ norm, the problem is called the Lasso.

Among the norms used for regularization, only the $l_1$ norm and the $l_p$ norms (where $0<p<1$) have been proven to yield sparse solutions. We shall discuss these norms in the upcoming sections.

$l_2$-Norm

The $l_2$ norm, also known as the Euclidean norm, of a vector $x$ is defined as:

$$||x||_2 = \sqrt{\sum_{i=1}^{n}x_i^2}$$

As we mentioned earlier, we are trying to find the optimal solutions for the least-squares problem:

$$\min_x ||Ax-b||_2^2$$

By introducing $l_2$ norm regularization to the equation $Ax=b$, we get

$$\begin{aligned} \min \quad & ||x||_2\\ \textrm{s.t.} \quad & Ax=b \end{aligned}$$

The solution space in 2D is shown below

/blog/sparsity/l2norm.svg

While the $l_2$ norm can effectively regularize the model, it does not yield sparse solutions.

Alternatively, we can swap the constraint and objective and rewrite the problem as:

$$\begin{aligned} \min_{x} \quad & ||Ax-b||^2_2\\ \textrm{s.t.} \quad & ||x||_2\leq S \end{aligned}$$

This can be reformulated into the Lagrangian form as shown below

$$\min_x ||Ax-b||_2^2 + \lambda||x||_2^2$$

Notice that the extra term $\lambda ||x||_2^2$ contributes to the regularization of the model.

This way of regularizing the model using the $l_2$ norm is known as the Ridge problem, where the hyperparameter $\lambda$ controls the amount of regularization. It is worth noting that both $\lambda$ and $S$ constrain the values of $x$, and their optimal values depend on the data.
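Because the penalty is the squared $l_2$ norm, Ridge retains a closed-form solution, $x^* = (A^TA + \lambda I)^{-1}A^Tb$. Here is a minimal NumPy sketch (with placeholder data) checked against scikit-learn's Ridge:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)
lam = 1.0

# Closed-form Ridge solution: x* = (A^T A + lam * I)^{-1} A^T b
x_closed = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

ridge = Ridge(alpha=lam, fit_intercept=False)   # fit_intercept=False to match the formula exactly
ridge.fit(A, b)
print(np.allclose(x_closed, ridge.coef_))       # True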

$l_0$-Norm

The $l_0$ norm of a vector $x$ is the number of non-zero entries of $x$ (strictly speaking, it is not a true norm, since it is not homogeneous).

Using $l_0$ norm regularization, the problem becomes

$$\begin{aligned} \min \quad & ||x||_0\\ \textrm{s.t.} \quad & Ax=b \end{aligned}$$

The solution space in 2D is shown below:

/blog/sparsity/l0norm.svg

Although $l_0$ norm regularization provides sparse solutions, they are not unique. Furthermore, solving this problem is NP-hard due to the non-convex, combinatorial nature of the $l_0$ norm. However, greedy algorithms such as Matching Pursuit can be employed to tackle such problems.
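As a sketch of the greedy approach, scikit-learn's OrthogonalMatchingPursuit lets us cap the number of non-zero coefficients directly (the sparse recovery setup below is a made-up example):

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 100))            # under-determined: 50 equations, 100 unknowns
x_true = np.zeros(100)
x_true[[3, 17, 42]] = [1.5, -2.0, 0.7]    # only 3 non-zero entries
b = A @ x_true

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3, fit_intercept=False)
omp.fit(A, b)
print(np.flatnonzero(omp.coef_))          # indices of the recovered support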

$l_1$-Norm

The $l_1$ norm of a vector $x$ is given by

$$||x||_1 = \sum_{i=1}^{n}|x_i|$$

By incorporating the $l_1$ norm into the problem, we can obtain sparse solutions that effectively eliminate the impact of less important features. For instance, if we have:

$$\begin{bmatrix} |\\ a_1\\ | \end{bmatrix}x_1 + \begin{bmatrix} |\\ a_2\\ | \end{bmatrix}0 +\begin{bmatrix} |\\ a_3\\ | \end{bmatrix}0 +\begin{bmatrix} |\\ a_4\\ | \end{bmatrix}x_4 = b$$

Then the solution is simply $\begin{bmatrix}x_1\\0\\0\\x_4\end{bmatrix}$.

Therefore, the problem to solve becomes

$$\begin{aligned} \min \quad & ||x||_1\\ \textrm{s.t.} \quad & Ax=b \end{aligned}$$

The solution space in 2D is shown below

/blog/sparsity/l1norm.svg

Similar to Ridge, we can swap the constraint and objective and rewrite the problem as:

$$\begin{aligned} \min_{x} \quad & ||Ax-b||^2_2\\ \textrm{s.t.} \quad & ||x||_1\leq S \end{aligned}$$

This is called the Lasso problem, where the constraint $S$ limits the values of $x$, forcing the optimization algorithm to produce sparse solutions.

The same problem can be reformulated into the Lagrangian form as shown below

$$\min_x ||Ax-b||_2^2 + \lambda||x||_1$$

Though the Lasso yields sparse solutions, unlike Ridge it has no closed-form solution. However, various methods, including Matching Pursuit, Orthogonal Matching Pursuit, the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), and the Alternating Direction Method of Multipliers (ADMM), can be employed to solve the optimization problem.
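To illustrate the shrinkage idea behind ISTA/FISTA, here is a minimal proximal-gradient sketch for the Lagrangian form above (a bare-bones teaching version; in practice one would reach for sklearn.linear_model.Lasso or a full FISTA implementation):

import numpy as np

def soft_threshold(z, t):
    # Proximal operator of the l1 norm
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, b, lam, n_iter=500):
    # Minimizes ||Ax - b||_2^2 + lam * ||x||_1 by proximal gradient descent
    L = 2.0 * np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - b)             # gradient of the smooth least-squares term
        x = soft_threshold(x - grad / L, lam / L)  # gradient step followed by shrinkage
    return x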

Note: The Lasso is closely related to Basis Pursuit. Basis Pursuit refers to the equality-constrained form $\min ||x||_1$ subject to $Ax=b$, while its noise-aware variant, Basis Pursuit Denoising, is equivalent to the Lagrangian form of the Lasso. The term Lasso is more common in the statistics community, whereas Basis Pursuit comes from signal processing.

$l_p$-Norm

The generalized $l_p$ norm of a vector $x$ is given by

$$||x||_p = \Big(\sum_{i=1}^N |x_i|^p\Big)^{1/p}$$

where $0<p<1$.

By incorporating the $l_p$ norm into the given problem:

$$\begin{aligned} \min \quad & ||x||_p\\ \textrm{s.t.} \quad & Ax=b \end{aligned}$$

The solution space in 2D is shown below

/blog/sparsity/lpnorm.svg

By swapping the constraint and objective we can rewrite the problem as:

$$\begin{aligned} \min_{x} \quad & ||Ax-b||^2_2\\ \textrm{s.t.} \quad & ||x||_p\leq S \end{aligned}$$

This formulation can also be expressed in the Lagrangian form:

$$\min_x ||Ax-b||_2^2 + \lambda||x||_p$$

While $l_p$ norm regularization allows us to obtain sparse solutions, it poses challenges due to the non-convex nature of the $l_p$ norm, making the problem NP-hard.

All the regularization norms are summarized in the figure below

/blog/sparsity/all_norms.svg

Among these norms, the $l_1$ norm is often the preferred choice for regularization because it induces sparsity in the solutions while remaining convex.

Applications

In this section, we will explore several applications where sparsity plays a significant role. Although the problem formulation remains largely unchanged in these applications, the approaches for finding solutions may differ.

Econometrics

Econometrics involves the prediction of quantities related to the economy of businesses or other entities. It encompasses testing and evaluating economic theories and hypotheses that aid governments and businesses in formulating appropriate policies.

Regression is widely employed in econometrics. It is used to establish correlations between a dependent variable ($y$) and a set of independent variables ($x_i$). In linear regression, the dependent variable exhibits a direct relationship with the independent variables, as shown by the equation below:

$$y = \alpha_1x_1 + \alpha_2x_2+\ldots + \alpha_nx_n$$

For instance, if $y$ represents the wage received by an individual working in a company, then $x_1$ could represent the amount of training, $x_2$ could denote the years of experience, and so on.

Note that in the above equation, the weights assigned to the independent variables $x_i$ are denoted by $\alpha_1, \alpha_2,\ldots,\alpha_n$. Here $x_1, x_2,\ldots,x_n$ do not correspond to the entries of the vector $x$ in $Ax=b$; instead, they denote the values in each row of the matrix $A$, while $\alpha_1, \alpha_2,\ldots,\alpha_n$ play the role of $x$ in $Ax=b$.

Recall that the general form of a matrix equation is

$$Ax = b$$

This equation can also be formulated as a regression problem, where the output $b$ is a linear combination of elements in matrix $A$. Specifically, it involves performing matrix-vector multiplication between the elements of $A$ and the vector $x$, as shown below:

$$\begin{aligned} b_1 &= a_{11}x_1+a_{12}x_2+a_{13}x_3 + \ldots \\ b_2 &= a_{21}x_1+a_{22}x_2+a_{23}x_3 + \ldots \end{aligned}$$

Each element of $x$, also referred to as a weight, indicates how much the corresponding feature contributes to the final output. Consequently, the objective of such problems is to determine the optimal values of $x$, denoted $x^*$.

Note: In linear algebra, the same problem can be stated as finding the coefficients $x^*$ that yield the vector in $\text{Span}\langle a_1,a_2,\ldots, a_n\rangle$ closest to $b$.

/blog/sparsity/regression.svg

The closed-form solution is given by:

$$x^* = (A^TA)^{-1}A^Tb$$

Here $A^+ = (A^TA)^{-1}A^T$ is the pseudoinverse, or generalized inverse (more specifically, the Moore-Penrose inverse), of $A$.

Note: The computation of the pseudoinverse itself relies on Singular Value Decomposition (SVD).
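For reference, NumPy's np.linalg.pinv computes this Moore-Penrose inverse via the SVD; when $A$ has full column rank it agrees with the normal-equation form:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 4))    # tall matrix with full column rank
print(np.allclose(np.linalg.pinv(A), np.linalg.inv(A.T @ A) @ A.T))  # True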

The general assumption across regression algorithms is that the feature vectors $a_i$ are independent of each other; in the ideal case, $\{a_i\}$ forms an orthonormal basis:

$$a_1 \perp a_2 \perp a_3 \perp \ldots$$

Implementation

from sklearn.linear_model import LinearRegression

# X holds the feature matrix A and y the output vector b
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_   # fitted bias term and weights x*

We solved the regression problem by finding the optimal values $x^*$. However, if the system is under-determined, meaning there are fewer observations than features, we can leverage the regularization techniques discussed earlier to mitigate the risk of overfitting.
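With scikit-learn, this amounts to swapping in a regularized estimator for the same X and y (the alpha values below are arbitrary and would normally be tuned by cross-validation):

from sklearn.linear_model import Ridge, Lasso

ridge_reg = Ridge(alpha=1.0)       # l2-regularized regression
ridge_reg.fit(X, y)

lasso_reg = Lasso(alpha=0.1)       # l1-regularized regression
lasso_reg.fit(X, y)
lasso_reg.coef_                    # many coefficients are driven exactly to zero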

Netflix Movie Challenge

Netflix launched this challenge in 2006 to improve movie recommendations for its customers. SVD played a central role in many of the top-performing algorithms.

/blog/sparsity/netflix.jpg

The data provided in this challenge is largely incomplete, with many empty entries denoting missing ratings.

$$A = \begin{bmatrix} a_{11}&a_{12}&-&\ldots&a_{1n}\\ a_{21}&-&-&\ldots&a_{2n}\\ \vdots&\vdots&\vdots&&\vdots \\ -&a_{m2}&-&\ldots&a_{mn} \end{bmatrix}$$

Each entry corresponds to a customer’s rating for a specific movie, and each column contains ratings from different customers.

The main goal was to recover the complete matrix from the limited observations by finding the best approximation of ratings for the missing entries. These ratings can then be used to recommend movies to other customers who exhibit similar viewing patterns.

Thus, we essentially aim to obtain an approximation $\hat{A}$ that imputes all the missing entries in $A$, a problem known as matrix completion.

One way to solve this problem is by constraining the rank of $\hat{A}$. Hence, the optimization problem to be solved can be formulated as follows:

$$\min_{\text{rank}(\hat{A})\leq r}||A-\hat{A}||_F$$

or equivalently,

$$\min_{\hat{A}}||A-\hat{A}||_F \\ \text{subject to}\;\text{rank}(\hat{A}) \leq r$$

This problem is non-convex and requires some heuristic knowledge to identify local minima. We start with an initial guess for missing values and then iteratively minimize the rank until convergence.

Another approach involves the nuclear norm, for which exact solutions can be derived. The nuclear norm of a matrix is simply the sum of its singular values and is also known as the trace norm.

We can define a projection operator $P$ over the subset $\Omega$ of observed entries that replaces the missing entries with zeros while leaving the observed entries unchanged:

$$P(A)_{ij} = \begin{cases} a_{ij} & \text{if } (i,j)\in \Omega \\ 0 & \text{if } (i,j)\notin \Omega \end{cases}$$

Therefore, the optimization problem becomes

$$\boxed{\min_{\hat{A}}||P(A)-P(\hat{A})||_F + \mu||\hat{A}||_*}$$

where $||\cdot||_*$ is the nuclear norm.

We again start with an initial guess for the missing entries, compute the SVD of the resulting matrix, and threshold its singular values to obtain new estimates for the missing entries. This process is repeated until convergence.
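A minimal sketch of this iterative singular-value thresholding loop (often called soft-impute; the mask, threshold mu, and iteration count below are assumptions, not part of the original challenge code):

import numpy as np

def soft_impute(A_obs, mask, mu, n_iter=100):
    # A_obs: ratings matrix with zeros at missing entries; mask: True where a rating is observed
    A_hat = A_obs.copy()
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(A_hat, full_matrices=False)
        s = np.maximum(s - mu, 0.0)                 # soft-threshold the singular values
        low_rank = (U * s) @ Vt                     # low-rank reconstruction
        A_hat = np.where(mask, A_obs, low_rank)     # keep observed ratings, update missing ones
    return A_hat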

Note: In practice, we allow for some error in $\hat{A}$ to prevent the model from overfitting.

Similar recommendation systems can be found at Amazon, Target, Walmart, and others.

Video Surveillance

In video surveillance, each input frame $B$ of a video can be divided into a background $L$ (stationary part) and a foreground $E$ (moving parts).

$$B = L+E$$

Since the background is essentially the same across all input frames, it can be represented by a low-rank matrix, which corresponds to a rank minimization problem.

On the other hand, the moving parts only occupy a small portion of the input frame and can be effectively represented by a sparse matrix, involving a regularization problem.

Hence, the resultant problem can be formulated as follows:

$$\min_{L,E}||B-L-E||_F^2 + \lambda||E||_1 \\ \text{subject to}\;\text{rank}(L) \leq k$$

This problem can be solved using singular value decomposition (SVD), where the background matrix $L$ is decomposed into the product $U\Sigma V^T$. Since the rank of $L$ is constrained to $k$, we are only interested in the first $k$ singular values. Thus, we can reformulate the problem as:

$$\min_{L,E}||B-L-E||_F^2 + \lambda||E||_1 \\ \text{subject to}\;||\Sigma||_0 \leq k$$

However, this problem is known to be NP-hard, necessitating the introduction of the nuclear norm. Consequently, the resulting problem becomes:

$$\boxed{\min_{L,E}||B-L-E||_F^2 + \lambda||E||_1+ \mu||L||_*}$$

where $||\cdot||_*$ is the nuclear norm.

Note: In the singular value decomposition of a matrix $A$, written $A=U\Sigma V^{T}$, let $\sigma$ denote the vector of singular values of $A$. Then:

$||\sigma||_0$ equals the rank of $A$

$||\sigma||_1$ equals the nuclear norm of $A$

$||\sigma||_2$ equals the Frobenius norm of $A$
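These identities are easy to verify numerically (on an arbitrary matrix):

import numpy as np

A = np.random.default_rng(0).normal(size=(6, 4))
s = np.linalg.svd(A, compute_uv=False)                     # vector of singular values

print(np.count_nonzero(s), np.linalg.matrix_rank(A))       # ||sigma||_0 vs. rank
print(s.sum(), np.linalg.norm(A, 'nuc'))                   # ||sigma||_1 vs. nuclear norm
print(np.linalg.norm(s), np.linalg.norm(A, 'fro'))         # ||sigma||_2 vs. Frobenius norm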

Compressed Sensing

Compressed sensing, also known as compressive sensing, compressive sampling, or sparse sampling, is a technique that enables efficient acquisition and reconstruction of a signal by solving under-determined linear systems.

The flowchart below illustrates a typical image acquisition process:

/blog/sparsity/compressive_sensing.svg

Compressed sensing is based on the assumption that the image admits a sparse representation, which allows it to be sampled at a rate lower than the Nyquist rate. The image is expected to contain redundancies; an image without any redundancy, such as pure noise, cannot be compressed. These redundancies arise from correlations inherent in the data.

/blog/sparsity/compressive_sensing2.svg

The goal of any compression algorithm is to reduce or eliminate these correlations in the data, identify the importance of each component, and, where feasible, set the least important ones to zero.

$$x^* = \arg\min_x || Ax - b||_2^2 + \lambda||x||_1$$

By sampling the image at a rate ($M$ measurements) lower than the Nyquist rate ($N$), we compress the image. This sparse representation can be easily transmitted or stored and, when required, reconstructed to approximate the quality of the original image. However, some perceptual loss may be present, as compression inherently involves trade-offs. The recovered signal is obtained as $Ax^*$.
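A toy sketch of this pipeline, using a random sensing matrix and the Lasso as the $l_1$ solver (the dimensions, sparsity level, and alpha are arbitrary assumptions):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, M, k = 256, 64, 5                       # signal length, number of measurements, sparsity
x_true = np.zeros(N)
x_true[rng.choice(N, k, replace=False)] = rng.normal(size=k)

A = rng.normal(size=(M, N)) / np.sqrt(M)   # random sensing matrix (M << N)
b = A @ x_true                             # compressed measurements

lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=10000)
lasso.fit(A, b)
x_star = lasso.coef_                       # sparse estimate; A @ x_star approximates b
print(np.linalg.norm(x_star - x_true))     # reconstruction error (small for this setup)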

Consider the following example:

Original Image

/blog/sparsity/gray_image.png

Compressed Image

/blog/sparsity/sparse.png

The original image undergoes compression, resulting in a sparse representation where certain pixels' intensity values are reduced to zero. Although the resultant compressed image is sparse, this method is somewhat naive as it inadvertently discards too many details.

The result of a more effective compression is shown below:

/blog/sparsity/compressed.png

The concept of sparsity finds applications in various other domains, including neuroscience and medical imaging, making it crucial to understand and explore its potential.

