This article aims to delve into the concept of sparsity, a fundamental idea that revolutionized the multimedia industry in the 21st century. We will explore various methods for analyzing systems with sparse representations and discuss specific applications in machine learning and computer vision, where these methods have had a significant impact.
We will begin by understanding how data can be represented in matrix form. Then, we will explore some important definitions and concepts, such as singular value decomposition, which will help us appreciate the intricacies of the algorithms behind these applications.
Keywords: Singular Value Decomposition (SVD), Compressed Sensing, Regression, Recommendation Systems, Econometrics.
Table of Contents
- Solving systems of linear equations
- Singular Value Decomposition (SVD)
- Regularization
- Applications
- References
Solving systems of linear equations
The general form of a matrix equation is:

$$A\mathbf{x} = \mathbf{b}$$

Here $A$ is the system or data matrix, $\mathbf{x}$ is the solution vector, and $\mathbf{b}$ is the output. If we expand each term, we obtain:

$$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}$$

It is important to note that the output $\mathbf{b}$ can also be represented as a linear combination of the column vectors of $A$, using the elements of the vector $\mathbf{x}$ as coefficients, as shown below:

$$\mathbf{b} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n$$
This representation provides a more intuitive understanding of matrix-vector multiplication.
If the equation has no exact solution, which is often the case in real-world problems, we can find the best approximate solution by minimizing the squares of the difference $A\mathbf{x} - \mathbf{b}$, called the least squares method, as shown below:

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2$$

More specifically, the objective is to minimize the sum of squared errors between $A\mathbf{x}$ and the output vector $\mathbf{b}$.

A closed-form solution for this model, when it exists, is provided by the normal equations:

$$A^T A \hat{\mathbf{x}} = A^T \mathbf{b} \quad \Longrightarrow \quad \hat{\mathbf{x}} = (A^T A)^{-1} A^T \mathbf{b}$$
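As a quick illustration, here is a minimal sketch on a small synthetic system (the dimensions, noise level, and true coefficients below are arbitrary choices, not from the original text), comparing the normal-equations solution to NumPy's built-in least-squares solver:

```python
import numpy as np

# Synthetic overdetermined system: 100 observations, 3 unknowns (illustrative only)
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
x_true = np.array([2.0, -1.0, 0.5])
b = A @ x_true + 0.01 * rng.normal(size=100)   # noisy output vector

# Normal equations: solve (A^T A) x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Library least-squares solver, for comparison
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal)   # both estimates are close to x_true
print(x_lstsq)
```

In practice, solvers based on QR or SVD factorizations are usually preferred over explicitly forming $A^T A$, which can be numerically ill-conditioned.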
However, if the system is under-determined, i.e., it contains more unknowns than equations, this closed-form solution does not exist ($A^T A$ is not invertible). For any under-determined system, the number of columns of $A$ exceeds the number of rows ($n > m$), indicating a greater number of features than observations or samples. Furthermore, these systems often have infinitely many solutions, leading to overfitting of the data and compromising the model's generalization capability.
To address such problems, we restrict the solution space by incorporating prior knowledge about the data through regularization.
Singular Value Decomposition (SVD)
Before delving into regularization, let's look at Singular Value Decomposition (SVD), a fundamental concept in linear algebra. SVD asserts that any matrix $A$ can be decomposed (or factorized) into a product of three matrices $U$, $\Sigma$, and $V^T$:

$$A = U \Sigma V^T$$

Here, both $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix.

We refer to the diagonal values $\sigma_i$ of $\Sigma$ as singular values, the columns $\mathbf{u}_i$ of $U$ as left singular vectors, and the columns $\mathbf{v}_i$ of $V$ as right singular vectors.

The singular values are non-negative and are arranged in descending order, i.e., $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r \geq 0$, indicating the relative importance of each term in the expansion

$$A = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^T$$

Each term $\sigma_i \mathbf{u}_i \mathbf{v}_i^T$ in the above sum has rank one, since it depends only on one column vector and one row vector (recall that for any matrix the row rank equals the column rank). Consequently, the matrix $A$ can be approximated using the first $k$ terms, giving the closest rank-$k$ representation of the matrix $A$.

For instance, the best rank-one approximation of the matrix $A$ is given by $\sigma_1 \mathbf{u}_1 \mathbf{v}_1^T$.
This concept of decomposing the matrix into a sum of rank-one matrices can be employed to solve the rank minimization problem, formulated as

$$\min_{\hat{A}} \|A - \hat{A}\|_F \quad \text{subject to} \quad \operatorname{rank}(\hat{A}) = k$$

where $\|\cdot\|_F$ is called the Frobenius norm.

Essentially, we are trying to find the optimal rank-$k$ approximation, denoted by $\hat{A}$, of the matrix $A$ by minimizing the Frobenius norm between $A$ and $\hat{A}$.

The Frobenius norm of a matrix is the square root of the sum of its squared entries:

$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2}$$

The solution to this problem is inherently sparse, and it can be obtained using singular value decomposition by keeping the first $k$ singular values and setting the remaining ones to zero.
Several variants of singular value decomposition exist, including principal component analysis (PCA), Robust PCA, linear discriminant analysis (LDA), and more.
Note: Interestingly, the right singular vectors are the eigenvectors of the matrix $A^T A$, while the left singular vectors are the eigenvectors of $A A^T$. The eigenvalues of the symmetric positive semidefinite matrix $A^T A$ are the squares of the corresponding singular values of $A$.
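As a minimal sketch of the ideas above, using NumPy's SVD on an arbitrary synthetic matrix (the matrix and the target rank k are illustrative assumptions, not part of the original discussion):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 6))
k = 2   # target rank, chosen arbitrarily for illustration

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the first k terms of the expansion sum_i s_i * u_i * v_i^T
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error of the rank-k approximation equals the norm of the discarded singular values
print(np.linalg.norm(A - A_k, 'fro'))
print(np.sqrt(np.sum(s[k:] ** 2)))
```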
Regularization
As mentioned earlier, regularization allows us to restrict the solution space by incorporating prior knowledge into the problem, such as sparsity or smallness of the solution vector $\mathbf{x}$.

By constraining the solutions of $A\mathbf{x} = \mathbf{b}$ to be sparse, the system becomes more robust, preventing the model from overfitting.
Regularization can be achieved using a norm of the solution vector. A norm is a function defined on a vector space that satisfies specific properties: absolute homogeneity (scalability), the triangle inequality (additivity), and positive definiteness, i.e., it assumes the value zero only when the input vector is zero.

When employing the $\ell_2$ norm for regularization, the problem of finding the optimal solutions is known as Ridge. On the other hand, if the model is regularized using the $\ell_1$ norm, the problem of finding the optimal solutions is called Lasso.

Among these norms, only the $\ell_0$ norm and the $\ell_p$ norms with $p \leq 1$ have been shown to yield sparse solutions. Nevertheless, we shall discuss all of these norms in the upcoming sections.
$\ell_2$-Norm

The $\ell_2$ norm, also known as the Euclidean norm, of a vector $\mathbf{x}$ is defined as:

$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i} x_i^2}$$

As we mentioned earlier, we are trying to find the optimal solutions for the least-squares problem:

$$\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2$$

By introducing $\ell_2$ norm regularization to the equation $A\mathbf{x} = \mathbf{b}$, we get

$$\min_{\mathbf{x}} \|\mathbf{x}\|_2^2 \quad \text{subject to} \quad A\mathbf{x} = \mathbf{b}$$
The solution space in 2D is shown below
While the $\ell_2$ norm can effectively regularize the model, it does not yield sparse solutions.

Alternatively, we can swap the constraint and objective and rewrite the problem as:

$$\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 \quad \text{subject to} \quad \|\mathbf{x}\|_2^2 \leq t$$

This can be reformulated into the Lagrangian form as shown below:

$$\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 + \lambda \|\mathbf{x}\|_2^2$$

Notice how the extra term $\lambda \|\mathbf{x}\|_2^2$ contributes to the regularization of the model.

This way of regularizing the model using the $\ell_2$ norm is known as the Ridge problem, where the hyperparameter $\lambda$ controls the amount of regularization. It is worth noting that both $t$ and $\lambda$ constrain the values of $\mathbf{x}$, and their optimal values depend on the data.
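A minimal sketch of Ridge in scikit-learn, assuming purely synthetic data; the library's alpha parameter plays the role of the hyperparameter $\lambda$ above:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with only a few informative features (illustrative only)
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))
w_true = np.zeros(10)
w_true[:3] = [1.5, -2.0, 0.7]
y = X @ w_true + 0.1 * rng.normal(size=50)

ridge = Ridge(alpha=1.0)   # alpha corresponds to the regularization strength lambda
ridge.fit(X, y)
print(ridge.coef_)         # coefficients are shrunk toward zero, but rarely exactly zero
```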
$\ell_0$ Norm

The $\ell_0$ norm of a vector $\mathbf{x}$ is given by the number of non-zero entries of $\mathbf{x}$:

$$\|\mathbf{x}\|_0 = \#\{i : x_i \neq 0\}$$

Using $\ell_0$ norm regularization, the problem becomes

$$\min_{\mathbf{x}} \|\mathbf{x}\|_0 \quad \text{subject to} \quad A\mathbf{x} = \mathbf{b}$$
The solution space in 2D is shown below:
Although $\ell_0$ norm regularization provides sparse solutions, they are not unique. Furthermore, solving this problem is NP-hard due to the non-convex nature of the $\ell_0$ norm. However, certain greedy algorithms such as Matching Pursuit can be employed to tackle such problems.
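A minimal sketch of one such greedy method, Orthogonal Matching Pursuit, as implemented in scikit-learn; the dictionary, the 3-sparse ground truth, and the requested sparsity level below are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

# Synthetic dictionary and a 3-sparse ground-truth coefficient vector
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 20))
w_true = np.zeros(20)
w_true[[2, 7, 11]] = [1.0, -3.0, 2.0]
y = X @ w_true

# Greedily select at most 3 non-zero coefficients
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3)
omp.fit(X, y)
print(np.nonzero(omp.coef_)[0])   # indices of the recovered support
```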
$\ell_1$-Norm

The $\ell_1$ norm of a vector $\mathbf{x}$ is given by

$$\|\mathbf{x}\|_1 = \sum_{i} |x_i|$$
By incorporating the $\ell_1$ norm into the problem, we can obtain sparse solutions that effectively eliminate the impact of less important features. For instance, if the columns of $A$ are orthonormal, the $\ell_1$-regularized least-squares problem decouples coordinate-wise into

$$\min_{x_i} \; (x_i^{LS} - x_i)^2 + \lambda |x_i|$$

Then the solutions are simply the soft-thresholded least-squares coefficients,

$$\hat{x}_i = \operatorname{sign}(x_i^{LS}) \max\!\left(|x_i^{LS}| - \tfrac{\lambda}{2}, \, 0\right),$$

which set small (less important) coefficients exactly to zero.
Therefore, the problem to solve becomes

$$\min_{\mathbf{x}} \|\mathbf{x}\|_1 \quad \text{subject to} \quad A\mathbf{x} = \mathbf{b}$$
The solution space in 2D is shown below
Similar to Ridge, we can swap the constraint and objective and rewrite the problem as:

$$\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 \quad \text{subject to} \quad \|\mathbf{x}\|_1 \leq t$$

This is called the Lasso problem, where the constraint $\|\mathbf{x}\|_1 \leq t$ limits the values of $\mathbf{x}$, forcing the optimization algorithm to produce sparse solutions.

The same problem can be reformulated into the Lagrangian form as shown below:

$$\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 + \lambda \|\mathbf{x}\|_1$$

Though the Lasso yields sparse solutions, unlike Ridge there is no closed-form solution. However, various methods, including Matching Pursuit, Orthogonal Matching Pursuit, the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), and the Alternating Direction Method of Multipliers (ADMM), can be employed to solve the optimization problem.
Note: Sometimes the Lasso problem is also called a Basis Pursuit problem. Essentially both refer to the same $\ell_1$-minimization idea and the terms are often used interchangeably, although Lasso is more common in the statistics and machine learning literature, while Basis Pursuit is more common in signal processing.
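A minimal sketch of the Lagrangian (penalized) form of the Lasso in scikit-learn, assuming synthetic data; here the library's alpha again plays the role of $\lambda$:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data generated from a sparse ground-truth weight vector (illustrative only)
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 20))
w_true = np.zeros(20)
w_true[[0, 5]] = [2.0, -1.5]
y = X @ w_true + 0.05 * rng.normal(size=50)

lasso = Lasso(alpha=0.1)   # larger alpha -> stronger l1 penalty -> sparser solution
lasso.fit(X, y)
print(lasso.coef_)         # most coefficients are driven exactly to zero
```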
$\ell_p$-Norm

The generalized $\ell_p$-norm of a vector $\mathbf{x}$ is given by

$$\|\mathbf{x}\|_p = \left( \sum_{i} |x_i|^p \right)^{1/p}$$

where $0 < p < 1$.

By incorporating the $\ell_p$ norm into the given problem, we obtain:

$$\min_{\mathbf{x}} \|\mathbf{x}\|_p \quad \text{subject to} \quad A\mathbf{x} = \mathbf{b}$$
The solution space in 2D is shown below
By swapping the constraint and objective we can rewrite the problem as:

$$\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 \quad \text{subject to} \quad \|\mathbf{x}\|_p \leq t$$

This formulation can also be expressed in the Lagrangian form:

$$\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 + \lambda \|\mathbf{x}\|_p^p$$

While $\ell_p$ norm regularization allows us to obtain sparse solutions, it poses challenges due to the non-convex nature of the norm for $p < 1$, making the problem NP-hard.
All the regularization norms are summarized in the figure below
Among these norms, the $\ell_1$ norm is often considered the preferred choice for regularization, since it induces sparsity in the solutions while keeping the problem convex.
Applications
In this section, we will explore several applications where sparsity plays a significant role. Although the problem formulation remains largely unchanged in these applications, the approaches for finding solutions may differ.
Econometrics
Econometrics involves the prediction of quantities related to the economy of businesses or other entities. It encompasses testing and evaluating economic theories and hypotheses that aid governments and businesses in formulating appropriate policies.
Regression is widely employed in econometrics. It is used to establish relationships between a dependent variable ($y$) and a set of independent variables ($x_1, x_2, \ldots, x_n$). In linear regression, the dependent variable exhibits a direct (linear) relationship with the independent variables, as shown by the equation below:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$

For instance, if $y$ represents the wage received by an individual working in a company, then $x_1$ could represent the amount of training, $x_2$ could denote the years of experience, and so on.

Note that in the above equation, the weights assigned to the independent variables are denoted by $\beta_i$; the independent variables themselves do not correspond to the values of the vector $\mathbf{x}$ in $A\mathbf{x} = \mathbf{b}$. Instead, the independent variables $x_1, \ldots, x_n$ form the rows of the matrix $A$ (one row per observation), while the weights $\beta_i$ play the role of $\mathbf{x}$ in $A\mathbf{x} = \mathbf{b}$.
We know the general form of the matrix equation is

$$A\mathbf{x} = \mathbf{b}$$

This equation can also be formulated as a regression problem, where the output is a linear combination of the elements in the matrix $A$. Specifically, it involves performing a matrix-vector multiplication between the elements of $A$ and the vector $\mathbf{x}$, as shown below:

$$b_i = \sum_{j=1}^{n} a_{ij} x_j, \quad i = 1, \ldots, m$$

Each element $x_j$, also referred to as a weight, indicates the extent to which the corresponding feature contributes to the final output. Consequently, the objective of such problems revolves around determining the optimal values of $\mathbf{x}$, denoted as $\mathbf{x}^*$.

Note: In linear algebra, the same problem can be stated as finding the coefficients $x_j$ that yield a vector in the column space of $A$ which is closest to $\mathbf{b}$.

The closed-form solution is given by:

$$\mathbf{x}^* = A^{+}\mathbf{b}$$

where $A^{+}$ is the pseudoinverse or generalized inverse (more specifically, the Moore-Penrose inverse) of $A$.
Note: The computation of the pseudoinverse itself relies on Singular Value Decomposition (SVD).
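A quick numerical check, on arbitrary synthetic data, that the pseudoinverse solution coincides with the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(30, 5))
b = rng.normal(size=30)

x_pinv = np.linalg.pinv(A) @ b                    # Moore-Penrose pseudoinverse (computed via SVD)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # direct least-squares solve
print(np.allclose(x_pinv, x_lstsq))               # True
```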
The general assumption across all these regression algorithms is that the feature vectors are independent of each other, i.e., the columns of $A$ form an orthonormal basis.
Implementation
# X: (n_samples, n_features) design matrix, y: vector of target values
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_  # fitted bias term and weight vector
We solved the regression problem by finding the optimal values $\mathbf{x}^*$. However, if the system is under-determined, meaning the dataset is relatively small (fewer observations than features), we can leverage the different regularization techniques discussed earlier to mitigate the risk of overfitting.
Netflix Movie Challenge
Netflix launched this challenge in 2006 to improve movie recommendations for its customers. SVD played a central role in many of the top-performing algorithms.

The provided data in this challenge is largely incomplete, with several empty entries denoting missing ratings.
Each entry corresponds to a customer's rating for a specific movie, and each column contains the ratings of a different customer.
The main goal was to recover the complete matrix $X$ from the limited observations by finding the best approximation of the ratings for the missing entries. These ratings can then be used to recommend movies to other customers who exhibit similar viewing patterns.

Thus, we essentially aim to obtain an approximation $\hat{X}$ that imputes all the missing entries in $X$, a problem known as matrix completion.
One way of solving this problem is by constraining the rank of the approximation $\hat{X}$. Hence, the optimization problem to be solved can be formulated as follows:

$$\min_{\hat{X}} \ \operatorname{rank}(\hat{X}) \quad \text{subject to} \quad \hat{X}_{ij} = X_{ij} \ \ \forall (i, j) \in \Omega$$

or equivalently,

$$\min_{\hat{X}} \ \sum_{(i, j) \in \Omega} \left( X_{ij} - \hat{X}_{ij} \right)^2 \quad \text{subject to} \quad \operatorname{rank}(\hat{X}) \leq k$$

where $\Omega$ denotes the set of observed entries.
This problem is non-convex and requires some heuristic knowledge to identify local minima. We start with an initial guess for the missing values and then iteratively minimize the rank until convergence.
Another approach involves utilizing the nuclear norm, for which exact solutions can be derived. The nuclear norm of a matrix is simply the sum of its singular values and is also known as the trace norm.
We can define a projection operator $P_\Omega$, over the subset $\Omega$ of observed entries, that replaces the missing entries with zeros while leaving the observed entries unchanged:

$$\left[ P_\Omega(X) \right]_{ij} = \begin{cases} X_{ij} & \text{if } (i, j) \in \Omega \\ 0 & \text{otherwise} \end{cases}$$

Therefore, the optimization problem becomes

$$\min_{\hat{X}} \ \|\hat{X}\|_* \quad \text{subject to} \quad P_\Omega(\hat{X}) = P_\Omega(X)$$

where $\|\cdot\|_*$ is the nuclear norm.
We again start with an initial guess for the missing entries, compute the SVD of the resulting filled-in matrix, and then threshold its singular values to obtain new estimates for the missing entries. This process is repeated until the problem converges.

Note: In practice, we allow for some error in the observed entries of $\hat{X}$ to prevent the model from overfitting.
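A minimal soft-impute style sketch of this iteration on a small synthetic ratings matrix; the matrix, the singular value threshold tau, and the stopping rule are all illustrative assumptions rather than the challenge's actual setup:

```python
import numpy as np

# Synthetic low-rank "ratings" matrix with roughly half of its entries missing
rng = np.random.default_rng(6)
X_full = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 15))
mask = rng.random(X_full.shape) < 0.5            # observed entries (the set Omega)
X_obs = np.where(mask, X_full, 0.0)              # P_Omega(X): missing entries set to zero

X_hat = X_obs.copy()                             # initial guess
tau = 1.0                                        # singular value threshold (illustrative)
for _ in range(200):
    U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
    X_low = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt   # soft-threshold the singular values
    X_new = np.where(mask, X_obs, X_low)         # keep observed ratings, impute the rest
    if np.linalg.norm(X_new - X_hat, 'fro') < 1e-6:
        break
    X_hat = X_new

print(np.abs(X_hat - X_full)[~mask].mean())      # average error on the missing entries
```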
Similar applications of recommendation systems can be found in Amazon.com, Target, Walmart, and others.
Video Surveillance
In video surveillance, each input frame of a video can be divided into background (stationary part) and foreground (moving parts).
Since the background is essentially the same across all input frames, it can be represented by a low-rank matrix $L$, which corresponds to a rank minimization problem.

On the other hand, the moving parts occupy only a small portion of each input frame and can be effectively represented by a sparse matrix $S$, which corresponds to a sparsity-inducing regularization problem.
Hence, the resultant problem can be formulated as follows:

$$\min_{L, S} \ \operatorname{rank}(L) + \lambda \|S\|_0 \quad \text{subject to} \quad X = L + S$$
This problem can be solved using singular value decomposition (SVD), where the background matrix $L$ is decomposed into the product $U \Sigma V^T$. Since the rank of $L$ is constrained to $k$, we are only interested in the first $k$ singular values. Thus, we can reformulate the problem as:

$$\min_{L, S} \ \|X - L - S\|_F^2 \quad \text{subject to} \quad \operatorname{rank}(L) \leq k, \ \ \|S\|_0 \leq s$$
However, this problem is known to be NP-hard, necessitating the introduction of the nuclear norm. Consequently, the resulting problem becomes:

$$\min_{L, S} \ \|L\|_* + \lambda \|S\|_1 \quad \text{subject to} \quad X = L + S$$

where $\|\cdot\|_*$ is the nuclear norm.
Note: In the singular value decomposition of a matrix $X$, denoted by $X = U \Sigma V^T$, let $\boldsymbol{\sigma}$ represent the vector containing the singular values of $X$. Then (as verified numerically in the sketch after this list):

- $\|\boldsymbol{\sigma}\|_0$ equals the rank of $X$
- $\|\boldsymbol{\sigma}\|_1$ equals the nuclear norm of $X$
- $\|\boldsymbol{\sigma}\|_2$ equals the Frobenius norm of $X$
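These relations are easy to verify numerically on an arbitrary (synthetic) matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 5))   # a rank-3 matrix
s = np.linalg.svd(X, compute_uv=False)                  # vector of singular values

print(np.sum(s > 1e-10), np.linalg.matrix_rank(X))      # l0 of s equals the rank
print(np.sum(s), np.linalg.norm(X, 'nuc'))              # l1 of s equals the nuclear norm
print(np.sqrt(np.sum(s**2)), np.linalg.norm(X, 'fro'))  # l2 of s equals the Frobenius norm
```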
Compressed Sensing
Compressed sensing, also known as compressive sensing, compressive sampling, or sparse sampling, is a technique that enables efficient acquisition and reconstruction of a signal by solving under-determined linear systems.
The flowchart below illustrates a typical image acquisition process:
Compressed sensing is based on the assumption that the image has a sparse representation, which makes it possible to sample at a rate lower than the Nyquist rate. The image is expected to have some redundancies; an image without any redundancy, such as pure noise, cannot be compressed. These redundancies arise due to correlations inherent in the data.

The goal of any compression algorithm is to reduce or eliminate these correlations in the data, identifying the importance of each and, where feasible, reducing some of them to zero.

By sampling the image at a rate ($f_s$) lower than the Nyquist rate ($f_N$), we compress the image. This sparse representation of the image can easily be transmitted or stored, and when required, it can be reconstructed to approximate the quality of the original image. However, some perceptual redundancies or losses may be present, as compression inherently involves trade-offs. The signal can then be recovered by solving the resulting under-determined linear system with a sparsity-promoting ($\ell_1$) formulation such as Basis Pursuit.
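As a toy sketch in the spirit of compressed sensing (the signal length, number of measurements, sparsity level, and penalty below are all arbitrary assumptions), a sparse signal is measured with a random sensing matrix using far fewer samples than its length and then recovered with an $\ell_1$ solver:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sparse ground-truth signal: length 200 with only 5 non-zero entries
rng = np.random.default_rng(8)
n, m, k = 200, 60, 5
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)

Phi = rng.normal(size=(m, n)) / np.sqrt(m)   # random sensing matrix, m << n
y = Phi @ x_true                             # compressed measurements

# l1-regularized recovery from the under-determined system y = Phi x
lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=100000)
lasso.fit(Phi, y)
x_hat = lasso.coef_

print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))   # small relative error
```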
Consider the following example:
Original Image

Compressed Image

The original image undergoes compression, resulting in a sparse representation where certain pixels' intensity values are reduced to zero. Although the resultant compressed image is sparse, this method is somewhat naive as it inadvertently discards too many details.
The result of a more effective compression is shown below:

The concept of sparsity finds applications in various other domains, including neuroscience and medical imaging, making it crucial to understand and explore its potential.
References
- Statistical Learning with Sparsity: The Lasso and Generalizations
- Machine Learning: A Bayesian and Optimization Perspective, Chapters 9 and 10
- The Netflix Recommender System: Algorithms, Business Value, and Innovation, by Carlos A. Gomez-Uribe and Neil Hunt, Netflix, Inc.