# Overcoming the Challenges of Singular Matrices in Linear Algebra

Written on

In the realm of linear algebra, singular matrices—those with a determinant of zero—pose a significant challenge. For many, encountering such a matrix feels like a dead end. However, recent advancements have provided tools to navigate around this obstacle. Let’s delve into these solutions.

Why is it that a determinant of zero creates such difficulties? The primary reason lies in the fact that we cannot compute the inverse of a singular matrix. This limitation is illustrated by the equivalence known as the Adjoint Method for finding matrix inverses.

The implication of a zero determinant is straightforward: it results in division by zero. This indicates that a matrix with a determinant of zero is classified as a singularity. In mathematical terms, this means that the value of the inverse of such a matrix is either "infinite" or "undefined."

> I previously discussed this concept, highlighting that a zero determinant signifies a loss of information, akin to attempting to retrieve data from nothing. I likened the null space to a black hole in mathematics: once information is lost, it cannot be regained.

## Null Space: The Black Hole of Mathematics

### Why Do Neural Networks Learn?

The connection between the null space and information loss raises questions about learning systems.

**My perspective** is that the inverse of A isn't actually infinite, as one might intuitively think. Rather, it suggests that multiple solutions may exist simultaneously, or potentially no solution at all. The core idea is that the value is not singular, leading to terms like "indeterminate" or "undefined."

### How to Compute the Adjoint of a Matrix

To obtain the adjoint of matrix A, we rearrange A using specific determinant calculations, which you may recognize from the cofactor matrix. Transitioning from the cofactor to the adjoint matrix involves a single step: transposition. The adjoint matrix is simply the cofactor matrix of the transposed matrix.

It's essential to note that the cofactor of the transposed matrix is generally not labeled as coff(A.T); it's typically expressed as the transpose of the adjoint matrix.

The crux of the matter is that the determinant appears in the denominator when calculating the matrix inverse. If it equals zero, we face a division by zero scenario.

So, what causes this singularity? Why do we encounter this "infinity" issue?

## The Intuition Behind the Technical Problem

### Space Domain Perspective

The concept of matrix multiplication can be understood as a transformation of vectors from one vector space (the domain) to another (the codomain), specifically to a subspace of the codomain, known as the range.

While we can contain information in spaces L and L’, using a singular matrix to multiply means information in L’ may be lost.

From another viewpoint, multiplying by a matrix can be seen as projecting onto its row space, while the null space represents the perpendicular dimension, effectively discarding information during the projection process.

### Projection Perspective

Let's explore this analogy further:

**Case 1:**If we know vectors**a**and**v**, we can determine the projection of**a**in the direction of**v**.

However, the situation changes for:

**Case 2:**If we only know the projection of vector**b**onto vector**v**, we cannot ascertain which original vector led to that projection. Numerous vectors could yield the same projection, making it impossible to identify the original.

> Much like Plato's Allegory of the Cave, can we truly discern the original source of the vector/knowledge by merely observing its projections/shadows?

Thus, to successfully apply a matrix transformation (multiply by A) and subsequently reverse it (multiply by the inverse of A), we need assurance that no information has been lost in the null space. We require the projection to be thorough to undo it without any deficit of information.

How do we achieve this? By ensuring we maintain as many linearly independent dimensions as possible during the projection, allowing for the unprojected portion to be preserved in another vector basis.

The determinant serves as a quick reference to indicate whether the row space is capable of storing information (1-to-1 mapping); a zero determinant suggests the presence of a null space.

### Mapping Functions Perspective

This correlates to the definition of functions. When a matrix's determinant is non-zero, it indicates that the function is **bijective**, meaning every vector is guaranteed a 1-to-1 mapping, allowing us to navigate forward and backward without concern.

In contrast, when the determinant is zero, the bijection is lost. For example, both **3** and **4** might map to **C**, but which should be the backward mapping? The mathematical model lacks the ability to resolve this, leading to indeterminacy (division by zero). We can design our model to avoid this indeterminacy, allowing mapping from one vector to many instead.

For instance, using numbers instead of matrices, if we have **a*x=b**, or specifically **2*x=8**, we can divide both sides by **2** to find **x=4**, as **2** has an inverse.

The issue with **det(A)=0** in matrix algebra is that we cannot simplify it similarly when the inverse does not exist, making it akin to **a*x=8** without any way to solve for **x**.

## Three Cases of Matrix Dimensions

When analyzing matrix equations, the focus is primarily on their internal dimensionality (or rank) rather than merely the number of vectors. Specifically, we consider the number of rows (m) and columns (n).

Here are the three cases:

**m=n:**Square matrix, equal rows and columns.**m>n:**Rectangular matrix, more rows than columns (tall matrix).**m<n:**Rectangular matrix, more columns than rows (wide matrix).

To reiterate, we must pay special attention to the number of rows and columns.

Even if there are 5 rows, if 3 of them exist within the span of the others, the effective dimensionality of those 5 vectors is only 2. Using a color analogy, if we have two colors, red and blue, the other 3 vectors may merely be a combination of the two.

For instance, having 10 vectors in a plane means it is two-dimensional (2D) regardless of how many vectors lie within it.

Here, concepts of orthogonality and linear independence play a crucial role, represented by the rank of the matrix and the basis of a vector space.

When 5 vectors lie in a plane, there are only two independent bases to describe that 2D space.

Every basis correlates with an associated eigenvalue. When the determinant of a matrix equals zero, which is equivalent to the product of its eigenvalues, it indicates at least one eigenvalue must be zero, although this is typically overlooked.

This zero eigenvalue situation arises when **m>n** or **m<n**.

With this understanding, we know that only square matrices possess inverses (and non-zero eigenvalues), while non-square or rectangular matrices do not.

For example, a "3x3 matrix" might contain a linearly dependent vector, effectively making it a 3x2 or 2x3 matrix at its core.

It's important to note that this is a personal convention; typically, a 3x3 matrix is referred to as such, even if it contains linearly dependent vectors. The focus here is on dimensionality, as linear dependent vectors do not convey more information than the independent bases.

While exploring real applications is valuable, for clarity, we can initially focus solely on data without delving into abstractions or interpretations. Transitioning from one concept to another is manageable. When solving systems of linear equations, we consider three cases:

### Case 1 (m=n): Well-Constrained Systems

When centered at the origin, a single point can define a line (1D); two points define a plane (2D); and three points can define a volume (3D), provided they are linearly independent.

Similarly, this can be applied to polynomial degrees:

- One point describes a constant function.
- Two points describe a unique line.
- Three points yield a unique quadratic equation.
- Four points yield a cubic function, and so forth.

The need for an additional point correlates to the polynomial degree not being centered at zero, known as the **initial condition f(0)=a**, which can relate to the bias term in machine learning, additional dimensions in homogeneous matrices, or projective dimensions in projective geometric algebra.

> Notice the uniqueness present here—a 1-to-1 mapping is indicative of a bijection, which is expected to yield a non-zero determinant.

To illustrate, consider defining a cubic equation, such as:

f(x) = a + bx + cx² + dx³

In a matrix context, we represent equations in rows and variables in columns.

Now, let's evaluate the cubic polynomial at four points A=(A_x,A_y), B=(B_x,B_y), C=(C_x,C_y), and D=(D_x,D_y):

This results in a 4x4 matrix. If we calculate the determinant, we find...

If we solve using the matrix inverse, this is straightforward.

### Case 2 (m>n): Over-Constrained Systems

What happens when there are more rows than columns? For instance, if we add two more rows to the previous example.

Each row represents an equation; hence adding two rows means adding two equations. In this cubic polynomial example, if we add two additional points, we are now attempting to fit a cubic equation to six points.

Is this feasible? Can a cubic polynomial fit six distinct points? It can, but only if we select redundant points on the curve. What about points in arbitrary locations? That would be impossible.

For clarity, consider a line trying to pass through any four points. It can easily fit two points, but fitting four points in any arrangement? Highly unlikely.

This illustrates why the inverse fails in this scenario. The outcome is undetermined.

We continue to write the matrix equation in hopes of finding a solution.

### Case 3 (m<n): Under-Constrained Systems

Now, we encounter more columns than rows, indicating more parameters than equations—or, in simpler terms, fewer equations than needed. This surplus of parameters leads to a phenomenon known as degrees of freedom.

For instance, for a cubic function, four points are necessary for bijective mapping. If we have fewer than three points, we end up with an under-constrained system.

For a cubic function, perhaps we have two points. For a line requiring two points, what if we only have one?

Can we define a line with just a single point? This time, the answer isn’t simply undetermined. In Case 2, there were no possible answers, but here, we have too many. We could draw countless lines through that single point.

For cubic functions, an infinite number of configurations exists, breaking the bijection mapping.

But there is a solution.

## The Solution

Now that we understand why reversing the process becomes impossible when **det(A)=0**, as we lose information that cannot be retrieved, we have two potential alternatives:

- What if we avoid losing information altogether? By incorporating additional values to retain information, we can bypass this problem.
- Alternatively, we might not worry about the lost information and instead seek the closest approximation, presenting an optimization issue.

This narrative will focus on the latter.

To illustrate, imagine you witnessed a person's face and subsequently saw their “wanted” poster, but your vision was blurred, preventing you from accurately identifying the individual. In an ideal scenario, we want a photorealistic drawing or a clear photograph, but our recollection of the face is imprecise.

The blurred image signifies that you lost information about the exact appearance (similar to the null space).

However, while you might not find someone who matches the blurred image perfectly, you can still identify a person who closely resembles it (perhaps a distinctive mustache suffices). This exemplifies optimization—finding the best fit based on available but incomplete information.

In this scenario, we disregard the lost information by minimizing certain characteristics.

Can we fully recover the information? No, not with 1-to-1 accuracy, but by utilizing what remains, we can narrow down potential matches.

Now, we must interrogate each individual—how do we select one?

By choosing the one whose characteristics most closely align with the alternatives.

## The Pseudo-Inverse

Commonly referred to as the Moore-Penrose pseudo-inverse, this generalized form of the inverse allows us to reverse processes even when the rank of matrix A is such that **det(A)!=0**. In this case, both the left and right pseudo-inverses are identical and equivalent to the inverse we know.

This concept emerged relatively recently, in the 1950s, providing a method to reverse processes using projections over the column or row space. However, it’s crucial to reiterate that we cannot recover the information lost in the null space; instead, we find solutions that best fit based on the information we still possess.

For simplicity within linear algebra, we aim for the closest point to the column space, achieved through orthogonal projection.

### Right or Left Pseudo-Inverse?

Two types of pseudo-inverses exist: left and right. This distinction arises from the necessity of the dimensionality linked to the number of columns versus rows. In over-constrained systems (more rows), the column space is mandatory, while in under-constrained systems (more columns), the row space takes precedence.

What the pseudo-inverse does is project onto the essential space, resulting in an equivalent square matrix.

For example, if there are more rows (m) than columns (n), the dimension linked to n becomes crucial. In a 5x3 matrix, the dimensionality is 3, while the null space remains 2-dimensional.

We aim to transition from the null space to a 3x3 matrix instead of a 5x3.

By projecting the five row vectors into the column space (three-dimensional), those five rows become three.

Why does this method work? Remember that matrix multiplication functions essentially as a projection onto multiple vectors simultaneously—more accurately, it projects onto a new vector space.

Why does this process minimize differences? When we project, we determine the shortest distance from one point (or vector) to another. For example, to find the closest path from a point to a line, the orthogonal projection achieves this.

## Off-Topic Crossover

This discussion relates to:

### What Does it Mean for a Determinant to Equal Zero?

A determinant of zero signifies that a matrix is singular, lacking an inverse.

### Null Space: The Black Hole of Mathematics

Exploring the relationship between the null space and information loss can shed light on why neural networks function effectively.

This narrative serves as an introduction to regression analysis and highlights how neural networks, from a linear algebra perspective, fundamentally revolve around regression—or even better, projections.

While not explicitly mentioned, taller matrices (characterized in over-constrained systems) function primarily as encoders, while wider matrices (in under-constrained systems) act as decoders. This pattern is prevalent in many neural network architectures, depicting encoding-decoding systems.

The pseudo-inverse is also utilized in neural networks, yet it is calculated iteratively through gradient descent instead of the Moore-Penrose method discussed here.

Both approaches tend to yield similar results for optimization problems, although gradient descent employs heuristic methods with additional constraints to mitigate issues like overfitting or custom loss functions. **Note: This does not apply to activation functions.**

Upcoming narratives may address finding line equations for two points, polynomial equations for three points, and generalizations for n points (with non-zero determinants). Topics will include polynomial regression, optimization techniques, and both left and right side linear regression.

## References

[1] Singular Matrix — from Wolfram MathWorld [2] Lesson Explainer: Inverse of a Matrix: The Adjoint Method | Nagwa [3] Bijection, Injection, and Surjection — Wikipedia [4] Visualizing the Four Fundamental Spaces ~ Advanced LAFF [5] The Four Fundamental Subspaces ~ MIT OpenCourseWare