Least Squares and Regression
The Least Squares method is a fundamental technique in both linear algebra and statistics, widely used for solving over-determined systems and performing regression analysis. This article explores the mathematical foundation of the Least Squares method, its application in regression, and how matrix algebra is used to fit models to data.
1. Introduction to Least Squares
1.1 What is the Least Squares Method?
The Least Squares method is a mathematical procedure used to find the best-fitting solution to a system of linear equations that may not have an exact solution. It does this by minimizing the sum of the squared differences (residuals) between the observed values and the values predicted by the model.
1.2 Why Use Least Squares?
- Over-Determined Systems: In many real-world problems, we encounter systems where there are more equations than unknowns. The Least Squares method provides a way to find an approximate solution that best fits all the given equations.
- Regression Analysis: Least Squares is the cornerstone of regression analysis, where the goal is to fit a model to data by minimizing the prediction errors.
2. Least Squares in Linear Algebra
2.1 The Mathematical Formulation
Given an over-determined system $Ax = b$, where $A$ is an $m \times n$ matrix with $m > n$, the Least Squares solution $\hat{x}$ minimizes the squared error:
$$\min_{x} \|Ax - b\|_2^2$$
This minimization leads to the Normal Equations:
$$A^T A \hat{x} = A^T b$$
2.2 Solving the Normal Equations
To find the Least Squares solution:
- Compute $A^T A$ and $A^T b$.
- Solve the resulting system of equations $A^T A \hat{x} = A^T b$ to find $\hat{x}$.
Example: Consider an over-determined system $Ax = b$ with more equations than unknowns. To solve it using Least Squares:
- Compute $A^T A$ and $A^T b$.
- Solve the normal equations $A^T A \hat{x} = A^T b$.
- The solution $\hat{x}$ is the vector that makes $\|Ax - b\|^2$ as small as possible; a numerical sketch follows below.
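As a minimal sketch in NumPy (the 3×2 matrix `A` and vector `b` are assumed illustrative values, not data from the text), the normal equations can be formed and solved directly:

```python
import numpy as np

# Assumed over-determined system: 3 equations, 2 unknowns (illustrative values).
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Form and solve the normal equations A^T A x = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# np.linalg.lstsq solves the same minimization with a more stable SVD-based routine.
x_ref, residuals, rank, svals = np.linalg.lstsq(A, b, rcond=None)

print(x_hat)   # minimizes ||Ax - b||^2
print(x_ref)   # agrees with x_hat up to rounding
```

In practice the `lstsq` route is preferred over forming $A^T A$ explicitly, since squaring the matrix worsens its conditioning.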
2.3 Connection to Matrix Decompositions
- QR Decomposition: The Least Squares solution can also be found using QR decomposition, where $A$ is decomposed into $A = QR$, and the system reduces to solving the triangular system $R\hat{x} = Q^T b$.
- Singular Value Decomposition (SVD): SVD provides a robust method for solving the Least Squares problem, especially when $A$ is ill-conditioned (see the sketch below).
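A brief sketch of both routes, again on assumed example values; `numpy.linalg.qr` and `numpy.linalg.svd` perform the decompositions:

```python
import numpy as np

# Assumed example matrix and right-hand side (illustrative values only).
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# QR route: A = QR (reduced form), then solve the triangular system R x = Q^T b.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# SVD route: x = V diag(1/s) U^T b, robust when A is ill-conditioned.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ b) / s)

print(x_qr, x_svd)   # both match the normal-equations solution
```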
3. Regression Analysis and Least Squares
3.1 What is Regression Analysis?
Regression Analysis is a statistical technique used to model the relationship between a dependent variable (output) and one or more independent variables (inputs). The goal is to find the best-fitting line (or hyperplane in higher dimensions) that predicts the output based on the inputs.
3.2 Linear Regression
Linear Regression is the simplest form of regression, where the relationship between the dependent variable and the independent variables is modeled as a linear function:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon$$
Here:
- $\beta_0, \beta_1, \dots, \beta_p$ represent the coefficients of the linear model.
- $\epsilon$ is the error term; a short simulation of this model is sketched below.
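To make the role of the error term concrete, the sketch below simulates data from a one-variable linear model with assumed "true" coefficients (all numbers are placeholders chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed true coefficients for the sketch: intercept 1.0, slope 2.0.
beta0, beta1 = 1.0, 2.0

x = rng.uniform(0.0, 10.0, size=50)     # one independent variable
eps = rng.normal(0.0, 1.0, size=50)     # error term
y = beta0 + beta1 * x + eps             # linear model plus noise
```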
3.3 Fitting the Model Using Least Squares
The coefficients $\beta$ are estimated using the Least Squares method by minimizing the sum of squared errors between the observed values and the values predicted by the model:
$$\min_{\beta} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 = \min_{\beta} \|y - X\beta\|_2^2$$
This leads to the Normal Equations:
$$X^T X \hat{\beta} = X^T y$$
where $X$ is the design matrix containing the independent variables, and $y$ is the vector of observed values.
Example: Suppose we have data points $(x_i, y_i)$ for $i = 1, \dots, n$. For a single independent variable, the design matrix and observation vector are
$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
The Least Squares solution $\hat{\beta} = (X^T X)^{-1} X^T y$ gives us the coefficients that best fit the linear model to the data.
3.4 Example: Simple Linear Regression
Consider a dataset with $n$ observations $(x_i, y_i)$.
We want to fit a line $y = \beta_0 + \beta_1 x$ to this data.
- Construct the design matrix $X$ (a column of ones for the intercept and a column of the $x$ values) and the vector $y$ of observed responses.
- Solve the normal equations $X^T X \hat{\beta} = X^T y$ for $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1)$.
The best-fit line is then $y = \hat{\beta}_0 + \hat{\beta}_1 x$; a worked numerical sketch on a small dataset follows below.
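A minimal sketch of this fit in NumPy, using a small assumed dataset (the `x` and `y` values are illustrative, not from the text):

```python
import numpy as np

# Assumed toy dataset (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 4.2, 5.1])

# Design matrix: a column of ones (intercept) next to the x values.
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations X^T X beta = X^T y.
beta0, beta1 = np.linalg.solve(X.T @ X, X.T @ y)
print(f"best-fit line: y = {beta0:.3f} + {beta1:.3f} x")
```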
4. Regularization in Least Squares Regression
4.1 Ridge Regression (L2 Regularization)
In Ridge Regression, a penalty proportional to the squared L2 norm of the coefficients is added to the cost function to prevent overfitting:
$$\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$
where $\lambda \geq 0$ is the regularization parameter.
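Ridge retains a closed-form solution: the penalized normal equations become $(X^T X + \lambda I)\hat{\beta} = X^T y$. A minimal NumPy sketch on assumed values (for simplicity the intercept column is penalized along with the other coefficients):

```python
import numpy as np

# Assumed design matrix and targets (illustrative values only).
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.5],
              [1.0, 3.5]])
y = np.array([1.0, 2.0, 2.5, 3.5])
lam = 0.1   # regularization parameter lambda

# Penalized normal equations: (X^T X + lambda * I) beta = X^T y.
n_features = X.shape[1]
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)
print(beta_ridge)
```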
4.2 Lasso Regression (L1 Regularization)
In Lasso Regression, the L1 norm is used as the penalty, which encourages sparsity in the coefficients:
$$\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
Lasso regression is particularly useful when dealing with high-dimensional data, as it tends to produce models with fewer non-zero coefficients.
4.3 Example: Ridge vs. Lasso Regression
Consider a dataset with multicollinearity (highly correlated independent variables). Ridge regression can handle this by shrinking the coefficients, while Lasso regression might zero out some coefficients, leading to a simpler model.
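A sketch of that comparison using scikit-learn's `Ridge` and `Lasso` estimators on simulated correlated features (the data, the `alpha` values, and the described behavior of the coefficients are assumptions for illustration, not results from the text):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)

# Simulate two highly correlated features plus one irrelevant feature.
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)      # nearly collinear with x1
x3 = rng.normal(size=200)                  # unrelated to the target
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge coefficients:", ridge.coef_)   # shrinks x1/x2, spreading the effect between them
print("lasso coefficients:", lasso.coef_)   # tends to zero out one of x1/x2 and the irrelevant x3
```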
5. Applications in Data Science
5.1 Predictive Modeling
Least Squares regression is widely used in predictive modeling, where the goal is to predict outcomes based on input features. Regularization techniques like Ridge and Lasso are crucial for improving model generalization.
5.2 Signal Processing
In signal processing, Least Squares methods are used to estimate the parameters of a signal model, especially when the model is linear in its parameters.
5.3 Economics and Finance
Econometric models often rely on Least Squares regression to analyze relationships between economic variables and to forecast future trends based on historical data.
6. Conclusion
The Least Squares method is a cornerstone of linear algebra and statistics, providing a robust framework for solving over-determined systems and performing regression analysis. Understanding the connection between linear algebra and regression enables data scientists and engineers to build predictive models, analyze data, and solve real-world problems with confidence. Regularization techniques like Ridge and Lasso further enhance the applicability of Least Squares regression, particularly in the presence of multicollinearity and high-dimensional data.