
Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a powerful technique in statistics and machine learning used for both classification and dimensionality reduction. Unlike PCA, which focuses on maximizing variance without considering class labels, LDA seeks to find a linear combination of features that best separates two or more classes. This article delves into the mathematical foundations of LDA, providing detailed explanations and examples to help you understand how it works and why it’s effective.


1. Introduction to Linear Discriminant Analysis

1.1 What is LDA?

Linear Discriminant Analysis is a supervised learning algorithm primarily used for classification and dimensionality reduction. The goal of LDA is to project the data onto a lower-dimensional space where the separation between classes is maximized.

1.2 Comparison with PCA

While Principal Component Analysis (PCA) focuses on maximizing variance in the data without considering any class labels, LDA takes class labels into account and tries to find the axes that maximize the separation between multiple classes. In essence, LDA is a linear technique that seeks to find a projection that maximizes the distance between means of different classes while minimizing the scatter within each class.


2. Mathematical Foundation of LDA

2.1 The Concept of Discriminants

Given a dataset with n samples, where each sample x_i belongs to one of K classes, LDA seeks to find a linear combination of the input features that separates the classes as much as possible. This is achieved by finding the linear discriminants that maximize the ratio of between-class variance to within-class variance.

2.2 Between-Class and Within-Class Variance

2.2.1 Within-Class Scatter Matrix (S_W)

The within-class scatter matrix measures how much the samples of each class spread out around their own class mean. Summing this scatter over all K classes, it is defined as:

\mathbf{S_W} = \sum_{k=1}^{K} \sum_{i=1}^{n_k} (\mathbf{x}_i^{(k)} - \boldsymbol{\mu}_k)(\mathbf{x}_i^{(k)} - \boldsymbol{\mu}_k)^T

Where:

  • n_k is the number of samples in class k.
  • x_i^(k) is the i-th sample in class k.
  • μ_k is the mean vector of class k.
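
As a concrete reference, here is a minimal NumPy sketch of this computation. It assumes a feature matrix X of shape (n_samples, n_features) and an integer label array y; the function name is only illustrative.

```python
import numpy as np

def within_class_scatter(X, y):
    """Sum, over all classes, of each sample's scatter around its class mean."""
    n_features = X.shape[1]
    S_W = np.zeros((n_features, n_features))
    for k in np.unique(y):
        X_k = X[y == k]                   # samples belonging to class k
        mu_k = X_k.mean(axis=0)           # class mean vector
        deviations = X_k - mu_k           # shape (n_k, n_features)
        S_W += deviations.T @ deviations  # sum of outer products
    return S_W
```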

2.2.2 Between-Class Scatter Matrix (S_B)

The between-class scatter matrix measures how much the class means deviate from the overall mean. It is defined as:

\mathbf{S_B} = \sum_{k=1}^{K} n_k (\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T

Where:

  • μ is the overall mean vector of all samples across classes.
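
The between-class scatter can be computed with the same conventions; again, this is a sketch assuming the placeholder arrays X and y used above.

```python
import numpy as np

def between_class_scatter(X, y):
    """Scatter of the class means around the overall mean, weighted by class size."""
    mu = X.mean(axis=0)                   # overall mean vector
    n_features = X.shape[1]
    S_B = np.zeros((n_features, n_features))
    for k in np.unique(y):
        X_k = X[y == k]
        n_k = X_k.shape[0]                # number of samples in class k
        mu_k = X_k.mean(axis=0)           # class mean vector
        S_B += n_k * np.outer(mu_k - mu, mu_k - mu)
    return S_B
```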

2.3 Objective Function of LDA

The objective of LDA is to maximize the ratio of the between-class scatter to the within-class scatter:

\text{argmax}_{\mathbf{w}} \left( \frac{\mathbf{w}^T \mathbf{S_B} \mathbf{w}}{\mathbf{w}^T \mathbf{S_W} \mathbf{w}} \right)

Where:

  • w is the vector that defines the linear combination (projection direction) of the features.

This ratio is known as the Fisher criterion. The vector w that maximizes this ratio gives us the best linear discriminant for separating the classes.
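
To make the criterion concrete, the following small helper evaluates the Fisher ratio for a candidate direction w, assuming the scatter matrices have already been computed (for instance with the sketches above).

```python
import numpy as np

def fisher_criterion(w, S_B, S_W):
    """Ratio of projected between-class scatter to projected within-class scatter."""
    w = np.asarray(w, dtype=float)
    return (w @ S_B @ w) / (w @ S_W @ w)
```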

2.4 Solving the LDA Optimization Problem

The optimization problem can be solved by finding the eigenvectors and eigenvalues of the matrix S_W^{-1} S_B. The eigenvectors corresponding to the largest eigenvalues form the axes that maximize class separability.

To find these eigenvectors, solve the following generalized eigenvalue problem:

\mathbf{S_B} \mathbf{w} = \lambda \mathbf{S_W} \mathbf{w}

Where λ represents the eigenvalues.

The solution involves the following steps:

  1. Compute the scatter matrices S_W and S_B.
  2. Solve the generalized eigenvalue problem to find the eigenvectors and eigenvalues.
  3. Select the eigenvectors corresponding to the largest eigenvalues to form the linear discriminants; since S_B has rank at most K-1, at most K-1 of them carry useful information (steps 2 and 3 are sketched in code below).
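
These steps can be carried out directly with NumPy, for example as in the sketch below. A pseudo-inverse is used in place of an explicit inverse because S_W can be singular or ill-conditioned in practice (for example when features outnumber samples); this is an implementation choice, not part of the definition.

```python
import numpy as np

def lda_directions(S_W, S_B, n_components):
    """Return the top `n_components` linear discriminants as columns of a matrix."""
    # Eigen-decompose S_W^{-1} S_B; pinv guards against a singular S_W.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]  # indices of decreasing eigenvalues
    # For K classes, at most K - 1 eigenvalues are non-zero, so choose n_components <= K - 1.
    return eigvecs[:, order[:n_components]].real
```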

3. Example: Applying LDA to a Simple Dataset

3.1 Constructing the Dataset

Consider a simple dataset with two classes. Each class is represented by a set of points in a two-dimensional space.

\text{Class 1: } \mathbf{X}_1 = \begin{pmatrix} 2 & 3 \\ 3 & 4 \\ 4 & 5 \end{pmatrix}, \quad \text{Class 2: } \mathbf{X}_2 = \begin{pmatrix} 6 & 7 \\ 7 & 8 \\ 8 & 9 \end{pmatrix}

3.2 Calculating Class Means

Compute the mean vectors for each class:

\boldsymbol{\mu}_1 = \frac{1}{3} \sum_{i=1}^{3} \mathbf{x}_i^{(1)} = \begin{pmatrix} 3 \\ 4 \end{pmatrix}, \quad \boldsymbol{\mu}_2 = \frac{1}{3} \sum_{i=1}^{3} \mathbf{x}_i^{(2)} = \begin{pmatrix} 7 \\ 8 \end{pmatrix}
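
These numbers are easy to verify with a few lines of NumPy; the sketch below simply encodes the two classes and prints their means.

```python
import numpy as np

X1 = np.array([[2, 3], [3, 4], [4, 5]], dtype=float)  # Class 1
X2 = np.array([[6, 7], [7, 8], [8, 9]], dtype=float)  # Class 2

mu1 = X1.mean(axis=0)  # -> [3. 4.]
mu2 = X2.mean(axis=0)  # -> [7. 8.]
print(mu1, mu2)
```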

3.3 Calculating Scatter Matrices

Compute the within-class scatter matrix:

\mathbf{S_W} = \sum_{k=1}^{2} \sum_{i=1}^{3} (\mathbf{x}_i^{(k)} - \boldsymbol{\mu}_k)(\mathbf{x}_i^{(k)} - \boldsymbol{\mu}_k)^T

Compute the between-class scatter matrix:

\mathbf{S_B} = \sum_{k=1}^{2} 3\, (\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T
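
Evaluating these sums for the toy data gives S_W = [[4, 4], [4, 4]] and S_B = [[24, 24], [24, 24]]. Note that S_W is singular here because the three points in each class are collinear, which is why the next step is best carried out with a pseudo-inverse. A self-contained sketch:

```python
import numpy as np

X1 = np.array([[2, 3], [3, 4], [4, 5]], dtype=float)
X2 = np.array([[6, 7], [7, 8], [8, 9]], dtype=float)
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
mu = np.vstack([X1, X2]).mean(axis=0)  # overall mean [5. 6.]

# Within-class scatter: sum of outer products of deviations from each class mean.
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)  # [[4. 4.], [4. 4.]]

# Between-class scatter: class sizes times outer products of mean differences.
S_B = 3 * np.outer(mu1 - mu, mu1 - mu) + 3 * np.outer(mu2 - mu, mu2 - mu)  # [[24. 24.], [24. 24.]]
print(S_W, S_B)
```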

3.4 Solving the Eigenvalue Problem

Solve the eigenvalue problem:

\mathbf{S_B} \mathbf{w} = \lambda \mathbf{S_W} \mathbf{w}

Select the eigenvector corresponding to the largest eigenvalue as the linear discriminant.

3.5 Projecting Data onto the LDA Axis

Finally, project the original data onto the LDA axis using the selected eigenvector:

\mathbf{X}_{\text{LDA}} = \mathbf{X} \mathbf{w}

This projection maximizes the separation between the two classes along the new axis defined by w.
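
Putting Sections 3.4 and 3.5 together, the sketch below solves the eigenvalue problem for this example and projects all six points. Because S_W is singular for this toy data, a pseudo-inverse is used; the resulting discriminant is proportional to (1, 1) (up to sign), and the two classes land in clearly separated groups on the new axis.

```python
import numpy as np

X1 = np.array([[2, 3], [3, 4], [4, 5]], dtype=float)
X2 = np.array([[6, 7], [7, 8], [8, 9]], dtype=float)
X = np.vstack([X1, X2])
mu1, mu2, mu = X1.mean(axis=0), X2.mean(axis=0), X.mean(axis=0)

S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
S_B = 3 * np.outer(mu1 - mu, mu1 - mu) + 3 * np.outer(mu2 - mu, mu2 - mu)

# Pseudo-inverse because S_W is singular for this collinear toy data.
eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
w = eigvecs[:, np.argmax(eigvals.real)].real  # top discriminant, ~ (1, 1)/sqrt(2)

X_lda = X @ w  # 1-D projection; the first three values sit well apart from the last three
print(X_lda)
```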


4. Applications of LDA

4.1 Dimensionality Reduction

LDA can be used for dimensionality reduction by projecting data onto the space spanned by the top K-1 linear discriminants. This is particularly useful when the goal is to reduce the number of features while preserving class separability.

4.2 Classification

LDA is also commonly used for classification. By projecting data onto the LDA axes, you can reduce the dimensionality of the dataset and then separate the classes effectively with a simple linear decision rule.
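
In practice, both uses are rarely coded from scratch; scikit-learn's LinearDiscriminantAnalysis estimator covers them, and the upcoming articles look at it in depth. The snippet below is only a preview on synthetic two-class data (the Gaussian blobs are placeholders, not data from this article).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two synthetic Gaussian blobs standing in for two classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 4)),
               rng.normal(3.0, 1.0, size=(50, 4))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)  # at most K - 1 components
X_reduced = lda.fit_transform(X, y)               # dimensionality reduction onto the LDA axis
accuracy = lda.score(X, y)                        # LDA also acts directly as a classifier
print(X_reduced.shape, accuracy)
```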

4.3 Feature Extraction

In addition to dimensionality reduction, LDA can be used for feature extraction. The linear discriminants found by LDA can be treated as new features that capture the most important information for class separation.


5. Limitations of LDA

5.1 Linearity Assumption

LDA assumes that the relationships between features are linear and that the classes are linearly separable. In cases where these assumptions do not hold, LDA may not perform well.

5.2 Homoscedasticity Assumption

LDA assumes that all classes have the same covariance matrix (homoscedasticity). If this assumption is violated, LDA may not effectively separate the classes.

5.3 Sensitivity to Outliers

LDA can be sensitive to outliers, as they can significantly affect the scatter matrices and, consequently, the resulting linear discriminants.


6. Conclusion

6.1 Recap of Key Concepts

Linear Discriminant Analysis (LDA) is a powerful technique for both classification and dimensionality reduction. By maximizing the ratio of between-class variance to within-class variance, LDA finds the linear combinations of features that best separate the classes. Understanding the mathematical foundation of LDA, including scatter matrices and eigenvalue problems, is crucial for effectively applying this technique.

6.2 Next Steps

With a solid understanding of LDA's mathematical foundations, you are now ready to explore its practical implementation using Scikit-learn in upcoming articles. You will learn how to apply LDA for dimensionality reduction and classification in real-world datasets, reinforcing the concepts discussed here.


Linear Discriminant Analysis (LDA) is an essential tool in the data scientist's toolkit, offering a robust method for separating classes in a lower-dimensional space. Mastering LDA involves not only understanding its mathematical underpinnings but also knowing when and how to apply it to achieve the best results.