Theoretical Foundations of t-SNE (t-Distributed Stochastic Neighbor Embedding)

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction technique that has become a standard tool for visualizing high-dimensional data. Developed by Laurens van der Maaten and Geoffrey Hinton in 2008, t-SNE is particularly effective at revealing the structure of data by reducing its dimensions to two or three, which can then be easily visualized. This article explores the mathematical foundations of t-SNE, providing an in-depth understanding of how the algorithm works and why it is so effective.

1. Understanding the Problem

High-dimensional datasets are often difficult to interpret because the relationships between data points are not easily visualized. Dimensionality reduction techniques, like PCA, project data into lower dimensions, but they may not capture the non-linear relationships inherent in the data. t-SNE addresses this by focusing on preserving the local structure of the data, which is crucial for identifying clusters and patterns.

2. Mathematical Formulation

t-SNE builds on the idea that similar points in high-dimensional space should remain close in the lower-dimensional representation, while dissimilar points should be far apart. This is achieved through the following steps:

2.1. Pairwise Similarities in High Dimensions

Given a high-dimensional dataset, t-SNE starts by converting the pairwise Euclidean distances between data points into conditional probabilities that represent similarities. For two data points $x_i$ and $x_j$, the similarity of $x_j$ to $x_i$ is given by:

$$P_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$

Where:

  • $\|x_i - x_j\|^2$ is the squared Euclidean distance between points $x_i$ and $x_j$.
  • $\sigma_i$ is a Gaussian kernel bandwidth parameter specific to point $x_i$.

The joint probability distribution for the pair $(x_i, x_j)$ is then defined as:

$$P_{ij} = \frac{P_{j|i} + P_{i|j}}{2n}$$

Where $n$ is the total number of points in the dataset.
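
As a concrete illustration, here is a minimal NumPy sketch of these two steps. For simplicity it assumes a single fixed bandwidth `sigma` shared by all points; the actual algorithm tunes $\sigma_i$ per point, as described in the perplexity section below.

```python
import numpy as np

def high_dim_affinities(X, sigma=1.0):
    """Symmetrized joint probabilities P_ij for a data matrix X of shape (n, d)."""
    n = X.shape[0]
    # Squared Euclidean distances between all pairs of points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Conditional probabilities P_{j|i}: Gaussian kernel per row, normalized over k != i.
    P_cond = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(P_cond, 0.0)  # exclude the k == i term
    P_cond /= P_cond.sum(axis=1, keepdims=True)
    # Symmetrize: P_ij = (P_{j|i} + P_{i|j}) / (2n).
    return (P_cond + P_cond.T) / (2.0 * n)
```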

2.2. Pairwise Similarities in Low Dimensions

In the lower-dimensional space, t-SNE aims to create a similar probability distribution for the points. For the low-dimensional counterparts $y_i$ and $y_j$ of the original points $x_i$ and $x_j$, the similarity is modeled using a Student's t-distribution with one degree of freedom (which has heavier tails than a Gaussian distribution):

$$Q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$

The heavier tails of the t-distribution allow moderately dissimilar points to be placed farther apart in the lower-dimensional space. This mitigates the "crowding problem": in a low-dimensional map there is not enough room to represent all pairwise distances of a high-dimensional neighborhood faithfully, so without the heavy tails points collapse toward the center of the map.
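
The low-dimensional affinities can be sketched in the same style, where `Y` is the current $n \times 2$ embedding. The unnormalized kernel is returned alongside $Q$ because the gradient in Section 2.4 reuses it.

```python
import numpy as np

def low_dim_affinities(Y):
    """Joint probabilities Q_ij under the Student-t kernel, plus the raw kernel."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv_dist = 1.0 / (1.0 + sq_dists)  # heavy-tailed t kernel with one degree of freedom
    np.fill_diagonal(inv_dist, 0.0)    # exclude i == j pairs
    Q = inv_dist / inv_dist.sum()      # normalize over all pairs k != l
    return Q, inv_dist
```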

2.3. Kullback-Leibler Divergence

To measure how well the lower-dimensional distribution $Q_{ij}$ matches the high-dimensional distribution $P_{ij}$, t-SNE uses the Kullback-Leibler (KL) divergence:

$$\text{KL}(P \| Q) = \sum_{i \neq j} P_{ij} \log \frac{P_{ij}}{Q_{ij}}$$

t-SNE attempts to minimize this divergence, adjusting the positions of the points in the low-dimensional space to make the two distributions as similar as possible.
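
Computed directly, the divergence is a single sum over all pairs of points; in the sketch below a small constant guards against taking the logarithm of zero on the diagonal.

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) summed over all pairs; eps avoids log(0) on the zero diagonals."""
    P = np.maximum(P, eps)
    Q = np.maximum(Q, eps)
    return np.sum(P * np.log(P / Q))
```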

2.4. Optimization

Minimizing the KL divergence is a non-convex optimization problem, typically solved using gradient descent. The gradient of the cost function $C$ with respect to each point position $y_i$ is computed as follows:

$$\frac{\partial C}{\partial y_i} = 4 \sum_{j} \left(P_{ij} - Q_{ij}\right) \left(y_i - y_j\right) \left(1 + \|y_i - y_j\|^2\right)^{-1}$$

This gradient is used to iteratively update the positions of the points in the lower-dimensional space until the cost function converges.
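
The gradient vectorizes cleanly over all points at once. The sketch below computes it with a single matrix product and applies one plain gradient-descent step, reusing `low_dim_affinities` from the previous sketch; the learning rate is an illustrative placeholder, and practical implementations add refinements such as momentum and early exaggeration.

```python
import numpy as np

def gradient(P, Q, Y, inv_dist):
    """Gradient of the KL cost with respect to every embedding point at once."""
    W = (P - Q) * inv_dist  # (P_ij - Q_ij)(1 + ||y_i - y_j||^2)^{-1} for every pair
    # dC/dy_i = 4 * sum_j W_ij (y_i - y_j), written as one matrix product.
    return 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y

def descent_step(P, Y, learning_rate=100.0):
    """One plain gradient-descent update of the embedding Y."""
    Q, inv_dist = low_dim_affinities(Y)
    return Y - learning_rate * gradient(P, Q, Y, inv_dist)
```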

3. Perplexity and Bandwidth

One of the key parameters in t-SNE is the perplexity, which governs the choice of the bandwidth $\sigma_i$ of the Gaussian kernels. The perplexity is a measure of the effective number of neighbors of each point and is typically set between 5 and 50. It is defined as:

$$\text{Perplexity}(P_i) = 2^{H(P_i)}$$

Where $H(P_i)$ is the Shannon entropy of the conditional probability distribution $P_i$:

$$H(P_i) = -\sum_{j} P_{j|i} \log_2 P_{j|i}$$

The bandwidth $\sigma_i$ is found separately for each point, typically by binary search, so that the perplexity of the distribution $P_i$ matches the user-defined value.
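
A minimal sketch of that search for a single point follows; `sq_dists_i` holds the squared distances from point $x_i$ to all other points, and the entropy is measured in bits to match the definition above.

```python
import numpy as np

def find_sigma(sq_dists_i, target_perplexity, tol=1e-5, max_iter=50):
    """Binary-search sigma_i so that Perplexity(P_i) matches the target."""
    lo, hi = 1e-20, 1e20
    sigma = 1.0
    target_entropy = np.log2(target_perplexity)  # Perplexity = 2^H  =>  H = log2(perplexity)
    for _ in range(max_iter):
        p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))  # H(P_i) in bits
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:
            hi = sigma  # too many effective neighbors: shrink sigma
        else:
            lo = sigma  # too few effective neighbors: grow sigma
        sigma = (lo + hi) / 2.0 if hi < 1e20 else sigma * 2.0
    return sigma
```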

4. Computational Complexity

While t-SNE produces highly informative visualizations, it is computationally expensive. The exact algorithm has a time complexity of $O(n^2)$, where $n$ is the number of data points, making it challenging to apply to very large datasets. However, various optimizations, such as Barnes-Hut t-SNE, reduce the complexity to $O(n \log n)$, making the algorithm more scalable.
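
In practice t-SNE is rarely implemented from scratch; scikit-learn's `TSNE`, for instance, uses the Barnes-Hut approximation by default. A minimal usage sketch with random placeholder data:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(1000, 50)  # placeholder: 1000 points in 50 dimensions
embedding = TSNE(n_components=2, perplexity=30,
                 method="barnes_hut").fit_transform(X)
print(embedding.shape)  # (1000, 2)
```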

5. Best Practices for Using t-SNE

5.1. Parameter Tuning

Choosing the right parameters, especially perplexity, is crucial for getting meaningful results from t-SNE. It is advisable to experiment with different perplexity values and observe how they affect the visualization.
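
One simple way to do this is to run t-SNE at several perplexity values and plot the embeddings side by side; the sketch below assumes matplotlib and scikit-learn are available and uses random placeholder data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

X = np.random.rand(500, 20)  # placeholder data
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perp in zip(axes, [5, 30, 50]):
    Y = TSNE(n_components=2, perplexity=perp).fit_transform(X)
    ax.scatter(Y[:, 0], Y[:, 1], s=4)
    ax.set_title(f"perplexity = {perp}")
plt.show()
```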

5.2. Data Preprocessing

Preprocessing the data by normalizing or standardizing it can significantly affect the results: because t-SNE builds its similarities from Euclidean distances, features on large scales will dominate those distances unless the data is appropriately scaled before applying t-SNE.
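
A typical choice is standardization to zero mean and unit variance per feature, for example with scikit-learn's `StandardScaler` (again with placeholder data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

X = np.random.rand(500, 20)                   # placeholder data
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
embedding = TSNE(n_components=2, perplexity=30).fit_transform(X_scaled)
```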

5.3. Interpretation of Results

Interpreting t-SNE plots requires caution. t-SNE is effective at preserving local structures, but the distances between clusters in the final plot may not correspond to actual distances in the original high-dimensional space.

6. Conclusion

t-SNE is a powerful tool for visualizing high-dimensional data, enabling the discovery of patterns, clusters, and relationships that may not be apparent with other techniques. Its mathematical foundation, based on preserving local neighborhoods, makes it particularly effective for exploring the underlying structure of complex datasets. However, its computational demands and sensitivity to parameter settings require careful consideration and understanding to achieve optimal results.