Support Vector Machines (SVM) Theory

In this article, we will dive deeper into the theory behind Support Vector Machines (SVMs), focusing on the mathematical foundation that makes SVMs so powerful for both linear and nonlinear classification tasks. We will cover:

  • The concept of hyperplanes and support vectors.
  • Maximizing the margin between classes.
  • The optimization problem solved by SVM.
  • Soft margins and the regularization parameter (C).
  • The kernel trick for nonlinear classification.

1. Hyperplanes and Support Vectors

1.1. Hyperplane

A hyperplane is a decision boundary that separates different classes in the feature space. In a 2D space, the hyperplane is a line, while in higher dimensions, it's a plane or a higher-dimensional analog.

For a binary classification problem, the hyperplane can be defined as:

$$w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = 0$$

Where:

  • $w_1, w_2, \dots, w_n$ are the weights (or coefficients) associated with the input features.
  • $b$ is the bias term (intercept).
  • $x_1, x_2, \dots, x_n$ are the feature values of the data points.

A data point $x$ can be classified based on which side of the hyperplane it lies on:

  • If $w^T x + b > 0$, classify as class 1.
  • If $w^T x + b < 0$, classify as class 0.
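
As a quick illustration, here is a minimal sketch of this decision rule in NumPy. The weight vector, bias, and points below are made-up values, not learned from data:

```python
import numpy as np

def predict(w, b, x):
    """Classify a point by the side of the hyperplane w^T x + b = 0 it falls on."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical hyperplane: 2*x1 - x2 + 1 = 0
w = np.array([2.0, -1.0])
b = 1.0

print(predict(w, b, np.array([1.0, 0.5])))   # 2.5 > 0  -> class 1
print(predict(w, b, np.array([-2.0, 1.0])))  # -4.0 < 0 -> class 0
```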

1.2. Support Vectors

Support vectors are the data points that lie closest to the hyperplane. These points are critical because the position of the hyperplane depends only on the support vectors, not on the other data points. The support vectors define the margin and play a central role in determining the optimal decision boundary.


2. Maximizing the Margin

One of the key principles of SVM is to maximize the margin between the two classes. The margin is the distance between the hyperplane and the nearest data points (support vectors) from each class. By maximizing this margin, the SVM aims to improve the generalization ability of the model, reducing the chance of overfitting.

Definition of Margin:

Under the standard convention that the closest points (the support vectors) satisfy $|w^T x + b| = 1$, the margin, i.e. the distance between the two parallel hyperplanes passing through the support vectors of each class, is:

$$\text{Margin} = \frac{2}{\|w\|}$$

Where $\|w\|$ is the norm (magnitude) of the weight vector $w$. The goal of SVM is to maximize this margin, which is equivalent to minimizing $\|w\|$ (the size of the weight vector).
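
As a small worked example (with a made-up weight vector), the margin can be computed directly:

```python
import numpy as np

# Hypothetical weight vector of a trained linear SVM
w = np.array([3.0, 4.0])

margin = 2 / np.linalg.norm(w)  # ||w|| = 5, so the margin is 0.4
print(margin)
```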


3. The Optimization Problem

The core objective of an SVM is to find the hyperplane that maximizes the margin while minimizing classification errors. This leads to the following optimization problem:

Primal Problem (Hard Margin SVM):

For perfectly separable data, the optimization problem is:

$$\min_{w, b} \; \frac{1}{2} \|w\|^2$$

Subject to the following constraints for each data point ii:

  • $y_i (w^T x_i + b) \geq 1$

Where:

  • $y_i$ is the actual label (either $+1$ or $-1$).
  • $x_i$ is the feature vector of the $i$-th data point.

This optimization problem ensures that each data point is correctly classified, while the term $\frac{1}{2} \|w\|^2$ is minimized to achieve the largest possible margin.
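
To make the primal problem concrete, here is a minimal sketch that solves the hard-margin QP directly with the cvxpy library on a tiny, made-up, linearly separable dataset. This is for illustration only; in practice a dedicated SVM solver is used:

```python
import cvxpy as cp
import numpy as np

# Tiny, made-up, linearly separable dataset
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(X.shape[1])
b = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin =", 2 / np.linalg.norm(w.value))
```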


4. Soft Margin SVM (Handling Misclassification)

In most real-world datasets, the data is not perfectly separable, meaning some data points will inevitably be misclassified. To handle this, soft margin SVM allows some violations of the margin. These violations are controlled by introducing slack variables $\xi_i$, which measure how much each data point violates the margin.

Soft Margin Optimization Problem:

The soft margin optimization problem is formulated as:

$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$$

Subject to the constraints:

  • $y_i (w^T x_i + b) \geq 1 - \xi_i$
  • $\xi_i \geq 0$

Where:

  • $\xi_i$ are the slack variables that allow misclassification.
  • $C$ is the regularization parameter that controls the trade-off between maximizing the margin and minimizing classification errors.

Role of Regularization Parameter (C):

  • Small $C$: Encourages a wider margin and tolerates more margin violations, which tends to generalize better but may give a higher error on the training set.
  • Large $C$: Penalizes misclassification heavily, leading to a narrower margin and a model that may overfit the training data.

The parameter $C$ must be tuned carefully to balance model complexity and generalization.
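
A quick way to see the effect of $C$ in practice is to fit scikit-learn's SVC with several values and compare training and test accuracy. The dataset below is randomly generated purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic dataset for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    print(f"C={C}: train acc={clf.score(X_train, y_train):.2f}, "
          f"test acc={clf.score(X_test, y_test):.2f}")
```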


5. The Kernel Trick

One of the key strengths of SVM is its ability to handle nonlinear decision boundaries using the kernel trick. The kernel function allows the SVM to operate in a higher-dimensional space without explicitly computing the coordinates of the data in that space. This enables the algorithm to create nonlinear decision boundaries in the original feature space.

Common Kernel Functions:

  1. Linear Kernel:

    • Used when the data is linearly separable.
    • Kernel function: $K(x_i, x_j) = x_i^T x_j$
  2. Polynomial Kernel:

    • Adds polynomial features to the model.
    • Kernel function: $K(x_i, x_j) = (x_i^T x_j + c)^d$
  3. Radial Basis Function (RBF) Kernel:

    • Used for highly nonlinear data; it implicitly maps the data into an infinite-dimensional feature space.
    • Kernel function: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$
  4. Sigmoid Kernel:

    • Related to the activation function used in neural networks, but less commonly used in practice.
    • Kernel function: $K(x_i, x_j) = \tanh(\alpha x_i^T x_j + c)$
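
In scikit-learn's SVC, these kernel parameters map roughly onto the kernel, degree, gamma, and coef0 arguments. The values below are illustrative choices, not recommendations:

```python
from sklearn.svm import SVC

# c and d of the polynomial kernel correspond to coef0 and degree
poly_svm = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0)

# gamma of the RBF kernel corresponds to the gamma argument
rbf_svm = SVC(kernel="rbf", gamma=0.5, C=1.0)

# alpha and c of the sigmoid kernel correspond to gamma and coef0
sigmoid_svm = SVC(kernel="sigmoid", gamma=0.1, coef0=0.0, C=1.0)
```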

Why the Kernel Trick?

Instead of explicitly mapping the data to a higher-dimensional space, the kernel trick allows SVM to compute the dot product of the data points in that space without ever computing their coordinates. This greatly reduces the computational complexity and enables the SVM to efficiently handle nonlinear classification problems.
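
The following sketch makes this concrete for the degree-2 polynomial kernel: the kernel value computed in the original 2D space matches the dot product of an explicit (hand-written) feature map, without the SVM ever needing to build that map. The feature map phi below is an illustrative construction, not something the algorithm actually computes:

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel K(x, z) = (x^T z + c)^d, computed in the original space."""
    return (np.dot(x, z) + c) ** d

def phi(x, c=1.0):
    """Explicit degree-2 feature map for a 2D input (for comparison only)."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z))       # kernel value, no explicit mapping
print(np.dot(phi(x), phi(z)))  # same value via the explicit feature map
```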


6. Dual Formulation of SVM

SVMs can also be formulated in their dual form, which is especially useful when applying kernel functions. In the dual form, the optimization problem is rewritten in terms of Lagrange multipliers $\alpha_i$, and the kernel function $K(x_i, x_j)$ is introduced.

Dual Problem:

The dual optimization problem is:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

Subject to:

  • $0 \leq \alpha_i \leq C$
  • $\sum_{i=1}^{n} \alpha_i y_i = 0$

Where $\alpha_i$ are the Lagrange multipliers, and $K(x_i, x_j)$ is the kernel function. The dual form is advantageous because it depends only on the dot products (or kernel values) between pairs of data points.
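
In practice, a fitted scikit-learn SVC exposes exactly these dual quantities: the support vectors, the products $\alpha_i y_i$ (dual_coef_), and the bias. The dataset below is synthetic and used only to show the attributes:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data for illustration
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("number of support vectors:", clf.support_vectors_.shape[0])
print("alpha_i * y_i for the first 5 support vectors:", clf.dual_coef_[0, :5])
print("bias b:", clf.intercept_[0])
```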


7. SVM for Regression (Support Vector Regression)

SVM can also be adapted for regression tasks using Support Vector Regression (SVR). In SVR, the objective is to fit a function so that predictions deviate from the targets by no more than a specified margin $\varepsilon$ (the epsilon-insensitive tube), while keeping the model as simple as possible. The optimization problem has the same structure as the classification SVM but penalizes deviations outside the tube rather than margin violations.
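
As a brief sketch, scikit-learn's SVR exposes the tube width through its epsilon parameter; the noisy sine data here is made up purely for illustration:

```python
import numpy as np
from sklearn.svm import SVR

# Made-up 1D regression data: a noisy sine wave
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# epsilon sets the width of the tube within which errors are ignored
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(svr.predict(X[:5]))
```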


Summary

In this article, we explored the mathematical foundation of Support Vector Machines (SVMs), covering key concepts such as:

  • Hyperplanes and support vectors.
  • Maximizing the margin to improve generalization.
  • The soft margin formulation for handling misclassification.
  • The kernel trick for nonlinear classification.
  • The dual formulation of SVM, which is especially useful when applying kernel functions.

Understanding the theory behind SVM is crucial for tuning the model's parameters and choosing the right kernel for your data. In the next section, we will cover practical examples of applying SVM using popular machine learning libraries like scikit-learn, TensorFlow, and PyTorch.