A basic soft-margin kernel SVM implementation in Python
Support Vector Machines (SVMs) are a family of supervised learning algorithms that can train classification and regression models efficiently, and with very good performance in practice.
SVMs are also rooted in convex optimization and Hilbert space theory, and there is a lot of beautiful mathematics in the derivation of various aspects of the training algorithm, which we will go into in subsequent posts.
For now, we'll just give an introduction to the basic theory of soft-margin kernel SVMs. The classical treatment starts with hard-margin linear SVMs and only then introduces the kernel trick and the soft-margin formulation; here we jump straight to the general case, so this presentation is somewhat faster-moving than most.
Mathematical Formulation
We consider our training set to be
\begin{equation} D = \left\{ (\mathbf{x}_{i}, y_{i}) : \mathbf{x}_{i} \in \mathbb{R}^d,\; y_{i} \in \{ -1, 1 \},\; i = 1, \dots, n \right\}. \end{equation}
The key idea is that we seek a hyperplane $w$ separating our data, and we maximize the margin of this hyperplane to optimize decision-theoretic metrics.
Let $\kappa$ be a kernel function on $\mathbb{R}^d \times \mathbb{R}^d$ - a function such that, for any set of points $x_1, \dots, x_n$, the matrix $K$ with $K_{ij} = \kappa(x_i, x_j)$ is positive semidefinite. A key property of such kernel functions is that there exists a map $\nu$ such that $\langle \nu(x), \nu(y) \rangle = \kappa(x, y)$. One can think of $\nu$ as mapping our input features into a higher-dimensional feature space.
We can show that for a given feature mapping $\nu$ satisfying the previous condition, the Lagrangian dual of the problem of finding the maximum-margin hyperplane takes the form:
\begin{equation} \inf_{z \in \mathbb{R}^n} \frac{1}{2} \left\| \sum_{i=1}^{n} y_i \nu(x_i) z_i \right\|_2^2 - e^T z \end{equation} subject to $z \geq 0$ and $\langle z, y \rangle = 0$, where $e$ is the vector of all ones.
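To make the connection to a standard quadratic program explicit, note that the squared norm above expands in terms of the Gram matrix $K$: \begin{equation} \frac{1}{2} \left\| \sum_{i=1}^{n} y_i \nu(x_i) z_i \right\|_2^2 = \frac{1}{2} \sum_{i,j=1}^{n} z_i z_j y_i y_j \kappa(x_i, x_j) = \frac{1}{2} z^T \left( \operatorname{diag}(y) \, K \, \operatorname{diag}(y) \right) z, \end{equation} so the problem is to minimize $\frac{1}{2} z^T P z - e^T z$ with $P = \operatorname{diag}(y) K \operatorname{diag}(y)$ subject to linear constraints - exactly the form accepted by off-the-shelf QP solvers such as cvxopt's solvers.qp.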
Given a resulting vector of Lagrange multipliers $z$, we find that most entries of $z$ are zero. This comes from the complementary slackness conditions of our optimization problem: either $(x_i, y_i)$ lies on the maximum margin (and its Lagrange multiplier may be nonzero), or it lies strictly away from the margin (and its Lagrange multiplier must be zero).
The prediction for a given feature vector $x$ takes the form \begin{align} \label{eq:1} \langle w, \nu(x) \rangle &= \sum_{i=1}^{n} z_{i} y_{i} \langle \nu(x_{i}), \nu(x) \rangle \\ &= \sum_{i=1}^{n} z_{i} y_{i} \kappa(x_{i}, x), \end{align} where we can take the sum over only the non-zero $z_{i}$.
This yields a very efficient prediction algorithm: once we have trained our SVM, a large portion of the training data (those samples with zero Lagrange multipliers) can be discarded.
There are more complications (handling the bias term, handling non-separable datasets), but this is the gist of the algorithm.
Implementation
The full implementation of the training (using cvxopt as a quadratic program solver) in Python is given below:
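A minimal sketch of such a trainer is shown here; the class name SVMTrainer, its train method, and the MIN_SUPPORT_VECTOR_MULTIPLIER threshold are illustrative assumptions rather than the exact svmpy API.

```python
import numpy as np
import cvxopt
import cvxopt.solvers

# Multipliers below this threshold are treated as zero (an assumed, illustrative constant).
MIN_SUPPORT_VECTOR_MULTIPLIER = 1e-5


class SVMTrainer(object):
    def __init__(self, kernel, c):
        self._kernel = kernel  # callable kappa(x_i, x_j)
        self._c = c            # soft-margin penalty parameter C

    def train(self, X, y):
        """Solve the dual QP and return (multipliers, support vectors, their labels)."""
        n_samples, _ = X.shape

        # Gram matrix K_ij = kappa(x_i, x_j)
        K = np.zeros((n_samples, n_samples))
        for i, x_i in enumerate(X):
            for j, x_j in enumerate(X):
                K[i, j] = self._kernel(x_i, x_j)

        # Dual in cvxopt's form: min 1/2 z^T P z + q^T z  s.t.  G z <= h,  A z = b
        P = cvxopt.matrix(np.outer(y, y) * K)
        q = cvxopt.matrix(-np.ones(n_samples))
        # Soft-margin box constraint: 0 <= z_i <= C
        G = cvxopt.matrix(np.vstack((-np.eye(n_samples), np.eye(n_samples))))
        h = cvxopt.matrix(np.hstack((np.zeros(n_samples), self._c * np.ones(n_samples))))
        A = cvxopt.matrix(y.reshape(1, -1).astype(np.double))
        b = cvxopt.matrix(0.0)

        solution = cvxopt.solvers.qp(P, q, G, h, A, b)
        multipliers = np.ravel(solution['x'])

        # Keep only the support vectors (non-negligible multipliers)
        support = multipliers > MIN_SUPPORT_VECTOR_MULTIPLIER
        return multipliers[support], X[support], y[support]
```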
The code is fairly self-explanatory, and follows the given training algorithm quite closely. To compute our Lagrange multipliers, we simply construct the Gram matrix and solve the given QP. We then pass our trained support vectors and their corresponding Lagrange multipliers and weights to the SVMPredictor, whose implementation is given below.
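Again as a sketch with assumed names (and with the bias term omitted for simplicity, per the complication noted earlier), such a predictor might look like:

```python
import numpy as np


class SVMPredictor(object):
    def __init__(self, kernel, support_multipliers, support_vectors, support_vector_labels):
        self._kernel = kernel
        self._multipliers = support_multipliers
        self._vectors = support_vectors
        self._labels = support_vector_labels

    def predict(self, x):
        """Sign of <w, nu(x)> = sum_i z_i y_i kappa(x_i, x), summed over support vectors only."""
        result = 0.0
        for z_i, y_i, x_i in zip(self._multipliers, self._labels, self._vectors):
            result += z_i * y_i * self._kernel(x_i, x)
        return float(np.sign(result))
```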
This simply implements the above prediction equation.
A sample list of kernel functions is given below.
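The following sketch wraps a few common kernels as callables; the Kernel class and its method names are illustrative assumptions, not necessarily the exact svmpy kernel module.

```python
import numpy as np


class Kernel(object):
    """A few common kernels, each returned as a callable kappa(x, y)."""

    @staticmethod
    def linear():
        # kappa(x, y) = <x, y>
        return lambda x, y: np.inner(x, y)

    @staticmethod
    def gaussian(sigma):
        # kappa(x, y) = exp(-||x - y||^2 / (2 sigma^2))
        return lambda x, y: np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

    @staticmethod
    def polynomial(degree, offset=1.0):
        # kappa(x, y) = (<x, y> + offset)^degree
        return lambda x, y: (np.inner(x, y) + offset) ** degree
```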
Demonstration
We demonstrate by drawing pairs of independent standard normal variables as features, and labelling $y_i = \operatorname{sign}(x_{i,1} + x_{i,2})$. This dataset is trivially linearly separable, so we train a linear SVM (where $\kappa(x_i, x_j) = \langle x_i, x_j \rangle$) on the sample data.
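Putting the pieces together, an end-to-end sketch of this demonstration (reusing the illustrative SVMTrainer, SVMPredictor, and Kernel classes from the sketches above, with an assumed sample size and penalty parameter) might look like:

```python
import numpy as np

# Pairs of independent standard normals, labelled by the sign of their sum.
np.random.seed(0)  # fixed seed just so the sketch is reproducible
num_samples = 100
X = np.random.normal(size=(num_samples, 2))
y = np.sign(X.sum(axis=1))

# Linear kernel, so kappa(x_i, x_j) = <x_i, x_j>; c controls the soft margin.
trainer = SVMTrainer(kernel=Kernel.linear(), c=1.0)
multipliers, support_vectors, support_labels = trainer.train(X, y)
predictor = SVMPredictor(Kernel.linear(), multipliers, support_vectors, support_labels)
```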
We then visualize the samples and the decision boundary of the SVM on this dataset, using matplotlib. See this gist for details on the implementation.
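One simple way to draw such a plot (a sketch rather than the gist's exact code, and assuming the predictor, X, and y from the previous snippet) is to evaluate the predictor on a grid and draw filled contours:

```python
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the predictor on a grid to trace out the decision regions.
grid_x, grid_y = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
grid = np.c_[grid_x.ravel(), grid_y.ravel()]
predictions = np.array([predictor.predict(p) for p in grid]).reshape(grid_x.shape)

plt.contourf(grid_x, grid_y, predictions, alpha=0.3, cmap=plt.cm.coolwarm)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolors='k')
plt.title("Linear SVM decision regions")
plt.show()
```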
An example output of this demonstration is given below:
More Information
See the svmpy library on GitHub for all code used in this post.