A basic soft-margin kernel SVM implementation in Python
Support Vector Machines (SVMs) are a family of supervised learning algorithms that can train classification and regression models efficiently, and with very good performance in practice.
SVMs are also rooted in convex optimization and Hilbert space theory, and there is a lot of beautiful mathematics in the derivation of various aspects of the training algorithm, which we will go into in subsequent posts.
For now, we'll just give an introduction to the basic theory of soft-margin kernel SVMs. The classical treatment starts with hard-margin linear SVMs and only then introduces the kernel trick and the soft-margin formulation; here we jump straight to the general case, so this presentation is somewhat faster-moving than most.
Mathematical Formulation
We consider our training set to be
\begin{equation} D = \left\{ (\mathbf{x}_{i}, y_{i}) : \mathbf{x}_{i} \in \mathbb{R}^d,\; y_{i} \in \{ -1, 1 \},\; i = 1, \dots, n \right\}. \end{equation}
The key idea is that we seek a hyperplane $w$ separating our data, and we maximize the margin of this hyperplane to optimize decision-theoretic metrics.
Let $\kappa$ be a kernel function on $\mathbb{R}^d \times \mathbb{R}^d$ - a function such that, for any set of points $x_1, \dots, x_n$, the matrix $K$ with $K_{ij} = \kappa(x_i, x_j)$ is positive semidefinite. A key property of such kernel functions is that there exists a map $\nu$ such that $\langle \nu(x), \nu(y) \rangle = \kappa(x, y)$. One can think of $\nu$ as mapping our input features into a higher-dimensional feature space.
We can show that for a given feature mapping $\nu$ satisfying the previous condition, the Lagrangian dual of the problem of finding the maximum-margin hyperplane takes the form:
\begin{equation} \inf_{z \in \mathbb{R}^n} \frac{1}{2} \left\| \sum_{i=1}^{n} y_i \nu(x_i) z_i \right\|_2^2 - e^T z \end{equation} subject to $z \geq 0$ and $\langle z, y \rangle = 0$, where $e$ is the vector of all ones.
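To make the connection to a standard quadratic program explicit, note that the squared norm above expands in terms of the Gram matrix $K$: \begin{equation} \frac{1}{2} \left\| \sum_{i=1}^{n} y_i \nu(x_i) z_i \right\|_2^2 = \frac{1}{2} \sum_{i,j=1}^{n} z_i z_j y_i y_j \kappa(x_i, x_j) = \frac{1}{2} z^T \left( \operatorname{diag}(y) \, K \, \operatorname{diag}(y) \right) z, \end{equation} so the problem is to minimize $\frac{1}{2} z^T P z - e^T z$ with $P = \operatorname{diag}(y) K \operatorname{diag}(y)$ subject to linear constraints - exactly the form accepted by off-the-shelf QP solvers such as cvxopt's solvers.qp.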
Given a resulting vector of Lagrange multipliers $z$, we find that most entries of $z$ are zero. This comes from the complementary slackness conditions of our optimization problem: either $(x_i, y_i)$ lies on the maximum margin (and its Lagrange multiplier may be nonzero), or it lies strictly away from the margin (and its Lagrange multiplier must be zero).
The prediction for a given feature vector $x$ takes the form \begin{align} \label{eq:1} \langle w, \nu(x) \rangle &= \sum_{i=1}^{n} z_{i} y_{i} \langle \nu(x_{i}), \nu(x) \rangle \\ &= \sum_{i=1}^{n} z_{i} y_{i} \kappa(x_{i}, x), \end{align} where we can take the sum over only the non-zero $z_{i}$.
This yields a very efficient prediction algorithm: once we have trained our SVM, a large portion of the training data (those samples with zero Lagrange multipliers) can be discarded.
There are more complications (handling the bias term, handling non-separable datasets), but this is the gist of the algorithm.
Implementation
The full implementation of the training (using cvxopt as a quadratic program solver) in Python is given below:
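A minimal sketch of such a trainer is shown here; the class name SVMTrainer, its train method, and the MIN_SUPPORT_VECTOR_MULTIPLIER threshold are illustrative assumptions rather than the exact svmpy API.

```python
import numpy as np
import cvxopt
import cvxopt.solvers

# Multipliers below this threshold are treated as zero (an assumed, illustrative constant).
MIN_SUPPORT_VECTOR_MULTIPLIER = 1e-5


class SVMTrainer(object):
    def __init__(self, kernel, c):
        self._kernel = kernel  # callable kappa(x_i, x_j)
        self._c = c            # soft-margin penalty parameter C

    def train(self, X, y):
        """Solve the dual QP and return (multipliers, support vectors, their labels)."""
        n_samples, _ = X.shape

        # Gram matrix K_ij = kappa(x_i, x_j)
        K = np.zeros((n_samples, n_samples))
        for i, x_i in enumerate(X):
            for j, x_j in enumerate(X):
                K[i, j] = self._kernel(x_i, x_j)

        # Dual in cvxopt's form: min 1/2 z^T P z + q^T z  s.t.  G z <= h,  A z = b
        P = cvxopt.matrix(np.outer(y, y) * K)
        q = cvxopt.matrix(-np.ones(n_samples))
        # Soft-margin box constraint: 0 <= z_i <= C
        G = cvxopt.matrix(np.vstack((-np.eye(n_samples), np.eye(n_samples))))
        h = cvxopt.matrix(np.hstack((np.zeros(n_samples), self._c * np.ones(n_samples))))
        A = cvxopt.matrix(y.reshape(1, -1).astype(np.double))
        b = cvxopt.matrix(0.0)

        solution = cvxopt.solvers.qp(P, q, G, h, A, b)
        multipliers = np.ravel(solution['x'])

        # Keep only the support vectors (non-negligible multipliers)
        support = multipliers > MIN_SUPPORT_VECTOR_MULTIPLIER
        return multipliers[support], X[support], y[support]
```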
The code is fairly self-explanatory, and follows the given training algorithm quite closely. To compute our Lagrange multipliers, we simply construct the Gram matrix and solve the given QP. We then pass our trained support vectors and their corresponding Lagrange multipliers and weights to the SVMPredictor, whose implementation is given below.
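Again as a sketch with assumed names (and with the bias term omitted for simplicity, per the complication noted earlier), such a predictor might look like:

```python
import numpy as np


class SVMPredictor(object):
    def __init__(self, kernel, support_multipliers, support_vectors, support_vector_labels):
        self._kernel = kernel
        self._multipliers = support_multipliers
        self._vectors = support_vectors
        self._labels = support_vector_labels

    def predict(self, x):
        """Sign of <w, nu(x)> = sum_i z_i y_i kappa(x_i, x), summed over support vectors only."""
        result = 0.0
        for z_i, y_i, x_i in zip(self._multipliers, self._labels, self._vectors):
            result += z_i * y_i * self._kernel(x_i, x)
        return float(np.sign(result))
```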
This simply implements the above prediction equation.
A sample list of kernel functions is given below.
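The following sketch wraps a few common kernels as callables; the Kernel class and its method names are illustrative assumptions, not necessarily the exact svmpy kernel module.

```python
import numpy as np


class Kernel(object):
    """A few common kernels, each returned as a callable kappa(x, y)."""

    @staticmethod
    def linear():
        # kappa(x, y) = <x, y>
        return lambda x, y: np.inner(x, y)

    @staticmethod
    def gaussian(sigma):
        # kappa(x, y) = exp(-||x - y||^2 / (2 sigma^2))
        return lambda x, y: np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

    @staticmethod
    def polynomial(degree, offset=1.0):
        # kappa(x, y) = (<x, y> + offset)^degree
        return lambda x, y: (np.inner(x, y) + offset) ** degree
```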
Demonstration
We demonstrate by drawing pairs of independent standard normal variables as features, and labelling $y_i = \operatorname{sign}(x_{i,1} + x_{i,2})$. This dataset is trivially linearly separable, so we train a linear SVM (where $\kappa(x_i, x_j) = \langle x_i, x_j \rangle$) on the sample data.
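Putting the pieces together, an end-to-end sketch of this demonstration (reusing the illustrative SVMTrainer, SVMPredictor, and Kernel classes from the sketches above, with an assumed sample size and penalty parameter) might look like:

```python
import numpy as np

# Pairs of independent standard normals, labelled by the sign of their sum.
np.random.seed(0)  # fixed seed just so the sketch is reproducible
num_samples = 100
X = np.random.normal(size=(num_samples, 2))
y = np.sign(X.sum(axis=1))

# Linear kernel, so kappa(x_i, x_j) = <x_i, x_j>; c controls the soft margin.
trainer = SVMTrainer(kernel=Kernel.linear(), c=1.0)
multipliers, support_vectors, support_labels = trainer.train(X, y)
predictor = SVMPredictor(Kernel.linear(), multipliers, support_vectors, support_labels)
```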
We then visualize the samples and the decision boundary of the SVM on this dataset, using matplotlib. See this gist for details on the implementation.
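One simple way to draw such a plot (a sketch rather than the gist's exact code, and assuming the predictor, X, and y from the previous snippet) is to evaluate the predictor on a grid and draw filled contours:

```python
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the predictor on a grid to trace out the decision regions.
grid_x, grid_y = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
grid = np.c_[grid_x.ravel(), grid_y.ravel()]
predictions = np.array([predictor.predict(p) for p in grid]).reshape(grid_x.shape)

plt.contourf(grid_x, grid_y, predictions, alpha=0.3, cmap=plt.cm.coolwarm)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolors='k')
plt.title("Linear SVM decision regions")
plt.show()
```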
An example output of this demonstration is given below:
More Information
See the svmpy library on GitHub for all code used in this post.