A basic soft-margin kernel SVM implementation in Python

Support Vector Machines (SVMs) are a family of supervised learning algorithms that train classification and regression models efficiently and with very good performance in practice.

SVMs are also rooted in convex optimization and Hilbert space theory, and there is a lot of beautiful mathematics in the derivation of various aspects of the training algorithm, which we will go into in subsequent posts.

For now, we'll just give an introduction to the basic theory of soft-margin kernel SVMs. The classical treatment is to start with hard-margin linear SVMs and only then introduce the kernel trick and the soft-margin formulation; since we jump straight to the soft-margin kernel case, this presentation is somewhat faster-moving than others.

Mathematical Formulation

We consider our training set to be

$$D = \{(x_i, y_i)\}_{i=1}^{n}, \qquad x_i \in \mathbb{R}^d, \quad y_i \in \{-1, 1\}.$$

The key idea is that we seek a hyperplane, with normal vector $w$, separating our data - and we maximize the margin of this hyperplane to improve generalization.

Let $\kappa$ be a kernel function on $\mathbb{R}^d \times \mathbb{R}^d$ - a function such that the matrix $K$ with $K_{ij} = \kappa(x_i, x_j)$ is positive semidefinite. A key property of such kernel functions is that there exists a map $\nu$ such that $\langle \nu(x), \nu(y) \rangle = \kappa(x, y)$. One can think of $\nu$ as mapping our input features into a higher-dimensional output space.
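
As a quick illustration of the positive-semidefiniteness condition, we can build the Gram matrix of a Gaussian kernel on some random data and check that its eigenvalues are non-negative. This is a standalone sketch - the helpers gaussian_kernel and gram_matrix exist only for this example, not in the library:

import numpy as np

def gaussian_kernel(sigma):
    # kappa(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    def kappa(x, y):
        return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
    return kappa

def gram_matrix(kappa, X):
    n_samples = X.shape[0]
    K = np.empty((n_samples, n_samples))
    for i in range(n_samples):
        for j in range(n_samples):
            K[i, j] = kappa(X[i], X[j])
    return K

# All eigenvalues of the (symmetric) Gram matrix should be non-negative,
# up to floating-point error.
X = np.random.normal(size=(10, 2))
K = gram_matrix(gaussian_kernel(1.0), X)
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # True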

We can show that, for a given feature mapping $\nu$ satisfying the previous condition, the Lagrangian dual of the problem of finding the maximum-margin hyperplane takes the form:

$$\inf_{z \in \mathbb{R}^n} \; \frac{1}{2} \Bigl\| \sum_{i=1}^{n} y_i \nu(x_i) z_i \Bigr\|_2^2 - e^T z$$

subject to $z \geq 0$ and $\langle z, y \rangle = 0$, where $e$ denotes the vector of all ones.
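
To see how this maps onto the solver call in the implementation below, expand the squared norm: with $P_{ij} = y_i y_j \kappa(x_i, x_j)$ and $q = -e$ (a standard rewriting, using only the definitions above), the problem becomes

$$\min_{z \in \mathbb{R}^n} \; \frac{1}{2} z^T P z + q^T z \quad \text{subject to} \quad -z \leq 0, \quad y^T z = 0,$$

which is exactly the form $\min \frac{1}{2} x^T P x + q^T x$ subject to $Gx \preceq h$, $Ax = b$ accepted by cvxopt.solvers.qp.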

Given a resulting vector of Lagrange multipliers $z$, we find that most $z_i$ are zero. This comes from the complementary slackness conditions in our optimization problem - either $(x_i, y_i)$ lies on the maximum margin (and the corresponding Lagrange multiplier can be nonzero), or it lies off the margin (and the Lagrange multiplier must be zero).
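
In symbols, writing $b$ for the bias term, the complementary slackness condition for the hard-margin problem reads

$$z_i \bigl( y_i (\langle w, \nu(x_i) \rangle + b) - 1 \bigr) = 0 \quad \text{for all } i,$$

so $z_i > 0$ forces $y_i (\langle w, \nu(x_i) \rangle + b) = 1$, i.e. $x_i$ lies exactly on the margin. The training points with nonzero multipliers are called support vectors.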

The prediction for a given feature vector $x$ takes the form

$$\langle w, \nu(x) \rangle = \sum_{i=1}^{n} z_i y_i \langle \nu(x_i), \nu(x) \rangle = \sum_{i=1}^{n} z_i y_i \kappa(x_i, x),$$

where we can take the sum over only the non-zero $z_i$.

This yields a very efficient prediction algorithm - once we have trained our SVM, a large portion of the training data (the samples with zero Lagrange multipliers) can be discarded.

There are more complications (handling the bias term, handling non-separable datasets), but this is the gist of the algorithm.
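
For reference, both complications surface in the implementation below: the non-separable (soft-margin) case adds a box constraint on the multipliers, and the bias can be recovered from any support vector $x_k$ (the code averages this over all support vectors to reduce numerical error):

$$0 \leq z_i \leq C \quad \text{for all } i, \qquad b = y_k - \sum_{i=1}^{n} z_i y_i \kappa(x_i, x_k).$$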

Implementation

The full Python implementation of the training algorithm (using cvxopt as the quadratic programming solver) is given below:

import numpy as np
import cvxopt.solvers

# Lagrange multipliers below this (small) threshold are treated as zero when
# selecting the support vectors.
MIN_SUPPORT_VECTOR_MULTIPLIER = 1e-5


class SVMTrainer(object):
    def __init__(self, kernel, c):
        self._kernel = kernel
        self._c = c

    def train(self, X, y):
        """Given the training features X with labels y, returns an SVM
        predictor representing the trained SVM.
        """
        lagrange_multipliers = self._compute_multipliers(X, y)
        return self._construct_predictor(X, y, lagrange_multipliers)

    def _gram_matrix(self, X):
        n_samples, n_features = X.shape
        K = np.zeros((n_samples, n_samples))
        # TODO(tulloch) - vectorize
        for i, x_i in enumerate(X):
            for j, x_j in enumerate(X):
                K[i, j] = self._kernel(x_i, x_j)
        return K

    def _construct_predictor(self, X, y, lagrange_multipliers):
        support_vector_indices = \
            lagrange_multipliers > MIN_SUPPORT_VECTOR_MULTIPLIER

        support_multipliers = lagrange_multipliers[support_vector_indices]
        support_vectors = X[support_vector_indices]
        support_vector_labels = y[support_vector_indices]

        # http://www.cs.cmu.edu/~guestrin/Class/10701-S07/Slides/kernels.pdf
        # bias = y_k - \sum z_i y_i K(x_k, x_i)
        # Averaged over all support vectors to reduce numerical error.
        bias = np.mean(
            [y_k - sum(z_i * y_i * self._kernel(x_i, x_k)
                       for (z_i, y_i, x_i) in zip(support_multipliers,
                                                  support_vector_labels,
                                                  support_vectors))
             for (y_k, x_k) in zip(support_vector_labels, support_vectors)])

        return SVMPredictor(
            kernel=self._kernel,
            bias=bias,
            weights=support_multipliers,
            support_vectors=support_vectors,
            support_vector_labels=support_vector_labels)

    def _compute_multipliers(self, X, y):
        n_samples, n_features = X.shape

        K = self._gram_matrix(X)
        # Solves
        #   min 1/2 x^T P x + q^T x
        #   s.t.
        #       Gx \coneleq h
        #       Ax = b
        P = cvxopt.matrix(np.outer(y, y) * K)
        q = cvxopt.matrix(-1 * np.ones(n_samples))

        # -a_i \leq 0
        G_std = cvxopt.matrix(np.diag(np.ones(n_samples) * -1))
        h_std = cvxopt.matrix(np.zeros(n_samples))

        # a_i \leq c (the soft-margin box constraint)
        G_slack = cvxopt.matrix(np.diag(np.ones(n_samples)))
        h_slack = cvxopt.matrix(np.ones(n_samples) * self._c)

        G = cvxopt.matrix(np.vstack((G_std, G_slack)))
        h = cvxopt.matrix(np.vstack((h_std, h_slack)))

        A = cvxopt.matrix(y, (1, n_samples))
        b = cvxopt.matrix(0.0)

        solution = cvxopt.solvers.qp(P, q, G, h, A, b)

        # Lagrange multipliers
        return np.ravel(solution['x'])
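
As an aside, the Gram-matrix construction flagged with a TODO can be vectorized whenever the kernel has a closed form over whole arrays. A sketch for the linear and Gaussian kernels (the helpers linear_gram_matrix and gaussian_gram_matrix are for illustration only, not part of svmpy):

import numpy as np

def linear_gram_matrix(X):
    # kappa(x, y) = <x, y>  =>  K = X X^T
    return X @ X.T

def gaussian_gram_matrix(X, sigma):
    # Uses ||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y> to avoid the double loop.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma ** 2))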

The code is fairly self-explanatory, and follows the given training algorithm quite closely. To compute our Lagrange multipliers, we simply construct the Gram matrix and solve the given QP. We then pass the support vectors, their labels, and the corresponding Lagrange multipliers (the weights) on to the SVMPredictor, whose implementation is given below.

class SVMPredictor(object):
    def __init__(self,
                 kernel,
                 bias,
                 weights,
                 support_vectors,
                 support_vector_labels):
        self._kernel = kernel
        self._bias = bias
        self._weights = weights
        self._support_vectors = support_vectors
        self._support_vector_labels = support_vector_labels

    def predict(self, x):
        """
        Computes the SVM prediction on the given features x.
        """
        result = self._bias
        for z_i, x_i, y_i in zip(self._weights,
                                 self._support_vectors,
                                 self._support_vector_labels):
            result += z_i * y_i * self._kernel(x_i, x)
        return np.sign(result).item()

This simply implements the above prediction equation.

A sample list of kernel functions is given below:

import numpy as np
import numpy.linalg as la


class Kernel(object):
    """Implements list of kernels from
    http://en.wikipedia.org/wiki/Support_vector_machine
    """
    @staticmethod
    def linear():
        def f(x, y):
            return np.inner(x, y)
        return f

    @staticmethod
    def gaussian(sigma):
        def f(x, y):
            # exp(-||x - y||^2 / (2 sigma^2))
            exponent = -(la.norm(x - y) ** 2) / (2 * sigma ** 2)
            return np.exp(exponent)
        return f

    @staticmethod
    def _polykernel(dimension, offset):
        def f(x, y):
            return (offset + np.dot(x, y)) ** dimension
        return f

    @staticmethod
    def inhomogenous_polynomial(dimension):
        return Kernel._polykernel(dimension=dimension, offset=1.0)

    @staticmethod
    def homogenous_polynomial(dimension):
        return Kernel._polykernel(dimension=dimension, offset=0.0)

    @staticmethod
    def hyperbolic_tangent(kappa, c):
        def f(x, y):
            return np.tanh(kappa * np.dot(x, y) + c)
        return f
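
Each static method returns a closure that evaluates the kernel on a pair of feature vectors. A quick sanity check (not part of the library, just an illustration of the interface):

import numpy as np

k = Kernel.gaussian(sigma=1.0)
# ||(0,0) - (1,1)||^2 = 2, so the kernel value is exp(-2 / 2) = exp(-1) ≈ 0.368.
print(k(np.array([0.0, 0.0]), np.array([1.0, 1.0])))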

Demonstration

We demonstrate by drawing pairs of independent standard normal variables as features, labelling each point by the sign of the sum of its components, $y_i = \operatorname{sign}\bigl(\sum_j x_{ij}\bigr)$. This is trivially linearly separable, so we train a linear SVM (where $\kappa(x_i, x_j) = \langle x_i, x_j \rangle$) on the sample data.

We then visualize the samples and decision boundary of the SVM on this dataset, using matplotlib. See this gist for details on the implementation.
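
A condensed sketch of that demonstration (the sample count, grid resolution, and the choice of c = 0.1 are illustrative assumptions; the linked gist contains the full version):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
num_samples, num_features, grid_size = 50, 2, 100

# Pairs of independent standard normals, labelled by the sign of their sum.
X = np.random.normal(size=(num_samples, num_features))
y = np.sign(X.sum(axis=1))

predictor = SVMTrainer(Kernel.linear(), c=0.1).train(X, y)

# Evaluate the trained SVM on a grid and draw the decision regions.
xs = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, grid_size)
ys = np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, grid_size)
xx, yy = np.meshgrid(xs, ys)
zz = np.array([predictor.predict(np.array([a, b]))
               for a, b in zip(xx.ravel(), yy.ravel())]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()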

An example output of this demonstration is given below:

[Figure: SVM demonstration - the training samples and the learned decision boundary.]

More Information

See the svmpy library on GitHub for all code used in this post.