A basic soft-margin kernel SVM implementation in Python
Support Vector Machines (SVMs) are a family of supervised learning algorithms that can train classification and regression models efficiently and with very good performance in practice.
SVMs are also rooted in convex optimization and Hilbert space theory, and there is a lot of beautiful mathematics in the derivation of various aspects of the training algorithm, which we will go into in subsequent posts.
For now, we'll just give an introduction to the basic theory of soft-margin kernel SVMs. The classical treatment starts with hard-margin linear SVMs and only then introduces the kernel trick and the soft-margin formulation, so this presentation is somewhat faster-moving than most.
Mathematical Formulation
We consider our training set to be
$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}, \quad x_i \in \mathbb{R}^d, \quad y_i \in \{-1, 1\}.$$
The key idea is that we seek to find a hyperplane $w$ separating our data - and to maximize the margin of this hyperplane to optimize decision-theoretic metrics.
Let $\kappa$ be a kernel function on $\mathbb{R}^d \times \mathbb{R}^d$ - a function such that the matrix $K$ with $K_{ij} = \kappa(x_i, x_j)$ is positive semidefinite. A key property of such kernel functions is that there exists a map $\nu$ such that $\langle \nu(x), \nu(y) \rangle = \kappa(x, y)$. One can think of $\nu$ as mapping our input features into a higher-dimensional output space.
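As a concrete (purely illustrative) example, for the degree-2 homogeneous polynomial kernel $\kappa(x, y) = \langle x, y \rangle^2$ on $\mathbb{R}^2$, one such map is $\nu(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$, which we can check numerically:

import numpy as np

def nu(x):
    # Explicit feature map for kappa(x, y) = <x, y>^2 on R^2.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

kernel_value = np.dot(x, y) ** 2        # kappa(x, y), evaluated directly
feature_value = np.dot(nu(x), nu(y))    # <nu(x), nu(y)> in the feature space

assert np.isclose(kernel_value, feature_value)  # both equal 1.0 for these inputs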
We can show that for a given feature mapping $\nu$ satisfying the previous condition, the dual of the problem of finding the maximum-margin hyperplane takes the form:

$$\inf_{z \in \mathbb{R}^n} \; \frac{1}{2} \left\| \sum_{i=1}^{n} y_i \nu(x_i) z_i \right\|_2^2 - e^T z$$
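Expanding the squared norm and using $\langle \nu(x_i), \nu(x_j) \rangle = \kappa(x_i, x_j)$, the objective can be written purely in terms of the kernel, so $\nu$ never needs to be computed explicitly. With the standard soft-margin constraints attached (this is the quadratic program the code below sets up), the dual reads:

$$
\begin{aligned}
\min_{z \in \mathbb{R}^n} \quad & \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} z_i z_j y_i y_j \kappa(x_i, x_j) - \sum_{i=1}^{n} z_i \\
\text{subject to} \quad & 0 \leq z_i \leq C, \quad i = 1, \dots, n, \\
& \sum_{i=1}^{n} y_i z_i = 0,
\end{aligned}
$$

where $C$ is the soft-margin penalty (the `c` parameter in the implementation).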
Given a resulting vector of Lagrange multipliers $z$, we find that most $z_i$ are zero. This comes from the complementary slackness conditions in our optimization problem - either $(x_i, y_i)$ lies on the maximum margin (in which case the corresponding Lagrange multiplier may be nonzero), or it lies off the margin (in which case the multiplier is zero).
The prediction for a given feature vector $x$ takes the form

$$\langle w, \nu(x) \rangle = \sum_{i=1}^{n} z_i y_i \langle \nu(x_i), \nu(x) \rangle = \sum_{i=1}^{n} z_i y_i \kappa(x_i, x).$$
This yields a very efficient prediction algorithm - once we have trained our SVM, a large amount of the training data (those samples with zero Lagrange multipliers) can be discarded; prediction only requires kernel evaluations against the remaining support vectors.
There are more complications (handling the bias term, handling non-separable datasets), but this is the gist of the algorithm.
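For the bias term in particular, any support vector $x_k$ lying exactly on the margin satisfies $y_k (\langle w, \nu(x_k) \rangle + b) = 1$, and since $y_k \in \{-1, 1\}$ this gives

$$ b = y_k - \sum_{i=1}^{n} z_i y_i \kappa(x_i, x_k); $$

the implementation below averages this quantity over all support vectors rather than relying on a single $x_k$.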
Implementation
The full implementation of the training (using cvxopt as a quadratic program solver) in Python is given below:
import numpy as np
import cvxopt
import cvxopt.solvers

# Lagrange multipliers below this small tolerance are treated as zero; the
# exact value is not critical, any small positive threshold will do.
MIN_SUPPORT_VECTOR_MULTIPLIER = 1e-5


class SVMTrainer(object):
    def __init__(self, kernel, c):
        self._kernel = kernel
        self._c = c

    def train(self, X, y):
        """Given the training features X with labels y, returns an SVM
        predictor representing the trained SVM.
        """
        lagrange_multipliers = self._compute_multipliers(X, y)
        return self._construct_predictor(X, y, lagrange_multipliers)

    def _gram_matrix(self, X):
        n_samples, n_features = X.shape
        K = np.zeros((n_samples, n_samples))
        # TODO(tulloch) - vectorize
        for i, x_i in enumerate(X):
            for j, x_j in enumerate(X):
                K[i, j] = self._kernel(x_i, x_j)
        return K

    def _construct_predictor(self, X, y, lagrange_multipliers):
        support_vector_indices = \
            lagrange_multipliers > MIN_SUPPORT_VECTOR_MULTIPLIER

        support_multipliers = lagrange_multipliers[support_vector_indices]
        support_vectors = X[support_vector_indices]
        support_vector_labels = y[support_vector_indices]

        # http://www.cs.cmu.edu/~guestrin/Class/10701-S07/Slides/kernels.pdf
        # bias = y_k - \sum z_i y_i K(x_k, x_i)
        # Thus we can just predict an example with bias of zero, and
        # compute error.
        bias = np.mean(
            [y_k - SVMPredictor(
                kernel=self._kernel,
                bias=0.0,
                weights=support_multipliers,
                support_vectors=support_vectors,
                support_vector_labels=support_vector_labels).predict(x_k)
             for (y_k, x_k) in zip(support_vector_labels, support_vectors)])

        return SVMPredictor(
            kernel=self._kernel,
            bias=bias,
            weights=support_multipliers,
            support_vectors=support_vectors,
            support_vector_labels=support_vector_labels)

    def _compute_multipliers(self, X, y):
        n_samples, n_features = X.shape

        K = self._gram_matrix(X)
        # Solves
        # min 1/2 x^T P x + q^T x
        # s.t.
        #  Gx \coneleq h
        #  Ax = b
        P = cvxopt.matrix(np.outer(y, y) * K)
        q = cvxopt.matrix(-1 * np.ones(n_samples))

        # -a_i \leq 0
        G_std = cvxopt.matrix(np.diag(np.ones(n_samples) * -1))
        h_std = cvxopt.matrix(np.zeros(n_samples))

        # a_i \leq c (the soft-margin box constraint)
        G_slack = cvxopt.matrix(np.diag(np.ones(n_samples)))
        h_slack = cvxopt.matrix(np.ones(n_samples) * self._c)

        G = cvxopt.matrix(np.vstack((G_std, G_slack)))
        h = cvxopt.matrix(np.vstack((h_std, h_slack)))

        A = cvxopt.matrix(y, (1, n_samples))
        b = cvxopt.matrix(0.0)

        solution = cvxopt.solvers.qp(P, q, G, h, A, b)

        # Lagrange multipliers
        return np.ravel(solution['x'])
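As the TODO in _gram_matrix notes, the double Python loop can be vectorized for specific kernels. A possible sketch, not part of the library (linear_gram_matrix and gaussian_gram_matrix are our own illustrative helpers):

import numpy as np

def linear_gram_matrix(X):
    # For the linear kernel kappa(x, y) = <x, y>, the whole Gram matrix
    # is a single matrix product instead of an n^2 Python loop.
    return X @ X.T

def gaussian_gram_matrix(X, sigma):
    # For the Gaussian kernel, expand ||x - y||^2 = ||x||^2 + ||y||^2 - 2<x, y>,
    # so all pairwise squared distances also come from one matrix product.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma ** 2))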
The trainer code is fairly self-explanatory and follows the given training algorithm quite closely. To compute our Lagrange multipliers, we simply construct the Gram matrix and solve the given QP. We then pass our support vectors, their labels, and their corresponding Lagrange multipliers (the weights) to the SVMPredictor, whose implementation is given below.
class SVMPredictor(object):
    def __init__(self,
                 kernel,
                 bias,
                 weights,
                 support_vectors,
                 support_vector_labels):
        self._kernel = kernel
        self._bias = bias
        self._weights = weights
        self._support_vectors = support_vectors
        self._support_vector_labels = support_vector_labels

    def predict(self, x):
        """
        Computes the SVM prediction on the given features x.
        """
        result = self._bias
        for z_i, x_i, y_i in zip(self._weights,
                                 self._support_vectors,
                                 self._support_vector_labels):
            result += z_i * y_i * self._kernel(x_i, x)
        return np.sign(result).item()
This simply implements the above prediction equation.
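As a quick, hypothetical usage sketch (toy values chosen only for illustration, and using the Kernel helpers defined in the next listing), training and predicting with these two classes looks like:

import numpy as np

# Tiny, linearly separable toy data: features as floats, labels as +/-1.0 floats.
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

trainer = SVMTrainer(kernel=Kernel.gaussian(sigma=1.0), c=1.0)
predictor = trainer.train(X, y)

# Expected to print 1.0, since the test point sits among the positive examples.
print(predictor.predict(np.array([0.5, 1.5])))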
A sample list of kernel functions is given below:
import numpy as np
import numpy.linalg as la


class Kernel(object):
    """Implements a list of kernels from
    http://en.wikipedia.org/wiki/Support_vector_machine
    """
    @staticmethod
    def linear():
        def f(x, y):
            return np.inner(x, y)
        return f

    @staticmethod
    def gaussian(sigma):
        def f(x, y):
            # exp(-||x - y||^2 / (2 sigma^2))
            exponent = -la.norm(x - y) ** 2 / (2 * sigma ** 2)
            return np.exp(exponent)
        return f

    @staticmethod
    def _polykernel(dimension, offset):
        def f(x, y):
            return (offset + np.dot(x, y)) ** dimension
        return f

    @staticmethod
    def inhomogenous_polynomial(dimension):
        return Kernel._polykernel(dimension=dimension, offset=1.0)

    @staticmethod
    def homogenous_polynomial(dimension):
        return Kernel._polykernel(dimension=dimension, offset=0.0)

    @staticmethod
    def hyperbolic_tangent(kappa, c):
        def f(x, y):
            return np.tanh(kappa * np.dot(x, y) + c)
        return f
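Each static method returns a closure over its parameters, so a configured kernel can be passed around as an ordinary function - for example (illustrative values only):

import numpy as np

k = Kernel.gaussian(sigma=0.5)
print(k(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0, since the inputs are identical
print(k(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # exp(-2 / (2 * 0.25)) = exp(-4), roughly 0.018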
Demonstration
We demonstrate by drawing pairs of independent standard normal variables as features, labelling each sample $y_i = \operatorname{sign}\left(\sum_j x_{ij}\right)$. This is trivially linearly separable, so we train a linear SVM (where $\kappa(x_i, x_j) = \langle x_i, x_j \rangle$) on the sample data.
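A minimal sketch of that setup might look like the following (the number of samples and the value of c are arbitrary illustrative choices; the plotting code itself lives in the gist referenced below):

import numpy as np

num_samples = 100

# Pairs of independent standard normals, labelled by the sign of their sum.
X = np.random.normal(size=(num_samples, 2))
y = np.sign(np.sum(X, axis=1))

predictor = SVMTrainer(kernel=Kernel.linear(), c=0.1).train(X, y)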
We then visualize the samples and the decision boundary of the SVM on this dataset using matplotlib. See this gist for details on the implementation.
An example output of this demonstration is given below:
More Information
See the svmpy library on GitHub for all code used in this post.