# Cambridge Part III Mathematics Notes

I've cleaned up (somewhat) my notes from Cambridge Part III and have put them online - with LaTeX sources available on GitHub and PDFs linked below.

I've cleaned up (somewhat) my notes from Cambridge Part III and have put them online - with LaTeX sources available on GitHub and PDFs linked below.

As far as I can tell, the LASSO estimator is the closest thing we have to a miracle in modern statistics.

The LASSO estimator is defined as a solution to the minimization problem $\frac{1}{n} \| Y - X \theta \|_2^2 + \lambda \| \theta \|_1$ over $\mathbb{R}^p$. The key insight here is that this is a convex problem in $\theta$ - this follows from both norms being convex and the sum of convex functions being convex. This allows us to design efficient solvers for this problem and thus handle large-scale problems - see, for example, ADMM, iterative shrinkage, gradient projection, etc.

The LASSO can be viewed as convex relaxation of a very natural problem in statistical estimation - finding the best $k$-sparse vector to minimize $\| Y - X \theta \| + \lambda \| \theta \|_0$, where the $L_0$ norm (indeed, not actually a norm) is to be interpreted as the number of non-zero coefficients in $\theta$. This comes from problems such as in signal processing and genomics array data where we have $p$ (the number of covariates) significantly larger than $n$, the number of observations. In this case, the usual least-squares estimation theory dating back to Gauss does not apply ($X$ cannot have full rank), and we must find other alternatives. The brute-force approach is combinatorially hard (we must check each $p \choose k$ sets of supports, which takes time exponential in $p$).

Thus, the LASSO objective can be seen as a natural convex relation of the original problem (e.g. taking $p$ from $0$ upwards and stopping as soon as $\| \theta \|_p$ is convex).

The "miracle" I refer to is the amazing result of Candes & Tao in a series of papers starting in 2005 that established that for a large class of observation matrices $X$, we have the amazing result that with very high probability, solving the LASSO problem is equivalent to solving the original combinatorially hard problem. Formally, we have the following theorem, which contains a germ of the restricted isometry property.

Let $\theta_0$ be $k$-sparse with support $S_0$, and let $Y = X \theta + \epsilon$, with $\epsilon \sim N(0, \sigma^2 I_n)$. Let $\tilde \theta$ be the LASSO estimator of $(Y, X)$ with parameter $\lambda = 4 \overline \sigma \sqrt{\frac{t^2 + 2 \log p}{n}}$. Assume that $\| \tilde \theta_{S_0} - \theta_0 \|_1^2 \leq k r_0 (\tilde \theta - \theta_0)^T \hat \Sigma (\tilde \theta - \theta_0)$ with probability at least $1 - \beta$ for some $r_0$. Then with probability at least $1 - \beta - e^{-\frac{t^2}{2}}$, we have that \begin{equation} \frac{1}{n} \| X(\tilde \theta - \theta_0) \|_2^2 + \lambda \| \tilde \theta - \theta_0 \| \leq 4 k r_0 \lambda^2 \end{equation}

The proof is fairly elementary, requiring only a basic concentration of measure inequality for subgaussian random variables. The key step is recognizing that we can bound the event $\max_{j} \frac{2}{n} \| (\epsilon^T X)_j \| \geq \frac{\lambda}{2}$ on a set of measure less than $e^-\frac{t^2}{2}$. Once we have this result, we can apply the triangle inequality and the restricted isometry condition in the theorem to obtain the desired result.

About a year ago I switched from Vim to Emacs, and I couldn't be happier about the move. I spent some time getting a setup I was happy with, and thought I'd share it for those who are also looking to move to Emacs.

For more information, my entire `.emacs.d`

is available in my
dots repository on GitHub.

A cute interview question I've had is "given an $n \times n$ board of characters and a dictionary, find all possible words formed by a self-avoiding path in the grid". This is otherwise known as "Boggle".

A simple solution is just conducting a graph traversal of the game board - for each of the $n^2$ positions, we conduct a DFS search starting at that position, tracking the previously visited positions at each stage.

This is trivial to do in any language, but I thought it would be
an interesting application of Simon Marlow's `Control.Monad.Parallel`

Haskell library to conduct these traversals in parallel.

The full implementation is available on GitHub at https://github.com/ajtulloch/boggle.

In Speeding up isotonic regression in scikit-learn, we dropped down into Cython to improve the performance of a regression algorithm. I thought it would be interesting to compare the performance of this (optimized) code in Python against the naive Julia implementation.

This article continues on from the previous one, so it may be worth reading that before continuing here to obtain the necessary background information.

We'll implement both of the algorithms for the previous article, and compare their performance in Julia against Python.

Isotonic regression is a useful non-parametric regression technique for fitting an increasing function to a given dataset.

A classic use is in improving the calibration of a probabilistic classifier. Say we have a set of 0/1 data-points (e.g. ad clicks), and we train a probabilistic classifier on this dataset.

Unfortunately, we find that our classifier is poorly calibrated - for cases where it predicts ~50% probability of a click, there is actually a 20% probability of a click, and so on.

In this case, we can learn an isotonic regression model on the output of the classifier, where our increasing function we fit is $\mathcal{P}(+ \, | \, \text{classifiers prediction})$. The constraint that the function is increasing means that the ordering of events is preserved by the transformation, which is an important constraint.

With a trained isotonic regression model, our final output is the composition of the classifiers prediction with the isotonic regression function.

For an example of this usage, see the Google Ad Click Prediction - A View from the Trenches paper from KDD 2013, which covers this technique in section 7. The AdPredictor ICML paper paper also uses this technique for calibrating a Naive Bayes predictor.

We'll now detail how we made the `scikit-learn`

implementation of isotonic regression more than ~5,000x faster, while
reducing the number of lines of code in the implementation.

The Stripe CTF Distributed Systems edition has just finished, and I passed all the levels (and was one of the first fifty contestants to finish). In constructing my solutions, I thought it would be an interesting challenge to attempt to construct the minimal changes to the reference solutions that are sufficient to pass the scoring requirements.

To be clear - these aren't high-quality (or high-scoring) solutions.
I'm not especially proud of these. They are just *small* solutions,
and somewhat interesting for that reason.

UPDATE: Compiled evaluation is now implemented for
scikit-learn regression tree/ensemble
models, available at
https://github.com/ajtulloch/sklearn-compiledtrees or ```
pip install
sklearn-compiledtrees
```

.

Our previous article on decision trees dealt
with techniques to speed up the training process, though often the
performance-critical component of the machine learning pipeline is the
prediction side. Training takes place offline, whereas predictions are
often in the hot path - consider ranking documents in response to a
user query *a-la* Google, Bing, etc. Many candidate documents need to
be scored as quickly as possible, and the top *k* results returned to
the user.

Here, we'll focus on on a few methods to improve the performance of evaluating an ensemble of decision trees - encompassing random forests, gradient boosted decision trees, AdaBoost, etc.

There are three methods we'll focus on here:

- Recursive tree walking (
*naive*) - Flattening the decision tree (
*flattened*) - Compiling the tree to machine code (
*compiled*)

We'll show that choosing the right strategy can improve evaluation time by more than 2x - which can be a very significant performance improvement indeed.

All code (implementation, drivers, analysis scripts) are available on GitHub at the decisiontrees-performance repository.

Online learning (as opposed to more traditional batched machine learning) is more and more commonly applied to training machine learned models at scale. The chief advantage is that the model is trained via a streaming approach, and thus the entire dataset used when training does not need to be held in memory at any given time.

That is to say, we can consider that a model parameters have a current state $\mathbf{w}$, and we observe our examples $(y, \mathbf{x})$ with ($y$ the label and $x$ the features of the given example) in a streaming fashion. At each example, we update our weights from the given example, and these weights are used as a starting point.

Microsoft's AdPredictor model from ICML 2010 is an online learning model that has been successfully applied in the context of click prediction for online advertising.

Here, we'll implement the AdPredictor algorithm (in Python), and demonstrate how online learning works via visualizations of the trained parameters $\mathbf{w}$.

Support Vector Machines (SVMs) are a family of nice supervised learning algorithms that can train classification and regression models efficiently and with very good performance in practice.

SVMs are also rooted in convex optimization and Hilbert space theory, and there is a lot of beautiful mathematics in the derivation of various aspects of the training algorithm, which we will go into in subsequent posts.

For now, we'll just give an introduction to the basic theory of soft-margin kernel SVMs. The classical treatment is to start with hard-margin linear SVMs, then introduce the kernel trick and the soft-margin formulation, so this is somewhat faster-moving than other presentations.

Let $\Theta \subseteq \mathbb{R}^{p}$ be compact. Let $Q: \Theta \rightarrow \mathbb{R}$ be a continuous, non-random function that has a unique minimizer $\theta_{0} \in \Theta$.

Let $Q_{n}: \Theta \rightarrow \mathbb{R}$ be any sequence of random functions such that

\begin{equation} \sup_{\theta \in \Theta} |Q_{n}(\theta) - Q(\theta)| \rightarrow 0 \end{equation} as $n \rightarrow \infty$.

If $\theta_{n}$ is *any* sequence of minimizers of $Q_{n}$,
then $\hat \theta_{n} \rightarrow \theta_{0}$ in probability as $n \rightarrow \infty$.

The classic algorithm for training a decision tree for classification/regression problems (CART) is well known. The underlying algorithm acts by recursively partitioning the dataset into subsets that maximize the 'clustering' of examples in each of the partitioned subsets, where the metric used for clustering varies depending on the problem (for example, information gain, Gini loss, etc, have been used successfully in the literature).

For a high level overview of the algorithm, see the following snippet of Haskell code code from the haskell-ml project project.

The *Itō isometry* is a useful theorem in stochastic calculus that
provides a fundamental tool in computing stochastic integrals -
integrals with respect to a Brownian motion \begin{equation}
\int_{0}^{\infty} f(s) dB_{s} \end{equation} with $B_{s}$ a
Brownian motion.

First, we'll define a predictable process.

Haskell is an absolute pleasure to write code in, and I've been trying to use it more and more. It's a language that rewards extended effort in a way that `C++`

et. al. do not.

Consider the following program, illustrating a basic BFS/DFS search through a tree in Haskell. It illustrates a number of useful concepts - algebraic data types, type classes, and QuickCheck.

Gradient boosted decision trees are an effective off-the-shelf method for generating effective models for classification and regression tasks.

Gradient boosting is a generic technique that can be applied to arbitrary 'underlying' weak learners - typically decision trees are used. The seminal reference is Friedman's 2001 paper that introduced the method.

Consider a typical supervised learning problem - we are given as input labeled feature vectors $(x, y)$, and seek to estimate a function $\hat F(x)$ that approximates the 'true' mapping $F^\star$, with $F^\star$ minimizing the expected loss $L(y, F(x)$ over some candidate functions $\mathcal{F}$ for a loss function $L$.

This is a compilation of various sets of lecture notes I created
during my Bachelors degree in Mathematics at the University of Sydney.
All `.tex`

files are available at the
GitHub repository.

The Stanford textbook Elements of Statistical Learning by
Hastie, Tibshirani, and Friedman
is an excellent (and freely available) graduate-level
text in data mining and machine learning. I'm currently working
through it, and I'm putting my (partial) exercise solutions up for
anyone who might find them useful. The first set of solutions is for
Chapter 2, *An Overview of Supervised Learning*, introducing least
squares and *k*-nearest-neighbour techniques.

See the solutions in PDF format (source) for a more pleasant reading experience. This webpage was created from the LaTeX source using the LaTeX2Markdown utility - check it out on GitHub.