# The Performance of Decision Tree Evaluation Strategies

Our previous article on decision trees dealt with techniques to speed up the training process, but in practice the performance-critical component of the machine learning pipeline is often the prediction side. Training takes place offline, whereas predictions are often in the hot path - consider ranking documents in response to a user query à la Google, Bing, etc. Many candidate documents need to be scored as quickly as possible, and the top k results returned to the user.

Here, we'll focus on a few methods to improve the performance of evaluating an ensemble of decision trees - encompassing random forests, gradient boosted decision trees, AdaBoost, etc.

There are three methods we'll focus on here:

• Recursive tree walking (naive)
• Flattening the decision tree (flattened)
• Compiling the tree to machine code (compiled)

We'll show that choosing the right strategy can improve evaluation time by more than 2x - which can be a very significant performance improvement indeed.
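As a rough, minimal sketch of the first two strategies (the node layout and names below are illustrative, not the ones used in the repository):

```python
# Naive strategy: walk the tree recursively, following left/right children.
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.value = left, right, value

def evaluate_recursive(node, x):
    if node.value is not None:               # leaf node holds the prediction
        return node.value
    if x[node.feature] < node.threshold:
        return evaluate_recursive(node.left, x)
    return evaluate_recursive(node.right, x)

# Flattened strategy: store the tree in parallel arrays indexed by node id and
# iterate, avoiding function-call overhead and pointer chasing.
def evaluate_flattened(feature, threshold, left, right, value, x):
    i = 0
    while left[i] != -1:                     # -1 marks a leaf
        i = left[i] if x[feature[i]] < threshold[i] else right[i]
    return value[i]
```

The compiled strategy goes one step further, generating code specialized to a particular tree (for instance, emitting the tree as nested if/else branches and compiling the result).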

All code (implementation, drivers, analysis scripts) is available on GitHub at the decisiontrees-performance repository.

Continue...

# Online Learning with Microsoft's AdPredictor algorithm

Online learning (as opposed to more traditional batch machine learning) is increasingly applied to training machine-learned models at scale. The chief advantage is that the model is trained in a streaming fashion, so the entire training dataset does not need to be held in memory at any given time.

That is to say, the model parameters have a current state $\mathbf{w}$, and we observe our examples $(y, \mathbf{x})$ (with $y$ the label and $\mathbf{x}$ the features of the given example) in a streaming fashion. At each example, we update the weights from that example alone, and the updated weights become the starting point for the next example.
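As a minimal sketch of this streaming structure (using a plain logistic-regression SGD step purely as a stand-in - AdPredictor's actual Bayesian update is what the post covers):

```python
import math

def online_update(w, x, y, lr=0.1):
    """One online update of the weights w from a single example (y, x).

    w and x are dicts mapping feature -> value; y is +1 or -1. This is an
    ordinary logistic-regression SGD step, shown only to illustrate the
    streaming structure; AdPredictor substitutes its own update rule.
    """
    margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
    scale = -y / (1.0 + math.exp(margin))   # d/d(w.x) of log(1 + exp(-y w.x))
    for f, v in x.items():
        w[f] = w.get(f, 0.0) - lr * scale * v
    return w

examples = [({"bias": 1.0, "ad_id=42": 1.0}, 1),
            ({"bias": 1.0, "ad_id=17": 1.0}, -1)]

w = {}                          # current model state
for x, y in examples:           # examples arrive one at a time
    w = online_update(w, x, y)  # update, then carry w forward
```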

Microsoft's AdPredictor model from ICML 2010 is an online learning model that has been successfully applied in the context of click prediction for online advertising.

Here, we'll implement the AdPredictor algorithm (in Python), and demonstrate how online learning works via visualizations of the trained parameters $\mathbf{w}$.

Continue...

# A basic soft-margin kernel SVM implementation in Python

Support Vector Machines (SVMs) are a family of supervised learning algorithms that can train classification and regression models efficiently and with very good performance in practice.

SVMs are also rooted in convex optimization and Hilbert space theory, and there is a lot of beautiful mathematics in the derivation of various aspects of the training algorithm, which we will go into in subsequent posts.

For now, we'll just give an introduction to the basic theory of soft-margin kernel SVMs. The classical treatment starts with hard-margin linear SVMs and only then introduces the kernel trick and the soft-margin formulation, so this presentation is somewhat faster-moving than most.
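For reference, the soft-margin primal problem in question is the standard one: given training data $(\mathbf{x}_i, y_i)$ with $y_i \in \{-1, +1\}$ and a regularization parameter $C > 0$,

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0.$$

The kernel trick enters through the dual of this problem, where the data appear only via inner products $\mathbf{x}_i \cdot \mathbf{x}_j$, which can be replaced by kernel evaluations $k(\mathbf{x}_i, \mathbf{x}_j)$.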

Continue...

# Consistency of M-estimators

Let $\Theta \subseteq \mathbb{R}^{p}$ be compact. Let $Q: \Theta \rightarrow \mathbb{R}$ be a continuous, non-random function that has a unique minimizer $\theta_{0} \in \Theta$.

Let $Q_{n}: \Theta \rightarrow \mathbb{R}$ be any sequence of random functions such that

$$\sup_{\theta \in \Theta} |Q_{n}(\theta) - Q(\theta)| \rightarrow 0$$ in probability as $n \rightarrow \infty$.

If $\hat{\theta}_{n}$ is any sequence of minimizers of $Q_{n}$, then $\hat{\theta}_{n} \rightarrow \theta_{0}$ in probability as $n \rightarrow \infty$.
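(A standard example, not part of the statement above: maximum likelihood estimation fits this framework. For i.i.d. observations $X_1, \dots, X_n$ with density $f(x; \theta)$, take

$$Q_{n}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log f(X_{i}; \theta), \qquad Q(\theta) = -\mathbb{E}_{\theta_{0}}\left[\log f(X_{1}; \theta)\right],$$

so that, under conditions guaranteeing the uniform convergence above and uniqueness of the minimizer, the MLE is consistent for $\theta_{0}$.)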

• statistics

# Speeding up Decision Tree Training

The classic algorithm for training a decision tree for classification/regression problems (CART) is well known. The underlying algorithm works by recursively partitioning the dataset into subsets that maximize the 'clustering' of examples within each partition, where the metric used varies with the problem (for example, information gain and Gini impurity have both been used successfully in the literature).
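For example, with class proportions $p_1, \dots, p_K$ in a candidate subset, the Gini impurity is

$$G = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^{2},$$

and a split is chosen to minimize the weighted impurity of the resulting subsets (equivalently, to maximize the impurity reduction).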

For a high level overview of the algorithm, see the following snippet of Haskell code from the haskell-ml project.

Continue...

# The Ito isometry

The Itō isometry is a useful theorem in stochastic calculus that provides a fundamental tool for computing stochastic integrals - that is, integrals of the form $$\int_{0}^{\infty} f(s) \, dB_{s}$$ with $B_{s}$ a Brownian motion.
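The isometry itself states that, for a predictable integrand $f$ with $\mathbb{E}\left[\int_{0}^{t} f(s)^{2} \, ds\right] < \infty$,

$$\mathbb{E}\left[\left(\int_{0}^{t} f(s) \, dB_{s}\right)^{2}\right] = \mathbb{E}\left[\int_{0}^{t} f(s)^{2} \, ds\right].$$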

First, we'll define a predictable process.

Continue...

• statistics
• mathematics

# Purely Functional Tree Search in Haskell

Haskell is an absolute pleasure to write code in, and I've been trying to use it more and more. It's a language that rewards extended effort in a way that C++ et al. do not.

Consider the following program, which implements a basic BFS/DFS search through a tree in Haskell. It illustrates a number of useful concepts - algebraic data types, type classes, and QuickCheck.

Continue...

# A Primer on Gradient Boosted Decision Trees

Gradient boosted decision trees are an effective off-the-shelf method for generating models for classification and regression tasks.

Gradient boosting is a generic technique that can be applied to arbitrary 'underlying' weak learners - typically decision trees are used. The seminal reference is Friedman's 2001 paper that introduced the method.

Consider a typical supervised learning problem - we are given as input labeled feature vectors $(x, y)$, and seek to estimate a function $\hat F(x)$ that approximates the 'true' mapping $F^\star$, where $F^\star$ minimizes the expected loss $\mathbb{E}\left[L(y, F(x))\right]$ over some class of candidate functions $\mathcal{F}$, for a given loss function $L$.
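Gradient boosting builds $\hat F$ as a stagewise additive expansion: at stage $m$ a weak learner $h_m$ (here, a decision tree) is fit to the pseudo-residuals

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}},$$

and the model is updated as $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$, with the step size $\gamma_m$ chosen by a line search on the training loss.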

Continue...

# University of Sydney Mathematics Notes

This is a compilation of various sets of lecture notes I created during my Bachelor's degree in Mathematics at the University of Sydney. All .tex files are available at the GitHub repository.

Continue...

# Elements of Statistical Learning - Chapter 2 Solutions

The Stanford textbook Elements of Statistical Learning by Hastie, Tibshirani, and Friedman is an excellent (and freely available) graduate-level text in data mining and machine learning. I'm currently working through it, and I'm putting my (partial) exercise solutions up for anyone who might find them useful. The first set of solutions is for Chapter 2, An Overview of Supervised Learning, introducing least squares and k-nearest-neighbour techniques.

### Exercise Solutions

See the solutions in PDF format (source) for a more pleasant reading experience. This webpage was created from the LaTeX source using the LaTeX2Markdown utility - check it out on GitHub.

Continue...