Improving PyTorch inference performance on GPUs with a few simple tricks

We hear a lot these days that "machine learning systems are stuck in a rut". This is might be expanded out as "standard architectures run at high efficiencies due to disproportionate engineering effort, while theoretically better but non-standard architectures run at low efficiencies due to the lack of specialized engineering work on these alternatives and the failure of our systems to provide generalized performance."

This is a reasonable thesis. But as long as it is the status quo, we may as well enjoy the empirical efficiencies from standard architectures. That is, less "stuck-in-a-rut" thinking and more "pigs-at-a-trough" thinking! If you're going to use vanilla architectures and not play with the fanciest modern variants, you may as well take advantage of their practical efficiency.

Out of the box, PyTorch doesn't always provide the best performance for these models. There are good (and bad) reasons for this, but regardless it's generally very simple to achieve very high roofline efficiency during inference with a few simple steps.

Here, I'll just cover two quick examples: convolutional neural networks (where ResNet-101 is an exemplar rut architecture) and Transformer encoders (where BERT-Large is an exemplar rut architecture), run in float16 on NVIDIA GPUs. This is mostly focusing on the sweet spot of modern ML systems - large(ish) models, NVIDIA GPUs, half-precision, reasonably large batch sizes – and with a little work the results are great.


Cambridge Part III Mathematics Notes

I've cleaned up (somewhat) my notes from Cambridge Part III and have put them online - with LaTeX sources available on GitHub and PDFs linked below.

Advanced Financial Models

Advanced Probability

Applied Bayesian Statistics

Convex Optimization

Mathematics Of Operations Research

Non-Parametric Statistics


Ramsay Theory

Statistical Theory

Time Series and Monte Carlo Analysis

The LASSO Estimator

As far as I can tell, the LASSO estimator is the closest thing we have to a miracle in modern statistics.

The LASSO estimator is defined as a solution to the minimization problem $\frac{1}{n} \| Y - X \theta \|_2^2 + \lambda \| \theta \|_1$ over $\mathbb{R}^p$. The key insight here is that this is a convex problem in $\theta$ - this follows from both norms being convex and the sum of convex functions being convex. This allows us to design efficient solvers for this problem and thus handle large-scale problems - see, for example, ADMM, iterative shrinkage, gradient projection, etc.

The LASSO can be viewed as convex relaxation of a very natural problem in statistical estimation - finding the best $k$-sparse vector to minimize $\| Y - X \theta \| + \lambda \| \theta \|_0$, where the $L_0$ norm (indeed, not actually a norm) is to be interpreted as the number of non-zero coefficients in $\theta$. This comes from problems such as in signal processing and genomics array data where we have $p$ (the number of covariates) significantly larger than $n$, the number of observations. In this case, the usual least-squares estimation theory dating back to Gauss does not apply ($X$ cannot have full rank), and we must find other alternatives. The brute-force approach is combinatorially hard (we must check each $p \choose k$ sets of supports, which takes time exponential in $p$).

Thus, the LASSO objective can be seen as a natural convex relation of the original problem (e.g. taking $p$ from $0$ upwards and stopping as soon as $\| \theta \|_p$ is convex).

The "miracle" I refer to is the amazing result of Candes & Tao in a series of papers starting in 2005 that established that for a large class of observation matrices $X$, we have the amazing result that with very high probability, solving the LASSO problem is equivalent to solving the original combinatorially hard problem. Formally, we have the following theorem, which contains a germ of the restricted isometry property.

The Optimality of the LASSO estimator

Let $\theta_0$ be $k$-sparse with support $S_0$, and let $Y = X \theta + \epsilon$, with $\epsilon \sim N(0, \sigma^2 I_n)$. Let $\tilde \theta$ be the LASSO estimator of $(Y, X)$ with parameter $\lambda = 4 \overline \sigma \sqrt{\frac{t^2 + 2 \log p}{n}}$. Assume that $\| \tilde \theta_{S_0} - \theta_0 \|_1^2 \leq k r_0 (\tilde \theta - \theta_0)^T \hat \Sigma (\tilde \theta - \theta_0)$ with probability at least $1 - \beta$ for some $r_0$. Then with probability at least $1 - \beta - e^{-\frac{t^2}{2}}$, we have that \begin{equation} \frac{1}{n} \| X(\tilde \theta - \theta_0) \|_2^2 + \lambda \| \tilde \theta - \theta_0 \| \leq 4 k r_0 \lambda^2 \end{equation}


The proof is fairly elementary, requiring only a basic concentration of measure inequality for subgaussian random variables. The key step is recognizing that we can bound the event $\max_{j} \frac{2}{n} \| (\epsilon^T X)_j \| \geq \frac{\lambda}{2}$ on a set of measure less than $e^-\frac{t^2}{2}$. Once we have this result, we can apply the triangle inequality and the restricted isometry condition in the theorem to obtain the desired result.

A modern Emacs setup in OS X

About a year ago I switched from Vim to Emacs, and I couldn't be happier about the move. I spent some time getting a setup I was happy with, and thought I'd share it for those who are also looking to move to Emacs.

For more information, my entire .emacs.d is available in my dots repository on GitHub.


  • emacs

A Parallel Boggle Solver in Haskell

A cute interview question I've had is "given an $n \times n$ board of characters and a dictionary, find all possible words formed by a self-avoiding path in the grid". This is otherwise known as "Boggle".

A simple solution is just conducting a graph traversal of the game board - for each of the $n^2$ positions, we conduct a DFS search starting at that position, tracking the previously visited positions at each stage.

This is trivial to do in any language, but I thought it would be an interesting application of Simon Marlow's Control.Monad.Parallel Haskell library to conduct these traversals in parallel.

The full implementation is available on GitHub at


Python vs Julia - an example from machine learning

In Speeding up isotonic regression in scikit-learn, we dropped down into Cython to improve the performance of a regression algorithm. I thought it would be interesting to compare the performance of this (optimized) code in Python against the naive Julia implementation.

This article continues on from the previous one, so it may be worth reading that before continuing here to obtain the necessary background information.

We'll implement both of the algorithms for the previous article, and compare their performance in Julia against Python.


Speeding up isotonic regression in scikit-learn by 5,000x

Isotonic regression is a useful non-parametric regression technique for fitting an increasing function to a given dataset.

A classic use is in improving the calibration of a probabilistic classifier. Say we have a set of 0/1 data-points (e.g. ad clicks), and we train a probabilistic classifier on this dataset.

Unfortunately, we find that our classifier is poorly calibrated - for cases where it predicts ~50% probability of a click, there is actually a 20% probability of a click, and so on.

In this case, we can learn an isotonic regression model on the output of the classifier, where our increasing function we fit is $\mathcal{P}(+ , | , \text{classifiers prediction})$. The constraint that the function is increasing means that the ordering of events is preserved by the transformation, which is an important constraint.

With a trained isotonic regression model, our final output is the composition of the classifiers prediction with the isotonic regression function.

For an example of this usage, see the Google Ad Click Prediction - A View from the Trenches paper from KDD 2013, which covers this technique in section 7. The AdPredictor ICML paper paper also uses this technique for calibrating a Naive Bayes predictor.

We'll now detail how we made the scikit-learn implementation of isotonic regression more than ~5,000x faster, while reducing the number of lines of code in the implementation.


Stripe CTF Distributed Systems - Minimal Solutions

The Stripe CTF Distributed Systems edition has just finished, and I passed all the levels (and was one of the first fifty contestants to finish). In constructing my solutions, I thought it would be an interesting challenge to attempt to construct the minimal changes to the reference solutions that are sufficient to pass the scoring requirements.

To be clear - these aren't high-quality (or high-scoring) solutions. I'm not especially proud of these. They are just small solutions, and somewhat interesting for that reason.


The Performance of Decision Tree Evaluation Strategies

UPDATE: Compiled evaluation is now implemented for scikit-learn regression tree/ensemble models, available at or pip install sklearn-compiledtrees.

Our previous article on decision trees dealt with techniques to speed up the training process, though often the performance-critical component of the machine learning pipeline is the prediction side. Training takes place offline, whereas predictions are often in the hot path - consider ranking documents in response to a user query a-la Google, Bing, etc. Many candidate documents need to be scored as quickly as possible, and the top k results returned to the user.

Here, we'll focus on on a few methods to improve the performance of evaluating an ensemble of decision trees - encompassing random forests, gradient boosted decision trees, AdaBoost, etc.

There are three methods we'll focus on here:

  • Recursive tree walking (naive)
  • Flattening the decision tree (flattened)
  • Compiling the tree to machine code (compiled)

We'll show that choosing the right strategy can improve evaluation time by more than 2x - which can be a very significant performance improvement indeed.

All code (implementation, drivers, analysis scripts) are available on GitHub at the decisiontrees-performance repository.


Online Learning with Microsoft's AdPredictor algorithm

Online learning (as opposed to more traditional batched machine learning) is more and more commonly applied to training machine learned models at scale. The chief advantage is that the model is trained via a streaming approach, and thus the entire dataset used when training does not need to be held in memory at any given time.

That is to say, we can consider that a model parameters have a current state $\mathbf{w}$, and we observe our examples $(y, \mathbf{x})$ with ($y$ the label and $x$ the features of the given example) in a streaming fashion. At each example, we update our weights from the given example, and these weights are used as a starting point.

Microsoft's AdPredictor model from ICML 2010 is an online learning model that has been successfully applied in the context of click prediction for online advertising.

Here, we'll implement the AdPredictor algorithm (in Python), and demonstrate how online learning works via visualizations of the trained parameters $\mathbf{w}$.


A basic soft-margin kernel SVM implementation in Python

Support Vector Machines (SVMs) are a family of nice supervised learning algorithms that can train classification and regression models efficiently and with very good performance in practice.

SVMs are also rooted in convex optimization and Hilbert space theory, and there is a lot of beautiful mathematics in the derivation of various aspects of the training algorithm, which we will go into in subsequent posts.

For now, we'll just give an introduction to the basic theory of soft-margin kernel SVMs. The classical treatment is to start with hard-margin linear SVMs, then introduce the kernel trick and the soft-margin formulation, so this is somewhat faster-moving than other presentations.


Consistency of M-estimators

Let $\Theta \subseteq \mathbb{R}^{p}$ be compact. Let $Q: \Theta \rightarrow \mathbb{R}$ be a continuous, non-random function that has a unique minimizer $\theta_{0} \in \Theta$.

Let $Q_{n}: \Theta \rightarrow \mathbb{R}$ be any sequence of random functions such that

\begin{equation} \sup_{\theta \in \Theta} |Q_{n}(\theta) - Q(\theta)| \rightarrow 0 \end{equation} as $n \rightarrow \infty$.

If $\theta_{n}$ is any sequence of minimizers of $Q_{n}$, then $\hat \theta_{n} \rightarrow \theta_{0}$ in probability as $n \rightarrow \infty$.

  • statistics

Speeding up Decision Tree Training

The classic algorithm for training a decision tree for classification/regression problems (CART) is well known. The underlying algorithm acts by recursively partitioning the dataset into subsets that maximize the 'clustering' of examples in each of the partitioned subsets, where the metric used for clustering varies depending on the problem (for example, information gain, Gini loss, etc, have been used successfully in the literature).

For a high level overview of the algorithm, see the following snippet of Haskell code code from the haskell-ml project project.


The Ito isometry

The Itō isometry is a useful theorem in stochastic calculus that provides a fundamental tool in computing stochastic integrals - integrals with respect to a Brownian motion \begin{equation} \int_{0}^{\infty} f(s) dB_{s} \end{equation} with $B_{s}$ a Brownian motion.

First, we'll define a predictable process.


  • statistics
  • mathematics

Purely Functional Tree Search in Haskell

Haskell is an absolute pleasure to write code in, and I've been trying to use it more and more. It's a language that rewards extended effort in a way that C++ et. al. do not.

Consider the following program, illustrating a basic BFS/DFS search through a tree in Haskell. It illustrates a number of useful concepts - algebraic data types, type classes, and QuickCheck.


A Primer on Gradient Boosted Decision Trees

Gradient boosted decision trees are an effective off-the-shelf method for generating effective models for classification and regression tasks.

Gradient boosting is a generic technique that can be applied to arbitrary 'underlying' weak learners - typically decision trees are used. The seminal reference is Friedman's 2001 paper that introduced the method.

Consider a typical supervised learning problem - we are given as input labeled feature vectors $(x, y)$, and seek to estimate a function $\hat F(x)$ that approximates the 'true' mapping $F^\star$, with $F^\star$ minimizing the expected loss $L(y, F(x)$ over some candidate functions $\mathcal{F}$ for a loss function $L$.


University of Sydney Mathematics Notes

This is a compilation of various sets of lecture notes I created during my Bachelors degree in Mathematics at the University of Sydney. All .tex files are available at the GitHub repository.


Elements of Statistical Learning - Chapter 2 Solutions

The Stanford textbook Elements of Statistical Learning by Hastie, Tibshirani, and Friedman is an excellent (and freely available) graduate-level text in data mining and machine learning. I'm currently working through it, and I'm putting my (partial) exercise solutions up for anyone who might find them useful. The first set of solutions is for Chapter 2, An Overview of Supervised Learning, introducing least squares and k-nearest-neighbour techniques.

Exercise Solutions

See the solutions in PDF format (source) for a more pleasant reading experience. This webpage was created from the LaTeX source using the LaTeX2Markdown utility - check it out on GitHub.