Speeding up isotonic regression in scikit-learn by 5,000x
Isotonic regression is a useful non-parametric regression technique for fitting an increasing function to a given dataset.
A classic use is in improving the calibration of a probabilistic classifier. Say we have a set of 0/1 data-points (e.g. ad clicks), and we train a probabilistic classifier on this dataset.
Unfortunately, we find that our classifier is poorly calibrated - for cases where it predicts ~50% probability of a click, there is actually a 20% probability of a click, and so on.
In this case, we can learn an isotonic regression model on the output of the classifier, where the increasing function we fit is $\mathcal{P}(+ \,|\, \text{classifier's prediction})$. The constraint that the function is increasing means that the ordering of events is preserved by the transformation, which is an important property.
With a trained isotonic regression model, our final output is the composition of the classifier's prediction with the isotonic regression function.
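As a concrete (if simplified) sketch of this usage, scikit-learn's IsotonicRegression can be fit directly on held-out classifier scores against the observed 0/1 outcomes; the data and variable names below are illustrative, not taken from the papers or the library's examples.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Illustrative only: raw classifier scores on a held-out calibration set,
# together with the observed 0/1 outcomes (e.g. click / no click).
raw_scores = np.array([0.1, 0.3, 0.35, 0.5, 0.55, 0.7, 0.9])
outcomes = np.array([0, 0, 1, 0, 1, 1, 1])

# Fit an increasing map from raw score to calibrated probability.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_scores, outcomes)

# The final prediction is the composition: isotonic(classifier(x)).
new_scores = np.array([0.2, 0.6, 0.8])
calibrated = calibrator.predict(new_scores)
```

Because the fitted map is monotone, ranking by calibrated probability is identical to ranking by the raw score, which is the ordering-preservation property mentioned above.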
For an example of this usage, see the Google Ad Click Prediction - A View from the Trenches paper from KDD 2013, which covers this technique in section 7. The AdPredictor ICML paper also uses this technique for calibrating a Naive Bayes predictor.
We'll now detail how we made the scikit-learn implementation of isotonic regression roughly 5,000x faster, while reducing the number of lines of code in the implementation.
The Pooled Adjacent Violators Algorithm
The pooled adjacent violators algorithm (PAVA) is a linear-time algorithm for fitting weighted isotonic regressions to data. There is a nice visualization and explanation at Fabian Pedregosa's blog.
The algorithm sweeps through the data looking for violations of the monotonicity constraint. When it finds one, it adjusts the estimate to the best possible fit with constraints. Sometimes it also needs to modify previous points to make sure the new estimate does not violate the constraints.
See Isotone Optimization in R for a more formal introduction to PAVA and active set methods for isotone regression.
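As a rough illustration of that sweep (a minimal sketch, not the scikit-learn implementation), the textbook formulation keeps a stack of pooled blocks and merges a new point backwards into its predecessors whenever it violates monotonicity:

```python
def pava(y, w=None):
    """Minimal weighted PAVA sketch: returns the isotonic fit to y."""
    y = [float(v) for v in y]
    w = [1.0] * len(y) if w is None else [float(v) for v in w]
    # Each block holds (weighted mean, total weight, number of points pooled).
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append((yi, wi, 1))
        # Pool backwards while the newest block violates monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append(((m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2))
    # Expand each block back to one fitted value per original point.
    return [m for m, _, c in blocks for _ in range(c)]

print(pava([1, 3, 2, 4, 5, 2]))
# approximately [1.0, 2.5, 2.5, 3.67, 3.67, 3.67]
```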
Speeding up Isotonic Regression
I wrote the current scikit-learn implementation after seeing that the previous implementation was much slower than equivalent implementations in R and elsewhere.
The original algorithm used an active set method, which is in some sense mathematically dual to PAVA, and seemed to scale quite poorly (approximately $\mathcal{O}(N^2)$). This was despite being implemented in Cython, which compiles Python-like code to C and is generally very fast. After some profiling, it turned out that this slowness was due to an $\mathcal{O}(N)$ list pop in the inner loop (lines 29 and 39 of the below gist).
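The cost is easy to see in isolation: popping anywhere but the end of a Python list shifts every later element, so each pop is $\mathcal{O}(N)$. A quick, purely illustrative timing (not the original profiling):

```python
import timeit

setup = "xs = list(range(100_000))"
# pop(0) shifts all remaining elements on every call; pop() does not.
front = timeit.timeit("xs.pop(0)", setup=setup, number=50_000)
back = timeit.timeit("xs.pop()", setup=setup, number=50_000)
print(f"pop(0): {front:.3f}s   pop(): {back:.3f}s")
```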
Additionally, the implementation was not amenable to efficient machine code generation, given the large number of calls into the Python API in the generated Cython code. In the below image, white lines are translated essentially straight to C, while yellow lines call into the Python API and are thus much more expensive.
Additionally, the algorithm allocates a significant amount of memory, due to maintaining and updating the auxiliary active_sets data structure.
Thus, it seemed worthwhile to try to improve the performance of this code. Profiling (using Instruments.app on OS X) indicated that the heavy memory allocation and the list pop were the key causes of the slowdown, so an efficient Cython implementation of PAVA seemed a natural choice. PAVA can be implemented entirely in-place (so no additional memory allocations are needed) and does not require any operations beyond indexing into arrays, so it can be compiled efficiently by Cython.
Once a simple benchmarking script had been written (to quickly validate performance improvements), we set about implementing PAVA. The PAVA implementation is gisted below.
A few interesting aspects of this approach (a rough sketch in this style follows the list):
- No memory allocations are made - mutations are made in-place to the solution vector.
- The algorithm is essentially pure C with Python syntax - the type information available to the compiler means that the generated code is as efficient as a pure C implementation.
- The algorithm is essentially $\mathcal{O}(N^2)$ - we don't use the optimization of collapsing subsequences into a single point, which is required to achieve $\mathcal{O}(N)$ asymptotics. Experiments indicated this optimization didn't improve performance.
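To make those points concrete, here is a rough Python rendering of an implementation in that style (a sketch under the same constraints, not the actual gist): per-point values and weights are kept, pooling is done by scanning backwards, and the only mutation is overwriting entries of the solution vector.

```python
def pava_inplace(y, w):
    """In-place PAVA sketch: y and w are plain arrays, mutated directly."""
    n = len(y)
    i = 0
    while i < n - 1:
        if y[i] > y[i + 1]:
            # Violation: pool y[i+1] backwards until monotonicity holds.
            pooled_wy = y[i + 1] * w[i + 1]
            pooled_w = w[i + 1]
            j = i
            while j >= 0 and y[j] > pooled_wy / pooled_w:
                pooled_wy += y[j] * w[j]
                pooled_w += w[j]
                j -= 1
            mean = pooled_wy / pooled_w
            # Overwrite the pooled run with its weighted mean, in place.
            for k in range(j + 1, i + 2):
                y[k] = mean
        i += 1
    return y
```

Because pooled runs are re-scanned point by point rather than collapsed into single blocks, the worst case is $\mathcal{O}(N^2)$, but every operation is an array index or a piece of arithmetic - exactly what Cython compiles to straight C.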
The output of cython --annotate indicates that we have effectively eliminated all Python API calls, and that the implementation is translated into straightforward C.
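For reference, that annotation report can be generated by passing --annotate (or -a) to cython, or annotate=True to cythonize; a minimal build sketch (the .pyx filename here is only an assumption):

```python
from setuptools import setup
from Cython.Build import cythonize

# annotate=True writes an HTML report next to the .pyx file, coloring each
# line by how heavily it interacts with the Python C API.
setup(ext_modules=cythonize("_isotonic.pyx", annotate=True))
```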
Performance
Benchmark results indicate the simple PAVA algorithm performs much faster - approximately 5,000x faster with 1,000,000 data-points, approximately 500x faster with 100,000 data-points, and 14x faster with 1,000 data-points.
On a log-log scale, the performance improvements are visualized below for two separate datasets - a randomly perturbed version of $\log(1 + x)$, and a simulated univariate logistic regression dataset.
The Importance of Being Typed
It is also instructive to examine the effect that Cython type
information has on the performance of the PAVA implementation.
Stripping the cdef
block yields the following comparative
performance:
Thus, stripping out the type annotations makes the same Cython algorithm ~100x-500x slower.
There's a lesson here in the importance of constant factors when doing performance-critical work - and another one on understanding all stages of your tool-chain.
A ~100x-500x performance hit from not adding type information is incredibly high, and reflects both the speed of straight-line C code and the cost of falling back to the Python interpreter.
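As a small illustration of the kind of typing involved (a hypothetical Cython sketch, not the scikit-learn source): with the cdef declarations below, the loop compiles to plain C arithmetic over a typed memoryview; remove them and every index, comparison, and increment goes back through generic Python objects.

```cython
def count_violations(double[:] y):
    # Count adjacent pairs that violate monotonicity.
    cdef:
        Py_ssize_t i
        Py_ssize_t n = y.shape[0]
        long violations = 0
    for i in range(1, n):
        if y[i] < y[i - 1]:
            violations += 1
    return violations
```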
More Information
See the GitHub pull request for more information and discussion.