The LASSO Estimator

As far as I can tell, the LASSO estimator is the closest thing we have to a miracle in modern statistics.

The LASSO estimator is defined as a solution to the minimization problem $\frac{1}{n} \| Y - X \theta \|_2^2 + \lambda \| \theta \|_1$ over $\mathbb{R}^p$. The key insight here is that this is a convex problem in $\theta$ - this follows from both norms being convex and the sum of convex functions being convex. This allows us to design efficient solvers for this problem and thus handle large-scale problems - see, for example, ADMM, iterative shrinkage, gradient projection, etc.

The LASSO can be viewed as convex relaxation of a very natural problem in statistical estimation - finding the best $k$-sparse vector to minimize $\| Y - X \theta \| + \lambda \| \theta \|_0$, where the $L_0$ norm (indeed, not actually a norm) is to be interpreted as the number of non-zero coefficients in $\theta$. This comes from problems such as in signal processing and genomics array data where we have $p$ (the number of covariates) significantly larger than $n$, the number of observations. In this case, the usual least-squares estimation theory dating back to Gauss does not apply ($X$ cannot have full rank), and we must find other alternatives. The brute-force approach is combinatorially hard (we must check each $p \choose k$ sets of supports, which takes time exponential in $p$).

Thus, the LASSO objective can be seen as a natural convex relation of the original problem (e.g. taking $p$ from $0$ upwards and stopping as soon as $\| \theta \|_p$ is convex).

The "miracle" I refer to is the amazing result of Candes & Tao in a series of papers starting in 2005 that established that for a large class of observation matrices $X$, we have the amazing result that with very high probability, solving the LASSO problem is equivalent to solving the original combinatorially hard problem. Formally, we have the following theorem, which contains a germ of the restricted isometry property.

The Optimality of the LASSO estimator

Let $\theta_0$ be $k$-sparse with support $S_0$, and let $Y = X \theta + \epsilon$, with $\epsilon \sim N(0, \sigma^2 I_n)$. Let $\tilde \theta$ be the LASSO estimator of $(Y, X)$ with parameter $\lambda = 4 \overline \sigma \sqrt{\frac{t^2 + 2 \log p}{n}}$. Assume that $\| \tilde \theta_{S_0} - \theta_0 \|_1^2 \leq k r_0 (\tilde \theta - \theta_0)^T \hat \Sigma (\tilde \theta - \theta_0)$ with probability at least $1 - \beta$ for some $r_0$. Then with probability at least $1 - \beta - e^{-\frac{t^2}{2}}$, we have that \begin{equation} \frac{1}{n} \| X(\tilde \theta - \theta_0) \|_2^2 + \lambda \| \tilde \theta - \theta_0 \| \leq 4 k r_0 \lambda^2 \end{equation}


The proof is fairly elementary, requiring only a basic concentration of measure inequality for subgaussian random variables. The key step is recognizing that we can bound the event $\max_{j} \frac{2}{n} \| (\epsilon^T X)_j \| \geq \frac{\lambda}{2}$ on a set of measure less than $e^-\frac{t^2}{2}$. Once we have this result, we can apply the triangle inequality and the restricted isometry condition in the theorem to obtain the desired result.

blog comments powered by Disqus