The Quiet Revolution in Statistics

A somewhat dramatic shift in statistical methodology is under way. For regression-type models, the basic result is that Bayesian shrinkage produces lower estimation and prediction errors than maximum likelihood estimation (MLE) does. Several larger shifts are driving this:

Bayesians are becoming less Bayesian and frequentists are becoming more Bayesian. A meeting point is coming into focus.
Shrinking fitted values towards the overall mean improves predictive accuracy much like it does in credibility theory.
Goodness of fit is no longer measured by how well the model fits the data, but rather by how well the model estimated by leaving out some data predicts the omitted observations.
New estimation methods do not optimize any goodness-of-fit measures, which probably improves their predictive performance out of sample.

Bayesians becoming less Bayesian and frequentists becoming more

There is such a shift in how Bayesians work these days that I want to get rid of the terms “prior,” “posterior,” and “Bayes’ Theorem.” The way it goes now is that when you propose a model, you specify the unconditional distribution for each parameter. Then the MCMC (Markov chain Monte Carlo) estimation machinery simulates the conditional distribution of the parameters given the data, including all their correlations. What is called Bayes’ Theorem is essentially the definition of that conditional distribution, so it’s not really a theorem. And the MCMC simulation needs only the likelihood times the unconditional parameter distribution. If there are problems with the resulting fit, modelers conclude they had specified the wrong unconditional distribution and try another one, informed by how this one came out. In the old days, the prior represented what you believed before seeing the data, so it wouldn’t do to change it after you saw the implied posterior. Now it is like a frequentist looking at model residuals and deciding that the assumed distribution was not the right one. And you used to pick a prior for which the posterior had the same form – a so-called conjugate prior. Now you don’t specify anything about the conditional distribution – it just comes out whatever it is, numerically.
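To make the mechanics concrete, here is a minimal sketch of a random-walk Metropolis sampler, one of the simplest MCMC methods. It works from nothing but the likelihood times the unconditional parameter distribution. The data, the normal likelihood with known standard deviation, the normal unconditional distribution with standard deviation 2, and all function names are assumptions chosen for illustration; real MCMC software uses more efficient samplers.

```python
import math
import random

y = [1.2, 0.8, 1.9, 1.4, 0.7]   # hypothetical observations

def log_joint(mu, data, sigma=1.0, param_sd=2.0):
    """Log of (unconditional density of mu) times (likelihood), up to a constant."""
    log_density = -0.5 * (mu / param_sd) ** 2
    log_lik = sum(-0.5 * ((yi - mu) / sigma) ** 2 for yi in data)
    return log_density + log_lik

def metropolis(data, n_draws=20000, step=1.0, seed=0):
    """Simulate the conditional distribution of mu given the data."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_joint(mu, data)
    draws = []
    for _ in range(n_draws):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_joint(proposal, data)
        # Accept with probability min(1, ratio of joint densities).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        draws.append(mu)
    return draws[n_draws // 4:]   # drop the first quarter as warm-up

draws = metropolis(y)
print("simulated conditional mean:", sum(draws) / len(draws))
```

In this simple normal case the conditional distribution is also known in closed form, which gives a check on the simulation. In most models of interest no closed form exists, and the simulated draws are all you get.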

While this was going on, frequentists came out with random effects. An effect is what a Bayesian would call a parameter, but frequentists want parameters to be fixed numbers, without a distribution. And they don’t estimate effects, they project them. So it is similar to the Bayesian setup, but with different terminology, and probably with some different underlying concepts. They specify an initial unconditional distribution of the effects. Interestingly, a popular way to project the effects is to maximize the joint likelihood, which is the probability density of the effects times the likelihood of the data given the effects. That is the same quantity MCMC uses to compute the conditional distribution of the parameters. But frequentists don’t go quite that far – they just maximize the joint likelihood, which amounts to finding the mode of the conditional distribution.

So the difference comes down to this: Bayesians estimate the whole conditional distribution of the parameters given the data, and frequentists just use the mode of that distribution. Some Bayesians use the mode too, but their tendency is to use the mean. The mode is a kind of optimization of the conditional goodness of fit, and Bayesians tend to be suspicious that doing so risks responding too strongly to peculiarities of the sample that would not hold up for projection. The mean does not optimize a goodness-of-fit measure. Giving up optimizing something was the hardest step for me in adopting this new methodology.
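A small numerical sketch shows how far apart the mode and the mean can be when the conditional distribution is skewed. The setup is an assumption for illustration only: hypothetical Poisson counts, an exponential unconditional distribution with mean 1 for the Poisson rate, and evaluation on a fine grid in place of MCMC.

```python
import math

counts = [0, 1, 0, 2, 0]   # hypothetical counts

def log_joint(rate, data):
    """Log of (unconditional density of the rate) times (Poisson likelihood), up to a constant."""
    log_density = -rate                                   # exponential, mean 1
    log_lik = sum(yi * math.log(rate) - rate for yi in data)
    return log_density + log_lik

def mode_and_mean(data, step=0.001, upper=10.0):
    """Mode and mean of the conditional distribution of the rate, on a grid."""
    grid = [step * k for k in range(1, int(upper / step) + 1)]
    logs = [log_joint(r, data) for r in grid]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]           # unnormalized density
    mode = grid[logs.index(top)]                          # what maximizing the joint likelihood finds
    mean = sum(w * r for w, r in zip(weights, grid)) / sum(weights)
    return mode, mean

mode, mean = mode_and_mean(counts)
print("conditional mode:", round(mode, 3), " conditional mean:", round(mean, 3))
```

Here the conditional distribution is a gamma distribution with a long right tail, so the mean sits above the mode. Maximizing the joint likelihood returns the mode; averaging simulated draws would return the mean.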

My proposed meeting point is that the Bayesians drop the vocabulary of prior and posterior, and the frequentists accept the conditional distribution of the effects given the data, as computed by MCMC, as a legitimate thing they can use. Mode vs. mean then becomes more of a detail that could have opposing viewpoints, until it is clearly established one way or another. And they could all still argue about the meaning of probability.

Shrinkage towards the mean

Credibility improves model prediction by shrinking estimates towards the overall mean. The James-Stein estimator takes the same approach as least-squares credibility, and predates it. It has been shown to reduce estimation and prediction errors when you are estimating three or more means. In regression you are estimating a mean value for every observation, but the within and between variances needed for credibility might not be known. In 1970 the method of ridge regression was developed for this. The independent variables first get an additive and multiplicative adjustment to make them all mean zero, variance one. The coefficients and constant term account for these adjustments. Then what is minimized is the negative loglikelihood (NLL) plus a selected multiple l of the sum of squared coefficients (except for the constant). This shrinks the coefficients towards zero to some degree. Every fitted value is the constant plus a weighted sum of mean-zero variables, so shrinking the coefficients shrinks the fitted values towards the overall mean, just like in James-Stein and credibility. What was proved in 1970 is that there is always some l > 0 that gives a fit with lower error variances than MLE does.
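The shrinkage towards the overall mean can be seen in a minimal sketch with a single predictor. Several simplifications are assumed here: the data are made up, the sum of squared errors stands in for the NLL (the two differ only by a scale factor and a constant when errors are normal with a fixed variance), and the function names are illustrative.

```python
import statistics

def standardize(x):
    """Shift and scale x to mean zero, variance one."""
    mean = statistics.fmean(x)
    sd = statistics.pstdev(x)
    return [(xi - mean) / sd for xi in x]

def ridge_fit(x, y, lam):
    """Minimize squared error + lam * coef**2 for one standardized predictor.

    Returns the constant, the coefficient, and the fitted values.
    """
    z = standardize(x)
    constant = statistics.fmean(y)            # the constant is not penalized
    szy = sum(zi * (yi - constant) for zi, yi in zip(z, y))
    szz = sum(zi * zi for zi in z)
    coef = szy / (szz + lam)                  # lam = 0 gives least squares
    fitted = [constant + coef * zi for zi in z]
    return constant, coef, fitted

x = [1, 2, 3, 4, 5, 6]                        # hypothetical data
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
for lam in (0.0, 2.0, 10.0):
    constant, coef, fitted = ridge_fit(x, y, lam)
    spread = max(abs(f - constant) for f in fitted)
    print(f"l = {lam:4.1f}  coefficient = {coef:.3f}  largest distance from mean = {spread:.3f}")
```

As l grows, the coefficient moves towards zero and every fitted value moves towards the overall mean of y, while the average of the fitted values stays at that mean.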

Later, lasso changed this a bit to use the sum of the absolute values of the coefficients instead of their squares. This actually shrinks some coefficients exactly to zero, which eliminates those variables. Thus lasso became popular as both a variable selection method and a way to reduce error.
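The difference between the two penalties is easiest to see one coefficient at a time. The sketch below assumes standardized predictors that are uncorrelated with each other, so each coefficient can be solved separately, and again uses squared error in place of the NLL. Here szy is the sum of products of a predictor with the centered response and szz is the predictor’s sum of squares; the numbers are hypothetical.

```python
def ridge_coef(szy, szz, lam):
    """Minimizer of squared error + lam * b**2 for one predictor."""
    return szy / (szz + lam)

def lasso_coef(szy, szz, lam):
    """Minimizer of squared error + lam * |b| for one predictor (soft thresholding)."""
    excess = abs(szy) - lam / 2.0
    if excess <= 0.0:
        return 0.0                            # penalty outweighs the signal: variable dropped
    return (excess if szy > 0 else -excess) / szz

szz = 10.0
cross_products = {"strong": 8.0, "moderate": 3.0, "weak": -0.4}   # hypothetical szy values
lam = 2.0
for name, szy in cross_products.items():
    print(name, "ridge:", round(ridge_coef(szy, szz, lam), 3),
          "lasso:", round(lasso_coef(szy, szz, lam), 3))
```

Ridge scales every coefficient down by the same factor and never reaches zero, while lasso subtracts a fixed amount and sets the weak coefficient exactly to zero.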

Bayesian shrinkage postulates parameter distributions with mean zero. With a normal distribution, the conditional mode is the ridge regression estimate. The same thing results from random effects. Lasso comes out if you do this with the double exponential, or Laplace, distribution for the parameters. That is called Bayesian lasso.
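The link between the normal distribution and ridge regression can be checked numerically. The sketch below assumes one mean-zero predictor, normal errors with known standard deviation sigma, and a normal distribution with mean zero and standard deviation tau for the coefficient; all values are hypothetical. Under those assumptions the ridge penalty weight works out to sigma squared over tau squared.

```python
z = [-1.5, -0.5, 0.5, 1.5]      # hypothetical mean-zero predictor
y = [-2.0, -0.4, 0.9, 1.5]      # hypothetical centered response
sigma = 1.0                      # assumed known error standard deviation
tau = 0.5                        # assumed standard deviation of the coefficient

def neg_log_joint(b):
    """NLL plus negative log density of the coefficient, up to a constant."""
    nll = sum((yi - b * zi) ** 2 for zi, yi in zip(z, y)) / (2 * sigma ** 2)
    neg_log_density = b ** 2 / (2 * tau ** 2)
    return nll + neg_log_density

# Conditional mode found by brute force on a grid.
grid = [k / 1000 for k in range(-3000, 3001)]
mode = min(grid, key=neg_log_joint)

# Ridge estimate from its closed form with penalty weight sigma^2 / tau^2.
szy = sum(zi * yi for zi, yi in zip(z, y))
szz = sum(zi * zi for zi in z)
ridge = szy / (szz + sigma ** 2 / tau ** 2)

print("conditional mode:", mode, " ridge estimate:", round(ridge, 4))
```

The two numbers agree to within the grid spacing. A tighter distribution for the coefficient (smaller tau) corresponds to a larger penalty weight and so to more shrinkage.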

Goodness of fit

Measures like AIC start with how well the model represents the data, as expressed by the negative loglikelihood (NLL). Then they add a penalty for how many parameters were used. The goal is to eliminate sample bias, so they can estimate what the NLL of this model would be for a new sample. But shrunk parameters don’t produce as much sample bias as full parameters do, as they respond less strongly to the data. Thus AIC and similar measures don’t work for shrinkage.

This has been a problem for shrinkage models from the beginning. It makes it hard to know what values of l to use for a particular problem. What evolved was cross validation – leaving out some data and seeing how well it is predicted. Usually the sample is divided into groups, and the groups are left out one at a time, and the NLLs for the left-out groups are summed to be the fit measure to pick l. Eventually making each point its own omitted group came to be viewed as a good test. This is leave one out, or loo, cross validation.
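As a rough sketch of the procedure, the code below picks l for a one-predictor ridge regression by leave-one-out cross validation. It assumes simulated data, a small list of candidate values, and the sum of squared prediction errors as the fit measure, which ranks models the same way as the summed NLL when errors are normal with a fixed variance. The function names are illustrative.

```python
import random
import statistics

def fit_ridge(x, y, lam):
    """One-predictor ridge fit; returns a function that predicts y at a new x."""
    x_mean, x_sd = statistics.fmean(x), statistics.pstdev(x)
    y_mean = statistics.fmean(y)
    z = [(xi - x_mean) / x_sd for xi in x]
    szy = sum(zi * (yi - y_mean) for zi, yi in zip(z, y))
    coef = szy / (sum(zi * zi for zi in z) + lam)
    return lambda x_new: y_mean + coef * (x_new - x_mean) / x_sd

def cv_score(x, y, lam, n_groups):
    """Sum of squared prediction errors over the left-out groups."""
    total = 0.0
    for g in range(n_groups):
        keep = [i for i in range(len(y)) if i % n_groups != g]
        predict = fit_ridge([x[i] for i in keep], [y[i] for i in keep], lam)
        total += sum((y[i] - predict(x[i])) ** 2
                     for i in range(len(y)) if i % n_groups == g)
    return total

rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(20)]            # simulated predictor
y = [1.0 + 0.3 * xi + rng.gauss(0, 2.0) for xi in x]   # simulated noisy response

candidates = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]
# Setting n_groups to the sample size makes each point its own left-out group.
scores = {lam: cv_score(x, y, lam, n_groups=len(y)) for lam in candidates}
best = min(scores, key=scores.get)
print("chosen l:", best)
```

Passing a smaller n_groups, such as 5, gives the grouped version. Either way the model is refit once per group for every candidate l, which is where the estimation effort comes from.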

Of course that is a lot of estimation effort. What was eventually worked out in the Bayesian case was that you could get a reasonable estimate of a point’s likelihood in a fit that left it out by fitting all the points, and giving each point an adjusted likelihood equal to a weighted average of its likelihood in all the parameter samples, with greater weight going to the samples where it was fit worse. This derived from a numerical integration technique called importance sampling. Unfortunately, this estimate is noisy, especially for the worst-fit points. Just recently an adjustment was developed, using a Pareto distribution fit to the worst samples for a point, called Pareto-smoothed importance sampling. This worked well and is now available as an R package, loo, that works on MCMC output.
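The weighted-average idea can be sketched in a few lines. This is plain importance sampling only, without the Pareto smoothing step. It assumes a normal model for a single mean with known error standard deviation, made-up data, and parameter samples drawn directly from the known conditional distribution instead of from MCMC, so that the result can be compared with an exact refit.

```python
import math
import random

y = [1.2, 0.8, 1.9, 1.4, 0.7]   # hypothetical data
SIGMA, PARAM_SD = 1.0, 2.0      # assumed error sd and unconditional sd of the mean

def normal_pdf(value, mean, sd):
    return math.exp(-0.5 * ((value - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def conditional_mean_sd(data):
    """Exact conditional distribution of the mean given the data (normal case)."""
    precision = 1 / PARAM_SD ** 2 + len(data) / SIGMA ** 2
    return (sum(data) / SIGMA ** 2) / precision, math.sqrt(1 / precision)

def is_loo(data, draws):
    """Importance-sampling estimate of each point's left-out likelihood."""
    estimates = []
    for yi in data:
        liks = [normal_pdf(yi, mu, SIGMA) for mu in draws]
        weights = [1 / lik for lik in liks]            # worse fit gets more weight
        estimates.append(sum(w * lik for w, lik in zip(weights, liks)) / sum(weights))
    return estimates

def exact_loo(data):
    """Left-out likelihood from actually refitting without each point."""
    result = []
    for i, yi in enumerate(data):
        mean, sd = conditional_mean_sd(data[:i] + data[i + 1:])
        result.append(normal_pdf(yi, mean, math.sqrt(SIGMA ** 2 + sd ** 2)))
    return result

rng = random.Random(0)
post_mean, post_sd = conditional_mean_sd(y)
draws = [rng.gauss(post_mean, post_sd) for _ in range(20000)]
for approx, exact in zip(is_loo(y, draws), exact_loo(y)):
    print(f"importance sampling {approx:.4f}   refit {exact:.4f}")
```

In this well-behaved example the two columns agree closely from a single fit. The noise the smoothing step addresses shows up when a few samples fit a point very badly and their weights dominate the average.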

The loo goodness-of-fit measure can be computed efficiently, and so now gives Bayesian shrinkage a convenient way to compare models.

Not optimizing anything

We already saw that the conditional mean does not optimize a goodness-of-fit measure. But what about for l? The fully Bayesian approach is to also specify an initial distribution for l, and let MCMC work out its conditional distribution given the data and the rest of the model. I’ve tried this and it usually gets close to selecting the optimal l by loo, and sometimes is even a little better than the best single value of l would produce. One problem that can arise is that not allowing enough shrinkage can sometimes create convergence problems for MCMC. If the initial distribution is too tight, the conditional distribution can be concentrated near one end or the other. The original distribution assumption can be adjusted for this. Sometimes, forcing too much shrinkage can get convergence in the center of the assumed l distribution, but the fit is not so good. Trying more shrinkage ranges can address this.

In the fully Bayesian approach, then, you don’t have to try the model with a lot of values of l. You can fit directly, just like with MLE, and get a good degree of shrinkage and a goodness of fit measure. But you do have to test to make sure l is not forced to be in a suboptimal range. The estimates appear to be stable with reasonable changes in the assumed range for l.
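A toy version of this setup can be sketched with a random-walk Metropolis sampler over the coefficient and the shrinkage scale together. Everything specific here is an assumption for illustration: one mean-zero predictor, made-up data, a known error standard deviation, a normal distribution with standard deviation tau for the coefficient, and a uniform distribution on log tau over a fixed range. Putting the distribution on tau instead of directly on l is a choice made for this sketch; with these assumptions a smaller tau plays the role of a larger l.

```python
import math
import random

z = [-1.5, -0.5, 0.5, 1.5]      # hypothetical mean-zero predictor
y = [-2.0, -0.4, 0.9, 1.5]      # hypothetical centered response
SIGMA = 1.0                      # assumed known error standard deviation
LOG_TAU_RANGE = (-3.0, 1.0)      # assumed range for the log of the shrinkage scale

def log_joint(b, log_tau):
    """Log of likelihood times the unconditional distributions of b and log tau."""
    low, high = LOG_TAU_RANGE
    if not low <= log_tau <= high:
        return -math.inf                                  # outside the assumed range
    tau = math.exp(log_tau)
    log_lik = -sum((yi - b * zi) ** 2 for zi, yi in zip(z, y)) / (2 * SIGMA ** 2)
    log_density_b = -log_tau - b ** 2 / (2 * tau ** 2)    # normal(0, tau) for b
    return log_lik + log_density_b                        # uniform on log tau adds a constant

def sample(n_draws=40000, step=0.5, seed=0):
    rng = random.Random(seed)
    b, log_tau = 0.0, 0.0
    current = log_joint(b, log_tau)
    draws = []
    for _ in range(n_draws):
        new_b = b + rng.gauss(0.0, step)
        new_log_tau = log_tau + rng.gauss(0.0, step)
        candidate = log_joint(new_b, new_log_tau)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            b, log_tau, current = new_b, new_log_tau, candidate
        draws.append((b, log_tau))
    return draws[n_draws // 4:]                           # drop warm-up draws

draws = sample()
mean_b = sum(d[0] for d in draws) / len(draws)
mean_log_tau = sum(d[1] for d in draws) / len(draws)
least_squares_b = sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * zi for zi in z)
print("conditional mean of coefficient:", round(mean_b, 3))
print("least-squares coefficient:", round(least_squares_b, 3))
print("conditional mean of log tau:", round(mean_log_tau, 3))
```

One fit returns a shrunk coefficient without trying a list of penalty values. Checking whether the draws of log tau pile up against either end of the assumed range is the kind of test described above for a range that forces too much or too little shrinkage.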

With all this, there is little reason to use MLE for regression and GLM models, and their generalizations. Bayesian shrinkage gives better estimates and can be applied in a simple, direct manner.