clay_barker


Why a penalty is good in generalized regression

A penalty can seriously ruin your day. Forget to pay a bill on time, and a late penalty will cost you a few more dollars. When a yellow flag hits the football field, a penalty can cost your favorite team field position, momentum and maybe even some points. But a penalty isn't always such a bad thing. In fact, when building a regression model, a penalty is a very good thing. This post introduces the Generalized Regression platform (new in JMP Pro 11), which allows us to fit penalized regression models. We have seen how to fit similar models using an add-in, but the Generalized Regression platform provides much more functionality built into JMP Pro.

Maximum likelihood is one of the workhorses of statistics and is a popular way to estimate the parameters in a regression model. The likelihood function tells us the probability of the observed responses for a given set of parameters. So the maximum likelihood estimator gives us the regression parameters that maximize the probability of the observed data. Likelihood theory also gives us the tools for making inferences about our regression parameters. Pretty appealing, right? Maximum likelihood is extremely powerful, but it is most appropriate when we know ahead of time which predictors belong in our model.
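To make that concrete, here is a small sketch (in Python rather than JMP) of maximum likelihood in action: a simple linear regression estimated by numerically maximizing the normal log-likelihood. The data, true coefficients, and starting values are all made up for illustration.

```python
# Illustration only (not JMP): maximum likelihood for a simple linear
# regression with normal errors, maximizing the log-likelihood numerically.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=50)  # true intercept 1, slope 2

def neg_log_likelihood(params):
    intercept, slope, log_sigma = params
    mu = intercept + slope * x
    return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

fit = minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0])
print(fit.x[:2])  # the MLE of the intercept and slope
```

With normal errors, the maximum likelihood estimates of the intercept and slope coincide with the familiar least squares estimates.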

Avoiding overfitting

When we have dozens of predictors or more and don't know which to include in our model, maximum likelihood tends to overfit and is no longer such a great choice. Overfitting means that we will fit our observed data well, but we will do a poor job of predicting new observations. Penalized regression techniques can help us avoid overfitting when we are overwhelmed by a large set of predictors. By penalized regression we mean that instead of maximizing our usual likelihood function, we will optimize a function of the likelihood plus a penalty on the magnitude of the parameters. Optimizing the penalized likelihood makes intuitive sense: We want to fit the data well, but we don’t want the model to get too big. The Generalized Regression platform in JMP Pro allows us to fit three popular penalized regression models: Ridge Regression, the Lasso, and the Elastic Net. These three estimation techniques allow us to build prediction models that will predict well for new observations. The Lasso and the Elastic Net also perform variable selection so that we end up with a model that is easier to interpret.
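In symbols, instead of maximizing the log-likelihood ℓ(β) on its own, we maximize something like ℓ(β) − λ·P(β), where P(β) measures the overall size of the regression coefficients and λ controls how heavily large coefficients are penalized.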

Rather than trying to cover all of the estimation techniques available in the Generalized Regression platform, let's just focus on the Elastic Net. The Elastic Net does variable selection and parameter estimation simultaneously. And as an added bonus, it handles correlated predictors (multicollinearity) well. Large observational data sets often suffer from multicollinearity, so the Elastic Net is a good choice when working with observational data. The Elastic Net penalizes the likelihood using a mix of the L1 and L2 norms:
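λ [ α Σⱼ |βⱼ| + (1 − α) Σⱼ βⱼ² ]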

where the betas are our regression coefficients. The L1 piece of the penalty results in variable selection while the L2 piece of the penalty helps to deal with collinearity. Here alpha and lambda are called tuning parameters, and each combination of the two leads to a different set of regression parameter estimates. JMP provides several options for choosing the best combination of the tuning parameters based on goodness of fit measures like cross-validation or the BIC.
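JMP takes care of this tuning from the Model Launch dialog, but as a rough sketch of the same idea outside of JMP, scikit-learn's ElasticNetCV picks the penalty strength by cross-validation for a given mix of the two norms. The data below are simulated purely for illustration.

```python
# A minimal sketch of Elastic Net fitting with cross-validated tuning,
# using scikit-learn rather than JMP; the data are simulated.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
n, p = 100, 40
X = rng.normal(size=(n, p))                        # 40 candidate predictors
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                        # only three truly matter
y = X @ beta + rng.normal(size=n)

# l1_ratio plays the role of alpha (the L1/L2 mix); the penalty strength
# (lambda, which scikit-learn calls alpha) is chosen by cross-validation.
model = ElasticNetCV(l1_ratio=[0.5, 0.9, 1.0], cv=5).fit(X, y)

print("chosen penalty strength:", model.alpha_)
print("chosen L1/L2 mix:       ", model.l1_ratio_)
print("nonzero coefficients:   ", int(np.sum(model.coef_ != 0)))
```

The L1 part of the penalty can drive unimportant coefficients all the way to zero, which is the variable selection behavior we want.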

Hollywood movies example

Let's try out the Generalized Regression platform with an example from JMP's sample data folder. The Hollywood Movies data set gives us information about the profitability of a sample of movies from 2011. Our goal is to predict the profitability of a movie based on several factors: Rotten Tomatoes Score, Audience Score, Lead Studio (33 different studios in total) and Genre (romance, horror, comedy and so on). It seems very unlikely that all of these predictors would truly have an impact on profitability, so we want to do variable selection. And we should be concerned about Audience Score and Rotten Tomatoes Score being highly correlated, since audiences tend to agree with critics. So the Elastic Net is a perfect choice for fitting these data because it handles correlated predictors, and it does variable selection. My wife and I enjoy going to the movies and often check Rotten Tomatoes scores before choosing a movie, so this is one of my favorite data sets for showing off JMP.

To get to the Generalized Regression platform, we launch Fit Model and choose Generalized Regression from the Personality menu (as in Figure 1). After specifying our columns and completing the launch dialog, we choose the estimation and validation methods as in Figure 2. We'll stick with the defaults: Elastic Net with BIC validation.

Figure 1: Launching the Generalized Regression Platform

Figure 2: Model Launch Dialog

Figure 3 shows part of the Generalized Regression report, with the parameter estimates table sorted so that it is easier to spot the nonzero coefficients. The solution path plot gives us information about how the different variables entered the regression model. Each line in the path represents a variable in our model, and as the penalty is decreased (moving from left to right in the graph), more and more variables enter the model.

The Parameter Estimates report shows us that only three predictors have nonzero coefficients in our final model. Apparently, it was a bad year for dramas and a good year for movies released by Sony. And movies that were popular with the critics tended to be more profitable, which is probably what we would expect. We started with approximately 40 terms in our model and narrowed it down to just three. This leaves us with a model that is easy to interpret and that will predict well for new observations.

Figure 3: Partial Report for the Hollywood Movies Example

I am very excited about the Generalized Regression platform in JMP 11. Variable selection is one of the most important problems in statistics, and now JMP Pro has more powerful tools for attacking that problem. And did you notice in Figure 1 that we chose the Gamma distribution? As the name of the platform suggests, the Generalized Regression platform allows us to fit generalized linear models. That means that we can do variable selection even when the response variable is not normally distributed. For the movie data example, the gamma distribution is a good choice because we would expect profitability to have a skewed distribution. The Generalized Regression platform supports a variety of distributions on the response: normal, binomial (logistic regression), Poisson, negative binomial, and more. So the next time you have to do variable selection, try out the Elastic Net or the Lasso in the Generalized Regression platform.
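For readers who want to experiment with the same idea outside of JMP, here is a rough sketch of an elastic-net-penalized gamma GLM using statsmodels. The data are simulated, and the log link and penalty settings are arbitrary choices for the example; this is only an analogue of the JMP analysis, not a reproduction of it.

```python
# Rough non-JMP analogue: an elastic-net-penalized gamma GLM in statsmodels.
# Simulated data; only the general idea matches the JMP analysis.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, p = 200, 10
X = sm.add_constant(rng.normal(size=(n, p)))    # intercept plus 10 predictors
mu = np.exp(0.5 + 1.0 * X[:, 1])                # only one predictor matters
y = rng.gamma(shape=2.0, scale=mu / 2.0)        # skewed, strictly positive response

model = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log()))
fit = model.fit_regularized(method="elastic_net", alpha=0.05, L1_wt=0.5)
print(np.round(fit.params, 3))                  # unimportant terms shrink toward zero
```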

1 Comment

Mike Clayton wrote:

Thanks. I think the JMP Blog could use a good discussion of the "process" stories that typically generate each of the major distribution types. For example, measuring particles on glass or silicon wafers tends to give a very skewed distribution, with no negative values. However, measuring particles added or removed, with before-and-after data, has both positive and negative values. And simple Poisson or log-normal distributions may not apply in some cases for particulate data; one researcher suggested the Reynolds distribution, for example. Each process story can generate a list of likely distributions. Factory data can be mostly continuous, while service data tends to be mixed and often categorical or count data, sometimes percentages.