May 27, 2014

Overwhelmed by too many variables in predictive modeling?

In regression problems, we are often faced with choosing a set of predictor variables from what may be a very large set of candidate predictors. When building our model, we want to find a meaningful set of predictors that yields accurate predictions. Classical methods like forward selection and all-subsets regression do not always satisfy this goal.

Least Angle Regression (LARS) and penalized regression methods like the Least Absolute Shrinkage and Selection Operator (LASSO) and Elastic Net are newer, more promising approaches to variable selection. The R packages “lars” and “elasticnet” provide tools for fitting several of these newer techniques. The documentation included with the packages is a good starting point for learning more about the methods themselves. Using the new R integration functionality in JMP 9 to call these packages, we can present the results and build graphs in JMP. Check out the Penalized Regression Add-in (free SAS login required for access and download), which does just that.
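For readers who want the objective functions in front of them, the LASSO and the (naive) Elastic Net are usually written as below. The notation is an assumption for illustration ($y$ is the response, $X$ the matrix of predictors, $\beta$ the coefficients, and $\lambda$, $\lambda_1$, $\lambda_2$ the tuning parameters); the R packages may parameterize the penalty differently.

```latex
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1

\hat{\beta}^{\text{enet}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2
```

With $\lambda = 0$ the LASSO reduces to ordinary least squares. As $\lambda$ grows, the coefficients shrink and some become exactly zero, which is how the method performs variable selection. Setting $\lambda_2 = 0$ in the Elastic Net gives back the LASSO.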

When you install the Penalized Regression Add-in, you get three tools for fitting these models. They appear under the “R Add-ins” heading. “LASSO with cross-validation” uses cross-validation to choose the LASSO tuning parameter and lets you see how the parameter estimates change as you adjust it. “Penalized Regression Method Comparison” lets you compare the LASSO, LARS and Forward Stagewise solutions; the results of these three methods are sometimes nearly identical. Last but not least, “Elastic Net with cross-validation” plots the Elastic Net solution and uses cross-validation to choose the best combination of tuning parameters. The Elastic Net is tricky because it has two tuning parameters, but I found the ability to watch the solution change as a function of the tuning parameters very helpful.
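To make the idea of choosing a tuning parameter by cross-validation concrete, here is a minimal Python sketch. It is not the add-in's code and not what the R packages do internally. The function names (`soft_threshold`, `lasso_fit`, `cv_error`, `choose_lambda`), the coordinate-descent solver, the scaling of the objective, the fold assignment and the data are all assumptions made for illustration.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_fit(X, y, lam, n_iter=100):
    """Coordinate descent for (1/(2n)) * RSS + lam * sum(|b_j|).
    No intercept is fitted, so X and y are assumed to be roughly centered."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals when every coefficient is zero
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Covariance of column j with the residual that ignores column j
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_b = soft_threshold(rho, lam) / col_ss[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta

def cv_error(X, y, lam, k=5):
    """Mean squared prediction error over k folds (row i goes to fold i % k)."""
    n = len(X)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        beta = lasso_fit([X[i] for i in train], [y[i] for i in train], lam)
        for i in test:
            pred = sum(b * x for b, x in zip(beta, X[i]))
            total += (y[i] - pred) ** 2
    return total / n

def choose_lambda(X, y, grid, k=5):
    """Return the penalty in grid with the smallest cross-validated error."""
    errors = [cv_error(X, y, lam, k) for lam in grid]
    best = min(range(len(grid)), key=lambda g: errors[g])
    return grid[best], errors

# Made-up data: only the first two of five predictors affect the response.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(60)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.5) for row in X]

grid = [0.01, 0.1, 1.0, 10.0]
best_lam, errors = choose_lambda(X, y, grid)
print("CV error by penalty:", dict(zip(grid, errors)))
print("chosen penalty:", best_lam)
print("coefficients at chosen penalty:", lasso_fit(X, y, best_lam))
```

A very large penalty forces every coefficient to zero and predicts poorly, while a very small one keeps every variable, so cross-validation usually lands somewhere in between.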

Now, let’s look at an example of using the LASSO with cross-validation. Using the diabetes data set that is included with the “lars” package, we get the results in the screenshot below. With these data we are trying to model the progression of diabetes using age, gender, body mass index, blood pressure, and several blood serum measurements.

The graph looks a little strange, but it contains a lot of information. It shows us how the parameter estimates change as a function of the tuning parameter. As we move the slider to increase the tuning parameter, the parameter estimate report on the right updates so that we can see when each variable enters the model.

Body mass index is the first variable to enter the model. We also find that age is the last variable to enter, which suggests that age may not play an important role in the progression of diabetes. The add-in allows us to look at the cross-validation results, which may help us choose the tuning parameter. Once we are happy with the tuning parameter, we can save the predicted values to our data table.
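Why does one variable enter first? When the predictors are standardized, the LARS and LASSO paths start with the predictor that has the largest absolute correlation with the response. The sketch below illustrates only that first step. The helper names (`unit_center`, `first_to_enter`), the predictor names and the numbers are invented for illustration and are not the diabetes data. The order in which later variables enter depends on how the predictors relate to one another, so it cannot be read off these correlations.

```python
import math

def unit_center(values):
    """Center a column and scale it to unit length."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        raise ValueError("a constant column cannot be standardized")
    return [c / norm for c in centered]

def first_to_enter(predictors, y):
    """Return the predictor most correlated (in absolute value) with y,
    along with every predictor's correlation."""
    y_std = unit_center(y)
    corr = {
        name: sum(a * b for a, b in zip(unit_center(col), y_std))
        for name, col in predictors.items()
    }
    first = max(corr, key=lambda name: abs(corr[name]))
    return first, corr

# Made-up toy data with hypothetical predictor names.
y = [1, 2, 3, 4, 5, 6]
predictors = {
    "a": [2, 1, 4, 3, 6, 5],
    "b": [1, 2, 3, 4, 5, 7],
    "c": [3, 3, 2, 4, 3, 3],
}

first, corr = first_to_enter(predictors, y)
print("first to enter:", first)
print("correlations with the response:", corr)
```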

Community Member

Benjamin wrote:


Thanks for creating this. I've installed R and the required packages (lars and elasticnet), and verified that R recognizes the lars command. I also see your add-in in JMP 9.0.0 and the appropriate dialog comes up. However, after selecting data and running, nothing happens - the dialog just closes, and I verified that the CPU isn't up to anything. I've tried with and without R already open and after restarting JMP. Any suggestions?



Clay Barker wrote:

Hi Gabriel, this is a very good question. In general, we can't really say which technique will perform better for a particular data set. So it is probably worthwhile to try both techniques and compare the results. One potential advantage to using something like LARS or the LASSO to build your prediction model is that the resulting linear model is often easier to interpret.

Gabriel Andraos wrote:

Dear Clay. Thanks for this clear presentation. How are these techniques better than using the partition platform in JMP 9.0 for the purpose of selecting the most predictive variables?


Anders wrote:

I have the same problem as Benjamin. Almost threw my computer out the window. This add-in does not work for me. To make matters worse, there's no bloody documentation on how to get this add-in to work! Very frustrated because I have to do a logistic regression and have a huge number of variables and want to avoid Stepwise.

Clay Barker wrote:

Benjamin and Anders: I’m sorry you are having trouble. Unfortunately, the problem will be difficult to pinpoint. Because the add-in calls R, the problem could be with the R install, the R packages that are being called, or the add-in. Here are a few suggestions that may help you:

1. Make sure that there aren't missing values in your data set. Since the add-in isn't meant to be a fully functioning platform, I wasn't able to support missing values. So impute any missing values before calling the add-in.

2. For the same reasons as #1, the add-in ignores row-states (like excluded rows).

3. The add-in is only designed to do least squares. If you are interested in taking a similar approach to logistic regression, a similar interface to R could be built to call the glmnet package for fitting penalized generalized linear models.
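As a rough illustration of suggestions 1 and 2, the Python sketch below drops excluded rows and fills in missing values before the data would be handed to a fitting routine. It is not part of the add-in. The function name `prepare_rows`, the use of `None` for a missing value and the choice of column-mean imputation are all assumptions, and mean imputation is only one simple option among many.

```python
def prepare_rows(rows, excluded=()):
    """Return a cleaned copy of rows: excluded row indices are dropped,
    then each None is replaced by the mean of its column's observed values."""
    skip = set(excluded)
    kept = [list(row) for i, row in enumerate(rows) if i not in skip]
    if not kept:
        raise ValueError("no rows left after exclusion")
    for j in range(len(kept[0])):
        observed = [row[j] for row in kept if row[j] is not None]
        if not observed:
            raise ValueError(f"column {j} has no observed values to impute from")
        mean = sum(observed) / len(observed)
        for row in kept:
            if row[j] is None:
                row[j] = mean
    return kept

# Made-up data: row 3 is an outlier that we treat as excluded.
rows = [
    [1.0, None],
    [3.0, 4.0],
    [None, 8.0],
    [100.0, 100.0],
]
cleaned = prepare_rows(rows, excluded={3})
print(cleaned)  # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```

Dropping the excluded rows first matters here: otherwise the outlier would pull the imputed means toward it.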

Are there any other JMP users out there who have suggestions?