Overwhelmed by too many variables in predictive modeling?
Mar 9, 2011 3:52 PM
In regression problems, we are often faced with choosing a set of predictor variables from what may be a very large set of candidate predictors. When building our model, we want to find a meaningful set of predictors that yields accurate predictions. Classical methods like forward selection and all-subsets regression do not always satisfy this goal.
Least Angle Regression (LARS) and penalized regression methods like the Least Absolute Shrinkage and Selection Operator (LASSO) and Elastic Net are newer, more promising approaches to variable selection. The R packages “lars” and “elasticnet” provide tools for fitting several of these newer techniques. The documentation included with the packages is a good starting point for learning more about the methods themselves. Using the new R integration functionality in JMP 9 to call these packages, we can present the results and build graphs in JMP. Check out the Penalized Regression Add-in (free SAS login required for access and download), which does just that.
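For readers who want to see what "penalized" means here, the two criteria can be written in their usual textbook form. This is a sketch of the standard formulation, not a description of what the add-in or the R packages compute internally: the symbols \lambda, \lambda_1 and \lambda_2 are notation chosen for this illustration, and packages scale or re-express the tuning parameters in different ways (for example, as a fraction of the maximum L1 norm of the coefficients).

```latex
% LASSO: least squares plus an L1 penalty on the coefficients
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1

% Elastic Net (naive form): L1 and squared L2 penalties together
\hat{\beta}^{\text{enet}}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2
    + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2
```

The L1 term is what drives some coefficients exactly to zero, which is why these methods perform variable selection. Setting \lambda_2 = 0 in the Elastic Net recovers the LASSO.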
When you install the Penalized Regression Add-in, you get three tools for fitting these models, all listed under the “R Add-ins” heading. The “LASSO with cross-validation” tool uses cross-validation to choose the LASSO tuning parameter and lets you see how the parameter estimates change as you adjust it. The “Penalized Regression Method Comparison” tool lets you compare the LASSO, LARS and Forward Stagewise solutions; the results of these three methods are sometimes nearly identical. Last but not least, the “Elastic Net with cross-validation” tool plots the Elastic Net solution and uses cross-validation to choose the best combination of tuning parameters. The Elastic Net is tricky because it has two tuning parameters, but I found the ability to watch the solution change as a function of the tuning parameters very helpful.
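To make the idea of choosing two tuning parameters by cross-validation concrete, here is a minimal Python sketch. It is not the add-in's code and does not call the “elasticnet” package: it assumes a simple coordinate-descent solver, a penalty written as lam1 times the L1 norm plus lam2/2 times the squared L2 norm, synthetic data, and a small hand-picked grid. The names fit_enet, cv_error, lam1 and lam2 are all invented for this illustration.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    return max(z - t, 0.0) - max(-z - t, 0.0)


def fit_enet(X, y, lam1, lam2, sweeps=50):
    """Coordinate descent for
    (1/(2n)) * ||y - b0 - X b||^2 + lam1 * ||b||_1 + (lam2/2) * ||b||_2^2.
    Returns (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Centered columns, so the intercept can be recovered at the end.
    cols = [[row[j] - x_mean[j] for row in X] for j in range(p)]
    a = [sum(v * v for v in col) / n for col in cols]  # mean square of each column
    beta = [0.0] * p
    resid = [v - y_mean for v in y]
    for _ in range(sweeps):
        for j in range(p):
            # Covariance of column j with the residual that excludes column j.
            rho = sum(c * r for c, r in zip(cols[j], resid)) / n + a[j] * beta[j]
            new = soft_threshold(rho, lam1) / (a[j] + lam2)
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(cols[j], resid)]
                beta[j] = new
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta


def predict(model, X):
    b0, beta = model
    return [b0 + sum(b * v for b, v in zip(beta, row)) for row in X]


def cv_error(X, y, lam1, lam2, k=5):
    """Mean squared prediction error over k interleaved folds."""
    n = len(y)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        X_train = [X[i] for i in range(n) if i not in held_out]
        y_train = [y[i] for i in range(n) if i not in held_out]
        X_test = [X[i] for i in sorted(held_out)]
        y_test = [y[i] for i in sorted(held_out)]
        pred = predict(fit_enet(X_train, y_train, lam1, lam2), X_test)
        total += sum((ph - t) ** 2 for ph, t in zip(pred, y_test))
    return total / n


# Synthetic data (an assumption for illustration): only the first two of
# four predictors actually affect the response.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(60)]
y = [2.0 * row[0] - 1.0 * row[1] + rng.gauss(0, 0.5) for row in X]

# Score every (lam1, lam2) pair on a small grid and keep the best one.
grid = [(l1, l2) for l1 in (0.0, 0.05, 0.2, 1.0, 5.0) for l2 in (0.0, 0.1, 1.0)]
scores = {pair: cv_error(X, y, *pair) for pair in grid}
best_lam1, best_lam2 = min(scores, key=scores.get)
print("best (lam1, lam2):", (best_lam1, best_lam2))
print("coefficients:", fit_enet(X, y, best_lam1, best_lam2)[1])
```

The grid search is what makes the Elastic Net more work than the LASSO: every candidate pair needs its own set of cross-validated fits, so the cost grows with the product of the two grid sizes.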
Now, let’s look at an example of using the LASSO with cross-validation. Using the diabetes data set that is included with the “lars” package, we get the results in the screenshot below. With these data we are trying to model the progression of diabetes using age, gender, body mass index, blood pressure, and several blood serum measurements.
The graph looks a little strange, but it contains a lot of information. It shows us how the parameter estimates change as a function of the tuning parameter. As we move the slider to increase the tuning parameter, the parameter estimate report on the right updates so that we can see when each variable enters the model.
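The same kind of coefficient path can be traced by hand on a toy problem. The Python sketch below is an illustration only, not the add-in's or the “lars” package's algorithm: it assumes a plain coordinate-descent LASSO solver on standardized synthetic data and a penalty weight lam that starts large and shrinks, so variables enter as lam decreases (the add-in's slider may be scaled in the opposite direction). The function names are hypothetical.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    return max(z - t, 0.0) - max(-z - t, 0.0)


def standardize(X, y):
    """Center y; center each column of X and scale it to mean square 1."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    y_centered = [v - y_mean for v in y]
    cols = []
    for j in range(p):
        col = [row[j] for row in X]
        m = sum(col) / n
        col = [v - m for v in col]
        scale = (sum(v * v for v in col) / n) ** 0.5
        cols.append([v / scale for v in col])
    return cols, y_centered


def lasso_path(cols, y, n_lambdas=30, ratio=0.8, sweeps=50):
    """Solve (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 on a decreasing grid
    of lam values, warm-starting each fit from the previous solution."""
    n, p = len(y), len(cols)
    # Smallest penalty at which every coefficient is still zero.
    lam_max = max(abs(sum(c * v for c, v in zip(col, y)) / n) for col in cols)
    beta = [0.0] * p
    resid = list(y)
    path = []
    for k in range(n_lambdas):
        lam = lam_max * ratio ** k
        for _ in range(sweeps):
            for j in range(p):
                rho = sum(c * r for c, r in zip(cols[j], resid)) / n + beta[j]
                new = soft_threshold(rho, lam)
                delta = new - beta[j]
                if delta != 0.0:
                    resid = [r - c * delta for c, r in zip(cols[j], resid)]
                    beta[j] = new
        path.append((lam, list(beta)))
    return path


def entry_order(path):
    """Indices of variables in the order they first become nonzero.
    Variables entering at the same grid step are listed by index."""
    first_seen = {}
    for lam, beta in path:
        for j, b in enumerate(beta):
            if b != 0.0 and j not in first_seen:
                first_seen[j] = lam
    return sorted(first_seen, key=lambda j: -first_seen[j])


# Synthetic data (an assumption for illustration): x0 has the strongest
# effect, x1 a weaker one, and x2 none at all.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [3.0 * row[0] + 1.5 * row[1] + rng.gauss(0, 0.5) for row in X]

cols, y_centered = standardize(X, y)
path = lasso_path(cols, y_centered)
order = entry_order(path)
for lam, beta in path[:8]:
    print(round(lam, 3), [round(b, 3) for b in beta])
print("entry order:", order)
```

Each printed row corresponds to one slider position: a penalty value and the coefficient estimates at that value. Plotting the coefficient columns against lam would give a graph of the same kind as the one in the add-in.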
Body mass index is the first variable to enter the model. We also find that age is the last variable to enter, which suggests that age may not play an important role in the progression of diabetes. The add-in allows us to look at the cross-validation results, which may help us choose the tuning parameter. Once we are happy with the tuning parameter, we can save the predicted values to our data table.