Building Predictive Models for Correlated and High Dimensional Data


Multicollinearity is correlation between supposedly independent variables that is so high that, when those variables are used to estimate a dependent variable, the contribution of each one to variation in the dependent variable cannot be determined. High multicollinearity in your data set means any model you build will almost certainly be overfit unless you use techniques to mitigate the issue. Several modeling techniques improve the likelihood of building a good predictive model by limiting the effects of multicollinearity. These include Principal Components Analysis and Partial Least Squares, which are both available in JMP, and Generalized Regression, which is available in JMP Pro as a Fit Model option.
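
One common way to put a number on multicollinearity is the variance inflation factor (VIF). The webinar summary does not mention VIF, so treat the following as an illustrative sketch rather than part of the demo. It assumes only two predictors, in which case VIF reduces to 1 / (1 - r²), where r is the correlation between them. The data values and function names are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF when a model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical data: x2 is nearly a copy of x1, x3 is only weakly related to x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
x3 = [2.0, -1.0, 0.5, 3.0, -2.0, 1.0]

vif_correlated = vif_two_predictors(x1, x2)    # very large
vif_uncorrelated = vif_two_predictors(x1, x3)  # close to 1

print(f"VIF for x1 with x2: {vif_correlated:.1f}")
print(f"VIF for x1 with x3: {vif_uncorrelated:.2f}")
```

A frequently quoted rule of thumb treats VIF values above roughly 10 as a sign of problematic collinearity. That threshold is a convention, not a hard limit.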

Gen Reg Fit Model.JPG

See how to:

Perform Principal Components Analysis (PCA) to model a set of data with a small number of components, reducing dimensionality and mitigating multicollinearity
- The best data sets for PCA are tall, wide, or both
- Does not interpret variables as inputs or outputs - deals in a single matrix
- Extract linear combinations of variables that explain as much variability as possible
  - The algorithm starts with the 1st Principal Component and continues until 100% of the variation is explained. Hopefully the number of components needed is much smaller than your original number of variables
  - Each successive PC explains as much of the remaining variation as possible and is orthogonal to the loading vectors of the previously extracted PCs
- Examine and interpret Summary Plots and Eigenvalues
- Save Principal Components to the data table and perform Principal Component regression using Graph>Profiler (Reminder: check the Expand Intermediate Formulas box)
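
The extraction described in the PCA bullets above can be sketched in a few lines. This is a minimal illustration, not JMP's algorithm: it assumes PCA on the correlation matrix and uses power iteration with deflation to pull out the first two components. The data and function names are hypothetical. Here x2 is roughly twice x1, so the three variables carry only about two dimensions of information.

```python
import math

def standardize(col):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
    return [(v - mean) / sd for v in col]

def correlation_matrix(columns):
    z = [standardize(c) for c in columns]
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def top_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    size = len(matrix)
    v = [1.0 / math.sqrt(size)] * size
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(m * x for m, x in zip(row, v)) for row in matrix]
    return sum(a * b for a, b in zip(v, mv)), v

def leading_components(matrix, k):
    """Extract k components, removing each one before finding the next."""
    work = [row[:] for row in matrix]
    found = []
    for _ in range(k):
        eigenvalue, loading = top_eigenpair(work)
        found.append((eigenvalue, loading))
        for i in range(len(work)):
            for j in range(len(work)):
                work[i][j] -= eigenvalue * loading[i] * loading[j]
    return found

# Hypothetical data: x2 is roughly 2 * x1, x3 is only weakly related to either.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
x3 = [5.0, 3.0, 6.0, 2.0, 7.0, 1.0, 8.0, 4.0]

(eig1, load1), (eig2, load2) = leading_components(correlation_matrix([x1, x2, x3]), 2)
# Eigenvalues of a correlation matrix sum to the number of variables (3 here).
percent_explained = 100.0 * (eig1 + eig2) / 3
loading_dot = sum(a * b for a, b in zip(load1, load2))

print(f"Eigenvalues: {eig1:.3f}, {eig2:.3f}")
print(f"Percent of variation explained by 2 PCs: {percent_explained:.2f}")
```

Because two components capture nearly all of the variation, a regression on those two component scores avoids fitting the nearly redundant x1 and x2 separately.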
Use Partial Least Squares to fit linear models based on factors, namely, linear combinations of the explanatory variables (Xs)
- These factors are obtained in a way that attempts to maximize the covariance between the Xs and the response or responses (Ys).
- PLS exploits the correlations between the Xs and the Ys to reveal underlying latent structures
- PLS performs well in situations where the use of ordinary least squares (OLS) does not produce satisfactory results, including when there are more X variables than observations; highly correlated X variables; a large number of X variables; and several Y variables and many X variables.
- PLS can be used when the predictors outnumber the observations
- PLS is used widely in modeling high-dimensional data in areas such as spectroscopy, chemometrics, genomics, psychology, education, economics, political science, and environmental science.
- See how to use JMP PLS, where two model fitting algorithms are available: nonlinear iterative partial least squares (NIPALS) and a “statistically inspired modification of PLS” (SIMPLS)
  - For a single response, both methods give the same model; for multiple responses, there are slight differences
- PLS uses the van der Voet T2 test and cross validation to help you choose the optimal number of factors to extract.
- Interpret the Root Mean PRESS (predicted residual sum of squares) Plot, which identifies the number of factors with the lowest Root Mean PRESS value; the model with that number of factors is a candidate for the best model you'll get for the data
- Use the Variable Importance Plot (VIP) to locate predictors to remove
- Interpret Percent Variation Plots
  - The van der Voet T2 statistic tests to determine whether a model with a different number of factors differs significantly from the model with the minimum PRESS value
  - A common practice is to extract the smallest number of factors for which the van der Voet significance level exceeds 0.10
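
To make the Root Mean PRESS idea concrete, here is a minimal sketch of single-response PLS in the NIPALS style, scored by leave-one-out cross validation. It is an illustration under stated assumptions, not JMP's implementation. The data are synthetic (ten predictors driven by one latent variable, eight observations), the function names are hypothetical, leave-one-out is only one possible validation scheme, and the van der Voet test is not implemented.

```python
import math

def center(rows, y):
    n, p = len(rows), len(rows[0])
    x_mean = [sum(r[j] for r in rows) / n for j in range(p)]
    y_mean = sum(y) / n
    xc = [[r[j] - x_mean[j] for j in range(p)] for r in rows]
    return xc, [v - y_mean for v in y], x_mean, y_mean

def fit_pls1(rows, y, n_factors):
    """NIPALS-style PLS for one response; returns what predict_pls1 needs."""
    x, resid, x_mean, y_mean = center(rows, y)
    n, p = len(x), len(x_mean)
    factors = []
    for _ in range(n_factors):
        # Weight vector: direction in X space with maximal covariance with y.
        w = [sum(x[i][j] * resid[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # nothing left to explain
        w = [v / norm for v in w]
        t = [sum(xi[j] * w[j] for j in range(p)) for xi in x]       # scores
        tt = sum(v * v for v in t)
        load = [sum(x[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(resid[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next factor.
        x = [[x[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        resid = [resid[i] - q * t[i] for i in range(n)]
        factors.append((w, load, q))
    return x_mean, y_mean, factors

def predict_pls1(model, row):
    x_mean, y_mean, factors = model
    x = [v - m for v, m in zip(row, x_mean)]
    pred = y_mean
    for w, load, q in factors:
        t = sum(a * b for a, b in zip(x, w))
        pred += q * t
        x = [a - t * b for a, b in zip(x, load)]
    return pred

def root_mean_press(rows, y, n_factors):
    """Leave-one-out Root Mean PRESS for a given number of factors."""
    press = 0.0
    for i in range(len(rows)):
        model = fit_pls1(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:], n_factors)
        press += (y[i] - predict_pls1(model, rows[i])) ** 2
    return math.sqrt(press / len(rows))

# Synthetic wide data: 10 predictors, 8 observations, one latent driver.
latent = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
X = [[(0.5 + 0.1 * j) * t + 0.05 * math.sin(7 * i + 3 * j) for j in range(10)]
     for i, t in enumerate(latent)]
y = [3.0 * t + 0.05 * math.cos(5 * i) for i, t in enumerate(latent)]

rm_press = [root_mean_press(X, y, a) for a in range(4)]  # 0 to 3 factors
best_n_factors = min(range(len(rm_press)), key=lambda a: rm_press[a])

for a, value in enumerate(rm_press):
    print(f"{a} factors: Root Mean PRESS = {value:.4f}")
```

With zero factors the model predicts the training mean, so the drop in Root Mean PRESS from zero factors to one shows how much a single latent factor buys even though predictors outnumber observations.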
Use JMP Pro Generalized Regression, which offers advanced penalized regression techniques that are especially good for variable selection and for highly correlated and/or non-normally distributed data
- Build a LASSO model, interpret the results, and understand some of its limitations
- Build an Elastic Net model, interpret it for highly correlated data, and understand some of its benefits over LASSO
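
One limitation of LASSO with correlated predictors, and the corresponding benefit of Elastic Net, can be shown with a small coordinate-descent sketch. The assumptions are mine, not the webinar's: a least-squares loss with the penalty lam * (alpha * |b| + (1 - alpha) / 2 * b²) on standardized predictors, which may not match the parameterization JMP Pro uses. The data are hypothetical, and x2 is an exact copy of x1 as an extreme case of collinearity.

```python
def standardize_columns(rows):
    """Return columns centered and scaled so each has mean square 1."""
    n, p = len(rows), len(rows[0])
    cols = []
    for j in range(p):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        cols.append([(v - mean) / sd for v in col])
    return cols

def soft_threshold(value, threshold):
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def elastic_net(rows, y, lam, alpha, sweeps=500):
    """Coordinate descent; alpha=1 is LASSO, 0 < alpha < 1 is Elastic Net."""
    n = len(rows)
    cols = standardize_columns(rows)
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]
    beta = [0.0] * len(cols)
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            # Correlation of predictor j with the residual that excludes j.
            rho = sum(c * r for c, r in zip(col, resid)) / n + beta[j]
            new = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    return beta

# Hypothetical data: columns are x1, x2 (an exact copy of x1), and x3.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x3 = [5.0, 3.0, 6.0, 2.0, 7.0, 1.0, 8.0, 4.0]
rows = [[a, a, c] for a, c in zip(x1, x3)]
y = [4.6, 5.3, 9.0, 9.2, 13.4, 12.6, 17.9, 18.0]  # roughly 2*x1 + 0.5*x3

lasso_beta = elastic_net(rows, y, lam=0.5, alpha=1.0)
enet_beta = elastic_net(rows, y, lam=0.5, alpha=0.5)

print("LASSO coefficients:      ", [round(b, 4) for b in lasso_beta])
print("Elastic Net coefficients:", [round(b, 4) for b in enet_beta])
```

In this sketch LASSO puts all of the shared signal on whichever duplicate it updates first and leaves the other at essentially zero, so the selection is arbitrary. The ridge part of the Elastic Net penalty makes the solution unique and splits the weight evenly between the two copies.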
Use JMP Pro Generalized Regression for high-dimensional spectral data
- Examine and interpret PCA Model Driven Multivariate Control Charts, PCA on Correlations Summary Plots and Eigenvalues
- Examine and interpret PLS results
- Interpret Elastic Net Generalized Regression including Generalized R-squared
Use JMP Pro Generalized Regression for very wide data and interpret results

Note: the attached .jmpprj file includes scripts to run the analyses from the demo.

Questions answered by Bill Worley @Bill_Worley and Scott Allen @scott_allen at the live webinar:

Q: Is Pred Formula Y1 from PCA regression?

A: Yes, the prediction formulas for Y1 and Y2 were generated by regression analysis of the principal components. When you build the profiler with these prediction formulas, you can check a box that allows you to see the individual effects instead of the principal components.

Q: What would be your main criteria for eliminating variables in the PCA example?

A: See video below for one approach.

(view in My Videos)

Q: In PLS, is it necessary to specify interactions or are the relationships among variables accommodated in the underlying model (e.g., NIPALS)?

A: For PLS, you must specify the model effects you want to test, such as interactions or other terms. In standard JMP you can run PLS analysis with main effects only. In JMP Pro you can run PLS with other model effects, like interactions or polynomials, by using the PLS personality in Fit Model.

Q: Is the gluten dataset also associated with a lot of multi-collinearity?

A: Yes. Each wavelength is dependent on the next, so there is lots of collinearity from wavelength to wavelength.
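
The wavelength-to-wavelength collinearity can be illustrated with synthetic spectra. This sketch does not use the gluten data. It assumes each spectrum is a mix of two broad, smooth peaks with hypothetical amplitudes, then computes the correlation between each pair of adjacent wavelength columns.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def absorbance(amp_a, amp_b, wavelength):
    """Two broad Gaussian-shaped peaks; purely synthetic."""
    peak_a = math.exp(-((wavelength - 50) / 40.0) ** 2)
    peak_b = math.exp(-((wavelength - 70) / 50.0) ** 2)
    return amp_a * peak_a + amp_b * peak_b

# Hypothetical peak amplitudes for six samples.
amps_a = [1.0, 1.2, 0.8, 1.5, 0.9, 1.1]
amps_b = [0.5, 0.3, 0.9, 0.4, 0.7, 0.6]
wavelengths = list(range(40, 81))

spectra = [[absorbance(a, b, w) for w in wavelengths] for a, b in zip(amps_a, amps_b)]

def column(k):
    return [spectrum[k] for spectrum in spectra]

adjacent_r = [pearson_r(column(k), column(k + 1)) for k in range(len(wavelengths) - 1)]
min_adjacent_r = min(adjacent_r)

print(f"Smallest correlation between adjacent wavelengths: {min_adjacent_r:.4f}")
```

Because the underlying peaks are smooth, neighboring wavelengths carry almost the same information, which is why methods such as PCA, PLS, and penalized regression suit spectral data.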

Q: Is it possible to say if PLS or Generalized Regression is generally superior with highly auto-correlated data or is it simply dataset-dependent? I was kind of thinking PLS was better, but it sounds like not necessarily.

A: It depends on the situation, which can be somewhat data dependent. Both give good models that are very close to each other.

Q: Which of these modelling options would be best for inverse prediction?

A: With wide data, you might have trouble getting the right settings. All options would be ok for inverse prediction.

Q: Could you go over how you saved the PCA formulas to the table. I only know how to save the coordinates themselves.

A: From the red triangle menu, select Save Formula to Data Table.

Q: If I use the PLS model for optimization with Prediction Profiler, will the optimization respect the correlation among the variables, or will the correlation among the variables be taken care of if extrapolation control is turned on?

A: In PLS, you can do Variable Selection and then use the results to build a model with the important variables. Then, use Make Model with VIP and rerun. If you save these models to the data table, they are big and have every variable in them.

Q: Would Model Screening be useful?

A: Yes. It will take time. Select Additional Methods. Be aware that selecting it means the platform will go through Elastic Net. It will also go through Ridge Regression and will take time if you have K-folds and cross validation. And, in Generalized Regression and PLS Model Screening, you can add 2-way interactions and it will fit with and without the 2-way interactions.

Additional Methods.JPG

Resources

JMP documentation on Generalized Regression modeling
JMP documentation and example sequence for using Principal Components to reduce the dimensionality of your data. The purpose is to derive a small number of independent linear combinations (principal components) of a set of measured variables that capture as much of the variability in the original variables as possible. Principal component analysis is a dimension-reduction technique, as well as an exploratory data analysis tool.
JMP documentation and example sequence for using Partial Least Squares to develop models using correlations between Ys and Xs. PLS fits linear models based on factors, namely, linear combinations of the explanatory variables (Xs). These factors are obtained in a way that attempts to maximize the covariance between the Xs and the response or responses (Ys). PLS exploits the correlations between the Xs and the Ys to reveal underlying latent structures.
