Video was updated in August 2024.
Multicollinearity is a degree of correlation among the supposedly independent variables used to estimate a dependent variable so high that the contribution of each independent variable to variation in the dependent variable cannot be determined. High multicollinearity in your data set means any model you build will almost certainly be overfit unless you use techniques to mitigate the issue. Several modeling techniques can improve the likelihood of building a good predictive model by limiting multicollinearity. These include Principal Components Analysis and Partial Least Squares, which are both available in JMP, and Generalized Regression, which is available in JMP Pro as a Fit Model option.
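As a quick illustration of the problem (not from the webinar), the following Python sketch simulates two nearly identical predictors on synthetic data. The individual least squares coefficients swing widely from one random draw to the next, while their combined effect stays near the true value of 5, which is exactly the "cannot determine each variable's contribution" symptom described above.

```python
# Illustrative sketch (synthetic data): two nearly identical predictors make
# the individual OLS coefficients unstable from draw to draw, even though
# their sum (the combined effect, truly 5) stays stable.
import numpy as np

rng = np.random.default_rng(0)
for draw in range(3):
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.01, size=200)   # x2 is almost a copy of x1
    y = 3 * x1 + 2 * x2 + rng.normal(size=200)   # true combined effect is 5
    X = np.column_stack([np.ones(200), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"draw {draw}: b1 = {beta[1]:6.2f}, b2 = {beta[2]:6.2f}, "
          f"b1 + b2 = {beta[1] + beta[2]:5.2f}")
```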
See how to:
- Perform Principal Components Analysis (PCA) to model a set of data with a small number of components, capturing most of the variability while mitigating multicollinearity
- PCA compares variables to each other, not to the output
- PCA extracts linear combinations of the variables that explain as much of the variability as possible, reducing the number of important variables to consider
- The best data sets for PCA are either tall or wide, and can be both
- PCA does not interpret variables as inputs or outputs; it works on a single data matrix
- Extract linear combinations of variables that explain as much variability as possible
- Extraction starts with the 1st principal component, and the algorithm continues until 100% of the variation is explained; ideally, the number of components you need to keep is much smaller than your original set of variables
- Each successive PC explains as much of the remaining variation as possible and is orthogonal to the loading vectors of the previously extracted components
- Examine and interpret Summary Plots and eigenvalues
- Save principal components to the data table and perform principal component regression using Graph > Profiler (reminder: check the Expand Intermediate Formulas box); a sketch of principal component regression follows this list
- Use Partial Least Squares to fit linear models based on factors, namely, linear combinations of the explanatory variables (Xs)
- These factors are obtained in a way that attempts to maximize the covariance between the Xs and the response or responses (Ys).
- PLS exploits the correlations between the Xs and the Ys to reveal underlying latent structures
- PLS performs well in situations where the use of ordinary least squares (OLS) does not produce satisfactory results, including when there are more X variables than observations; highly correlated X variables; a large number of X variables; and several Y variables and many X variables.
- PLS can be used when the predictors outnumber the observations
- PLS is used widely in modeling high-dimensional data in areas such as spectroscopy, chemometrics, genomics, psychology, education, economics, political science, and environmental science.
- See how to use the two model-fitting algorithms available in JMP PLS: nonlinear iterative partial least squares (NIPALS) and a “statistically inspired modification of PLS” (SIMPLS)
- For a single response, both methods give the same model; for multiple responses, there are slight differences
- PLS uses the van der Voet T2 test and cross validation to help you choose the optimal number of factors to extract.
- Interpret the Root Mean PRESS (predicted residual sum of squares) plot, which shows the number of factors with the lowest PRESS value; that model is at least one of the better models you'll get for the data (see the PLS cross-validation sketch after this list)
- Use the Variable Importance Plot (VIP) to locate variables that could be removed
- Interpret Percent Variation Plots
- The van der Voet T2 statistic tests whether a model with a different number of factors differs significantly from the model with the minimum PRESS value
- A common practice is to extract the smallest number of factors for which the van der Voet significance level exceeds 0.10
- Use JMP Pro Generalized Regression, whose advanced penalized regression techniques are especially good for variable selection and for highly correlated and/or non-normally distributed data
- Build a LASSO model, interpret the results, and understand some of its limitations
- Build an Elastic Net model for highly correlated data, interpret the results, and understand some of its benefits over LASSO (see the LASSO vs. Elastic Net sketch after this list)
- Use JMP Pro Generalized Regression for high-dimensional spectral data
- Examine and interpret PCA Model Driven Multivariate Control Charts, PCA on Correlations Summary Plots, and eigenvalues
- Examine and interpret PLS results
- Interpret Elastic Net Generalized Regression results, including Generalized R-squared
- Use JMP Pro Generalized Regression for very wide data and interpret results
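As a rough analogue of the principal component regression workflow in the PCA bullets above, here is a hedged Python/scikit-learn sketch on synthetic data (the data set, the choice of 3 components, and all names are illustrative assumptions, not the webinar example or JMP output):

```python
# Hedged sketch of principal component regression on synthetic data:
# standardize, extract a few principal components, then regress the
# response on those components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p = 100, 12
latent = rng.normal(size=(n, 3))                      # 3 hidden drivers
X = latent @ rng.normal(size=(3, p)) + 0.1 * rng.normal(size=(n, p))
y = latent @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=n)

pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
pca = pcr.named_steps["pca"]
print("variance explained by each PC:", pca.explained_variance_ratio_.round(3))
print("R^2 of the 3-component regression:", round(pcr.score(X, y), 3))
```

In JMP, the equivalent steps are saving the principal components to the data table and regressing the response on them, as described in the list.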
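Likewise, a minimal PLS sketch using scikit-learn's NIPALS-based PLSRegression on synthetic wide data, with cross-validated prediction error standing in for the Root Mean PRESS plot; it does not reproduce JMP's van der Voet T2 test, and every name and value here is an assumption for illustration:

```python
# Hedged PLS sketch on synthetic wide data (more Xs than rows).  Pick the
# smallest number of factors near the minimum cross-validated error.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
n, p = 40, 100                                   # wide data: 40 rows, 100 Xs
latent = rng.normal(size=(n, 2))                 # 2 true latent factors
X = latent @ rng.normal(size=(2, p)) + 0.05 * rng.normal(size=(n, p))
y = latent[:, 0] - 0.5 * latent[:, 1] + rng.normal(scale=0.1, size=n)

for a in range(1, 6):
    pred = cross_val_predict(PLSRegression(n_components=a), X, y, cv=5)
    root_mean_press = np.sqrt(np.mean((y - pred.ravel()) ** 2))
    print(f"{a} factor(s): root mean PRESS (cross-validated) = {root_mean_press:.3f}")
```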
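Finally, a hedged sketch contrasting LASSO and Elastic Net on correlated predictors, using scikit-learn estimators rather than JMP Pro's Generalized Regression; the data are synthetic, and the behavior noted in the comments is a general tendency, not a guarantee:

```python
# Hedged sketch: LASSO vs. Elastic Net on correlated predictors (synthetic
# data).  With a nearly duplicate pair, LASSO tends to zero out one of the
# two, while Elastic Net tends to split the effect between them.
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)              # nearly a duplicate of x1
noise = rng.normal(size=(n, 5))                  # irrelevant extra predictors
X = np.column_stack([x1, x2, noise])
y = 2 * x1 + 2 * x2 + rng.normal(scale=0.5, size=n)

lasso = LassoCV(cv=5).fit(X, y)
enet = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)
print("LASSO coefficients:      ", lasso.coef_.round(2))
print("Elastic Net coefficients:", enet.coef_.round(2))
```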
Note: the attached .jmpprj file includes scripts to run the analyses from the demo.
Questions answered by Bill Worley @Bill_Worley and Scott Allen @scott_allen at the live webinar:
Q: What did he mean when he said that preferred data for PCA is tall and wide?
A: You can think of "tall" data as a data table that has many rows and few columns; "wide" data has many columns and fewer rows (fewer observations). PLS, on the other hand, is often used when the number of factors is more than the number of observations.
Q: If you have highly correlated variables and one of the algorithms is suggesting one over another, how do you select which variable to keep, or which one not to use?
A: The Variance Inflation Factor (VIF) is a way of finding which variables are highly correlated; it indicates how strongly each variable is correlated with the other variables. Removing one of a set of highly correlated variables will change the VIFs of the others. You can then use Multivariate Analysis to further understand which variables are highly correlated. See video:
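A minimal VIF computation outside JMP might look like the sketch below (statsmodels and pandas on a made-up three-variable table; the column names and the rule of thumb in the comment are assumptions for illustration):

```python
# Hedged VIF sketch (synthetic data): each VIF reflects how well that
# predictor is explained by the other predictors; values well above
# roughly 5-10 are a common flag for strong multicollinearity.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
df = pd.DataFrame({"x1": rng.normal(size=100)})
df["x2"] = df["x1"] + 0.05 * rng.normal(size=100)   # highly correlated with x1
df["x3"] = rng.normal(size=100)                     # independent predictor

X = np.column_stack([np.ones(len(df)), df.values])  # intercept plus predictors
for i, name in enumerate(df.columns, start=1):
    print(f"VIF({name}) = {variance_inflation_factor(X, i):.1f}")
```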
Q: Is it better to use BIC over AIC in Generalized Regressions?
A: It depends on the goal of your model. BIC penalizes models with more parameters more heavily than AICc does, so it usually leads to models with fewer parameters.
Q: Does the user have the option to choose which model to use rather than the model JMP selected?
A: You can run Model Comparison and examine the best validation or test results. You can also remove the Elastic Net Alpha value, and JMP will go through three options to show you the best of the three models. It may take a while to run these if you have a lot of data.
Q: How would you use these techniques for time series data?
A: Time series data is best analyzed using the Time Series or Functional Data Explorer (FDE) tools in JMP. See information on Time Series and a video on analyzing curved data.
Questions answered at previous webinars on this topic:
Q: Is Pred Formula Y1 from PCA regression?
A: Yes, the prediction formulas for Y1 and Y2 were generated by regression analysis of the principal components. When you build the profiler with these prediction formulas, you can check a box that allows you to see the individual effects instead of the principal components.
Q: What would be your main criteria for eliminating variables in the PCA example?
A: See video below for one approach.
Q: In PLS, is it necessary to specify interactions or are the relationships among variables accommodated in the underlying model (e.g., NIPALS)?
A: For PLS, you must specify the model effects you want to test, such as interactions or other terms. In standard JMP you can run a PLS analysis with main effects only. In JMP Pro you can run PLS with other model effects, like interactions or polynomials, by using the PLS personality in Fit Model.
Q: Is the gluten dataset also associated with a lot of multi-collinearity?
A: Yes. Each wavelength is dependent on the next, so there is lots of collinearity from wavelength to wavelength.
Q: Is it possible to say if PLS or Generalized Regression is generally superior with highly auto-correlated data, or is it simply dataset-dependent? I was kind of thinking PLS was better, but it sounds like not necessarily.
A: It depends on the situation, which can be somewhat data dependent. Both give good models very close to each other.
Q: Which of these modelling options would be best for inverse prediction?
A: With wide data, you might have trouble getting the right settings. All options would be ok for inverse prediction.
Q: Could you go over how you saved the PCA formulas to the table. I only know how to save the coordinates themselves.
A: From the red triangle menu, select Save Formula to Data Table.
Q: If I use the PLS model for optimization with the Prediction Profiler, will the optimization respect the correlation among the variables? Or, if extrapolation control is turned on, will the correlation among the variables be taken care of?
A: In PLS, you can do variable selection and then use the results to build a model with the important variables: use Make Model with VIP and rerun. If you save these models to the data table, they are large and include every variable.
Q: Would Model Screening be useful?
A: Yes, but it will take some time. Select Additional Methods; be aware that by selecting that, it will go through Elastic Net and also Ridge Regression, which takes longer if you use K-folds and cross validation. Also, in Generalized Regression and PLS Model Screening, you can add 2-way interactions, and it will fit models with and without the 2-way interactions.
Resources
- JMP documentation on Generalized Regression modeling
- JMP documentation and example sequence for using Principal Components to reduce the dimensionality of your data. The purpose is to derive a small number of independent linear combinations (principal components) of a set of measured variables that capture as much of the variability in the original variables as possible. Principal component analysis is a dimension-reduction technique, as well as an exploratory data analysis tool.
- JMP documentation and example sequence for using Partial Least Squares to develop models using correlations between Ys and Xs. PLS fits linear models based on factors, namely, linear combinations of the explanatory variables (Xs). These factors are obtained in a way that attempts to maximize the covariance between the Xs and the response or responses (Ys). PLS exploits the correlations between the Xs and the Ys to reveal underlying latent structures.