Hi @mjz5448,
Just to give some explanations about the differences between LASSO and Ridge (Elastic Net being a mix of the LASSO and Ridge penalties):
- LASSO can set the coefficients of features it does not consider "interesting" exactly to zero, so the model performs some automatic feature selection on its own. This reduces model complexity and helps avoid overfitting. However, the penalty biases the coefficients (shrinkage), and LASSO regression can be unstable when trained on data with correlated features: one of the features gets selected somewhat arbitrarily, and all of the other features highly correlated with it are effectively dropped from the model. This may lead someone to erroneously conclude that only the retained feature is important, when in reality some of the dropped features may be just as important or even more so.
LASSO is robust to outliers and tends to be most effective when there is a small subset of variables with a strong effect on the response among many other variables with small or no effects.
- Ridge shrinks the coefficients of correlated features toward zero (close to 0, but never exactly 0). This makes Ridge regression usable on datasets with many correlated features, since their negative impact is minimized, and it also helps reduce overfitting. As a penalty is introduced, you'll also get biased coefficient estimates.
Ridge keeps all features in the model and tends to be most effective when there is a large number of variables with large and comparable effects.
- Elastic Net combines both penalties: it shrinks some coefficients exactly to 0 and minimizes others, so it can be a good compromise when you want feature selection as well as sensible handling of correlated features with similar importance on the response (the small code sketch below illustrates all three behaviours).
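If you want to see these three behaviours outside of JMP, here is a minimal scikit-learn sketch on simulated data; the data, penalty strengths, and feature setup are made up purely for illustration:

```python
# Illustration of the three penalties on simulated data with two highly
# correlated predictors (x1, x2) and several pure-noise predictors.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # nearly a copy of x1
noise = rng.normal(size=(n, 4))          # irrelevant features
X = StandardScaler().fit_transform(np.column_stack([x1, x2, noise]))
y = 3 * x1 + rng.normal(size=n)          # only x1 truly drives the response

for model in (Lasso(alpha=0.1), Ridge(alpha=1.0),
              ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
# Typically: LASSO keeps one of x1/x2 and zeroes the rest; Ridge spreads
# the effect across x1 and x2 with small coefficients on the noise terms;
# Elastic Net sits in between.
```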
You can learn more about Generalized (and penalized) regression models in this Mastering JMP session: Using Generalized Regression in JMP® Pro to Create Robust Linear Models
For feature selection, I would highly recommend looking at feature importances from a Random Forest: this model doesn't require extensive fine-tuning, handles correlated features, and is relatively robust to overfitting, so it can provide a good benchmark for comparison.
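As a rough sketch of that benchmark idea outside of JMP (the dataset here is a synthetic stand-in for yours):

```python
# Rank features by Random Forest importance on a synthetic dataset.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       random_state=0)  # stand-in for your own data
forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

# Print features from most to least important.
for idx in np.argsort(forest.feature_importances_)[::-1]:
    print(f"feature {idx}: importance {forest.feature_importances_[idx]:.3f}")
```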
I'm not entirely sure what your objective is with this model, since you mention validation data: system understanding (explanatory objective) or prediction?
Is it validation data (used for model optimization, e.g. hyperparameter fine-tuning and feature/threshold selection, and for model selection), or test data (used to assess the generalization and predictive performance of the selected model on new/unseen data)?
- In the first case (validation data), you could use this dataset as your validation method in the Generalized Regression launch panel, to make sure the penalty is correctly set and fixed for your data, and compare the different models on this validation data: penalized regression methods, Random Forest, ...
- In the second case (test data), I would recommend not touching this dataset until you have compared the models and selected the most promising one. Depending on your main objective (system understanding or prediction) and your validation method (an information criterion like AICc or BIC, or K-fold/holdback/leave-one-out validation if you are interested in predictive performance), you can compare the models' outcomes on your training/validation data and select the most interesting model. Once it is chosen, you can confirm its generalization and predictive performance on the test set (unseen data).
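For what it's worth, here is a sketch of that train/validation/test workflow in code (in JMP you would do this with a validation column in the launch panel; the data, candidate models, and penalty values below are hypothetical):

```python
# Sketch of the workflow: tune/compare on validation, confirm once on test.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=5.0,
                       random_state=0)  # placeholder data

# 60 % train, 20 % validation, 20 % test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4,
                                                  random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                                random_state=0)

# Use the validation set to tune the penalty and compare candidate models.
candidates = {}
for alpha in (0.01, 0.1, 1.0):
    enet = ElasticNet(alpha=alpha, l1_ratio=0.5).fit(X_train, y_train)
    candidates[f"ElasticNet(alpha={alpha})"] = enet
candidates["RandomForest"] = RandomForestRegressor(
    n_estimators=300, random_state=0).fit(X_train, y_train)

best_name = max(candidates,
                key=lambda k: r2_score(y_val, candidates[k].predict(X_val)))
best = candidates[best_name]

# Only the selected model touches the test set, once, at the very end.
print(best_name, "test R2:",
      round(r2_score(y_test, best.predict(X_test)), 3))
```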
Hope this answer helps,
Victor GUILLER
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)