Solved: Variable selection and Bootstrap Forest

Lu · Jun 8, 2023 5:19 PM

Hello,

I performed a Bootstrap Forest analysis with a categorical outcome and 230 features, many of which are highly correlated. There is a huge difference in performance between my training set (AUC 0.98) and my validation set (AUC 0.68). Since the data are high-dimensional and many of the variables are correlated, variable reduction is needed. Can I use GenReg with Lasso or other penalized techniques for this purpose? My data are not really normally distributed, which is one of the reasons I preferred to use a tree model.

Thanks in advance,

Lu

Mark_Bailey · Aug 14, 2020 09:51 AM

You could also use the Predictor Screening platform. It uses a bootstrap forest to collect the variable contributions, which can be used to select variables for a model. The bootstrap forest should avoid collinearity for the most part.

P_Bartell · Aug 14, 2020 5:53 AM

Lasso and Elastic Net methods can be useful when there is multicollinearity among the predictor variables. What's not 100% clear to me is when you say, "...230 features...", are these categorical levels of the response? Or your predictor variable set?

You may want to read the JMP Pro documentation regarding these methods:

https://www.jmp.com/support/help/en/15.2/#page/jmp/overview-of-the-generalized-regression-personalit...
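To illustrate why the Lasso is often suggested for correlated predictors, here is a minimal sketch in plain Python (not JMP, and not the Generalized Regression implementation). It runs coordinate descent on a squared-error Lasso; a categorical response would need the logistic form instead. The function names, the toy data, and the penalty value 0.2 are all assumptions made up for the illustration.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=50):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Assumes the columns of X and the response y are already centred.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # covariance of feature j with the partial residual
            z = 0.0    # sum of squares of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta


# Hypothetical toy data: column 1 drives the response, column 2 is
# correlated with column 1 (r = 0.8), column 3 is unrelated to both.
X = [
    [-1.0, -0.2, -1.0],
    [-1.0, -1.4, 1.0],
    [1.0, 0.2, -1.0],
    [1.0, 1.4, 1.0],
]
y = [-2.0, -2.0, 2.0, 2.0]

beta = lasso_coordinate_descent(X, y, lam=0.2)
print(beta)
```

In this toy case the penalty keeps the first predictor (with some shrinkage) and sets the correlated and unrelated predictors to exactly zero, which is the variable selection behaviour being discussed. With real data, which member of a correlated group survives can depend on the penalty and the sample.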

You may also want to try out Partial Least Squares.

You don't mention the size of your dataset. With a categorical response sometimes a small dataset can be very sensitive to the structure of the training, validation, and optionally, test data sets. Did you stratify each by the levels of the response? This is a generally recognized best practice for creating the training, validation and test datasets.

Lu · Aug 14, 2020 09:34 AM

Thanks for the response,

The dataset has 1050 cases with one categorical Y (slightly unbalanced, 45/55). The 230 features I mentioned are variables (X), not the response. Some of them are categorical; most are numerical. I used the stratified training/validation/test method (60/20/20) for model training and validation.

What is the advantage of PLS compared to Lasso and Elastic Net?

Regards,

Lu

P_Bartell · Aug 14, 2020 10:25 AM

@Lu : When I spoke of stratifying the response across the levels of the categorical response I was not referring to the % splits for the size of the training, validation, test data sets. What I'm recommending is when you select the response column as the stratification column, then under the 'Select Method' window, select "Stratified Validation Column". This forces as close as you can get to an equal proportion of each level of the categorical response within the training, validation, and optionally, test sets. Sometimes the gremlins of pure random selection will create a significant IMBALANCE in the levels across the response making model validation potentially more difficult. Here's some more documentation:

https://www.jmp.com/support/help/en/15.2/#page/jmp/launch-the-make-validation-column-platform.shtml
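The idea of stratifying on the response can be sketched outside JMP as well. The following plain-Python example is only an illustration of the concept, not what the Make Validation Column platform does internally. The function name, the seed, and the class counts (472 and 578, chosen to mimic a roughly 45/55 split of 1050 cases) are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split row indices into train/validation/test, level by level.

    Each response level is shuffled and cut separately, so every set
    ends up with nearly the same proportion of each level.
    """
    rng = random.Random(seed)
    rows_by_level = defaultdict(list)
    for idx, level in enumerate(labels):
        rows_by_level[level].append(idx)

    train, valid, test = [], [], []
    for level in sorted(rows_by_level):
        rows = rows_by_level[level]
        rng.shuffle(rows)
        n_train = round(fractions[0] * len(rows))
        n_valid = round(fractions[1] * len(rows))
        train.extend(rows[:n_train])
        valid.extend(rows[n_train:n_train + n_valid])
        test.extend(rows[n_train + n_valid:])  # remainder goes to the test set
    return train, valid, test


# Hypothetical response column: 472 cases of level "A", 578 of level "B".
labels = ["A"] * 472 + ["B"] * 578
train, valid, test = stratified_split(labels)

for name, part in (("train", train), ("valid", valid), ("test", test)):
    share_a = sum(1 for i in part if labels[i] == "A") / len(part)
    print(name, len(part), round(share_a, 3))
```

A purely random 60/20/20 split would match these set sizes only on average, and the share of each level could drift between sets. Splitting within each level removes that source of variation.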

As for the advantages of partial least squares, well, I'm not sure there are any clear-cut ones. But one thing PLS is VERY good at is creating latent variables from your multicollinear predictors, which can in turn be used for modeling purposes. So it's just a different mathematical treatment of the predictors compared to Lasso, Elastic Net, or any of the tree-based methods. See the JMP documentation for this approach:

https://www.jmp.com/support/help/en/15.2/#page/jmp/partial-least-squares-models.shtml
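As a rough illustration of what "creating latent variables" means, the sketch below computes only the first PLS component for a single response: the weight vector is the normalised X'y, and the latent scores are X times that weight. It is written in plain Python, assumes centred columns, and codes a two-level response as -1/+1. The function name and the toy data are hypothetical, and JMP's PLS platform does considerably more than this (multiple components, deflation, validation).

```python
import math


def pls_first_component(X, y):
    """Return (weights, scores) of the first PLS component.

    The weights point in the direction of maximum covariance between
    the predictors and the response: w = X'y / ||X'y||.
    Assumes the columns of X and the response y are centred.
    """
    n, p = len(X), len(X[0])
    w = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0.0:
        raise ValueError("predictors have zero covariance with the response")
    w = [v / norm for v in w]
    scores = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]
    return w, scores


# Hypothetical toy data: columns 1 and 2 are correlated (r = 0.8) and
# both track the response; column 3 is unrelated to it.
X = [
    [-1.0, -0.2, -1.0],
    [-1.0, -1.4, 1.0],
    [1.0, 0.2, -1.0],
    [1.0, 1.4, 1.0],
]
y = [-1.0, -1.0, 1.0, 1.0]  # two-level response coded as -1 / +1

weights, scores = pls_first_component(X, y)
print([round(v, 4) for v in weights])
print([round(v, 4) for v in scores])
```

Unlike the Lasso, which would drop one of the two correlated columns, the latent variable here combines both of them into a single score and gives the unrelated column zero weight.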

My general recommendation is to try many different modeling methods, both linear and nonlinear, including those mentioned by @Mark_Bailey, and then export each model's results to either the Formula Depot or Model Comparison platform and see which model works best.

There isn't any one method that fits all problems. So try many and hope you can find something that works.

dale_lehman · Aug 15, 2020 09:42 AM

For me, having 1050 cases and 230 features is an invitation for over-fitting. Your dramatic difference between performance on the training and validation data matches what I would expect from over-fitting. In my experience, the default settings in JMP's model platforms generally do a good job at avoiding the problem, but perhaps your data is an extreme case. You might try forcing the tree models in ways to avoid over-fitting (e.g., you can raise the minimum split size to force a simpler tree). In the end, I do think you will need to reduce the feature set one way or another - the key is to find the features that are likely to be the most important - and to have some confidence that you have the right set (correlations among the 230 features might make that difficult). You also might try some principal-components analysis to reduce the number of features.
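The train/validation gap that signals over-fitting can be made concrete by computing AUC separately on each set. The sketch below uses the pairwise definition of AUC (the share of positive/negative pairs in which the positive case gets the higher score, with ties counted as one half). It is plain Python, the function name is an assumption, and the labels and scores are invented toy values, not the poster's data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 0/1; scores are the model's predicted probabilities
    (or any ranking score) for class 1. Ties count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one case of each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores from the same model on two data sets.
train_labels = [0, 0, 1, 1, 0, 1]
train_scores = [0.1, 0.2, 0.9, 0.8, 0.3, 0.7]
valid_labels = [0, 0, 1, 1]
valid_scores = [0.1, 0.4, 0.35, 0.8]

train_auc = roc_auc(train_labels, train_scores)
valid_auc = roc_auc(valid_labels, valid_scores)
print(train_auc, valid_auc, train_auc - valid_auc)
```

Tracking this difference while changing a setting such as the minimum split size, or while shrinking the feature set, shows whether the change is actually reducing over-fitting or only lowering training performance.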
