Lu
Level III

Variable selection and Bootstrap Forest

Hello,

I performed a Bootstrap Forest analysis with a categorical outcome and 230 features, many of which are highly correlated. There is a huge difference in performance between my training set (AUC 0.98) and validation set (AUC 0.68). Since the data are high-dimensional and many of the variables are correlated, variable reduction is needed. Can I use GenReg with Lasso or other penalized techniques for this purpose? My data are not really normally distributed, which is one of the reasons why I preferred to use a tree model.

Thanks in advance,

Lu

 

 

1 ACCEPTED SOLUTION


Re: Variable selection and Bootstrap Forest

You could also use the Predictor Screening platform. It uses a bootstrap forest to collect the variable contributions, which can be used to select variables for a model. The bootstrap forest should avoid collinearity for the most part.

Learn it once, use it forever!


5 REPLIES
P_Bartell
Level VI

Re: Variable selection and Bootstrap Forest

Lasso and Elastic Net methods can be useful when there is multicollinearity among the predictor variables. What's not 100% clear to me is when you say, "...230 features...", are these categorical levels of the response? Or your predictor variable set?

 

You may want to read the JMP Pro documentation regarding these methods:

 

https://www.jmp.com/support/help/en/15.2/#page/jmp/overview-of-the-generalized-regression-personalit...
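To make the Lasso idea concrete, here is a minimal Python sketch of how an L1 penalty drives some coefficients to exactly zero. It is an illustration only, not what JMP's Generalized Regression does internally: it fits a plain least-squares Lasso on a continuous response by coordinate descent, whereas a categorical Y would need the logistic version. The function names (`soft_threshold`, `lasso_cd`, `centre_columns`), the synthetic data, and the penalty value `lam=0.3` are all assumptions made for the example.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1.
    X is a list of rows whose columns are assumed to be centred."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant column carries no information
            # Covariance of column j with the residual that ignores column j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta

def centre_columns(X):
    """Subtract each column's mean so no intercept term is needed."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

# Hypothetical data: x2 is a near copy of x1, x3 is pure noise.
rng = random.Random(0)
rows, y = [], []
for _ in range(200):
    x1 = rng.gauss(0, 1)
    x2 = x1 + rng.gauss(0, 0.1)
    x3 = rng.gauss(0, 1)
    rows.append([x1, x2, x3])
    y.append(2.0 * x1 + rng.gauss(0, 0.3))

demo_beta = lasso_cd(centre_columns(rows), y, lam=0.3)
print(demo_beta)  # coefficients for x1, x2, x3
```

In this toy setup the pure-noise column should be shrunk to exactly zero, while the signal is carried by the correlated pair x1/x2. How the weight is divided between two near-duplicate columns is not stable under a pure L1 penalty, which is one reason Elastic Net is often suggested for correlated predictors.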

 

You may also want to try out Partial Least Squares.

 

You don't mention the size of your dataset. With a categorical response sometimes a small dataset can be very sensitive to the structure of the training, validation, and optionally, test data sets. Did you stratify each by the levels of the response? This is a generally recognized best practice for creating the training, validation and test datasets.

Lu
Level III

Re: Variable selection and Bootstrap Forest

Thanks for the response,

The size of the dataset is 1050 cases with 1 categorical Y (slightly unbalanced, 45/55). The 230 features I mentioned are variables (X), not the response. Some of them are categorical, most of them numerical. I used the stratified training/validation/test method (60/20/20) for model training and validation.

What is the advantage of PLS compared to Lasso and Elastic Net?

 

Regards,

 

Lu

P_Bartell
Level VI

Re: Variable selection and Bootstrap Forest

@Lu : When I spoke of stratifying the response across the levels of the categorical response I was not referring to the % splits for the size of the training, validation, test data sets. What I'm recommending is when you select the response column as the stratification column, then under the 'Select Method' window, select "Stratified Validation Column". This forces as close as you can get to an equal proportion of each level of the categorical response within the training, validation, and optionally, test sets. Sometimes the gremlins of pure random selection will create a significant IMBALANCE in the levels across the response making model validation potentially more difficult. Here's some more documentation:

 

https://www.jmp.com/support/help/en/15.2/#page/jmp/launch-the-make-validation-column-platform.shtml
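Outside of JMP, the idea behind a stratified validation column can be sketched in a few lines of Python. This is only an illustration of the principle (split each response level separately so every set keeps roughly the same class mix), not the algorithm used by the Make Validation Column platform. The function name `stratified_split`, the 60/20/20 fractions, and the 473/577 class counts (roughly a 45/55 split of 1050 rows) are assumptions for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=1):
    """Return one 'train'/'validation'/'test' tag per row, assigning rows
    within each response level separately so class proportions are kept."""
    rng = random.Random(seed)
    rows_by_class = defaultdict(list)
    for row, label in enumerate(labels):
        rows_by_class[label].append(row)

    assignment = [None] * len(labels)
    for rows in rows_by_class.values():
        rng.shuffle(rows)  # random order within this level only
        n_train = round(fractions[0] * len(rows))
        n_valid = round(fractions[1] * len(rows))
        for k, row in enumerate(rows):
            if k < n_train:
                assignment[row] = "train"
            elif k < n_train + n_valid:
                assignment[row] = "validation"
            else:
                assignment[row] = "test"  # whatever is left over
    return assignment

# Hypothetical response column: about 45% level A, 55% level B.
labels = ["A"] * 473 + ["B"] * 577
tags = stratified_split(labels)

# Cross-tabulate set membership against the response level.
print(Counter(zip(tags, labels)))
```

Checking the cross-tabulation of set by response level is a quick way to confirm that no set ended up with a lopsided class mix, which is the problem a purely random split can cause.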

 

As for the advantages of partial least squares, I'm not sure there are any clear-cut ones. But one thing PLS is VERY good at is creating latent variables for your multicollinear predictors, which can in turn be used for modeling purposes. So it's just a different mathematical treatment of the predictors compared to Lasso, Elastic Net, or any of the tree-based methods. See the JMP documentation for this approach:

 

https://www.jmp.com/support/help/en/15.2/#page/jmp/partial-least-squares-models.shtml

 

My general recommendation is to try many different modeling methods, both linear and nonlinear, including those mentioned by @markbailey, and then export each model's results to either the Formula Depot or Model Comparison platforms and see which model works best.

 

There isn't any one method that fits all problems. So try many and hope you can find something that works.

dale_lehman
Level VI

Re: Variable selection and Bootstrap Forest

For me, having 1050 cases and 230 features is an invitation for over-fitting.  Your dramatic difference between performance on the training and validation data matches what I would expect from over-fitting.  In my experience, the default settings in JMP's model platforms generally do a good job at avoiding the problem, but perhaps your data is an extreme case.  You might try forcing the tree models in ways to avoid over-fitting (e.g., you can raise the minimum split size to force a simpler tree).  In the end, I do think you will need to reduce the feature set one way or another - the key is to find the features that are likely to be the most important - and to have some confidence that you have the right set (correlations among the 230 features might make that difficult).  You also might try some principal-components analysis to reduce the number of features.
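One very simple way to thin out correlated numeric features before modeling is a greedy correlation filter, sketched below in Python. Note that this is a stand-in for illustration, not principal components analysis and not anything JMP does by default: it just keeps a column only if its absolute Pearson correlation with every column already kept is below a cutoff. The function names, the 0.9 threshold, and the toy columns are assumptions, and the result depends on the order in which columns are visited.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_a * var_b)

def drop_correlated(columns, threshold=0.9):
    """columns maps a name to its values. Visit columns in insertion order
    and keep one only if |r| with every already-kept column is below
    the threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical columns: x2 is almost a multiple of x1, x3 is unrelated.
columns = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10.5],
    "x3": [5, 1, 4, 2, 3],
}
print(drop_correlated(columns))  # x2 is dropped as a near duplicate of x1
```

A filter like this ignores the response entirely, so it is best treated as a first pass before an importance-based screen rather than as a replacement for one.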
