Random Distribution Values tool for Predictor Screening

When judging whether a factor is a suitable predictor in tools like the Predictor Screening platform, you need a 'cut-off' point that retains the actual predictors and reduces the chance of overfitting.

 

Adding columns of random values (ideally drawn from the same distributions as the 'real' predictors) provides a suitable cut-off point in the Predictor Screening platform: anything ranked below a random variable is unlikely to be a useful predictor and should be considered for removal. In the example below, which uses the 'bands data' table from the sample data folder, the point where random columns first appear in the rankings marks a suitable cut-off.

Ben_BarrIngh_0-1727699831631.png

 

To automate generating the random columns, the add-in takes the user-selected (numeric continuous) columns, finds their distribution type and creates a randomly distributed column for that distribution type, which can then be used directly in the Predictor Screening platform.
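For readers who want to see the idea outside JMP, here is a minimal Python sketch of "find the distribution type, then simulate from it". It is not the add-in's code: the post does not say which distributions or fit criterion the add-in uses, so this sketch assumes only two candidates (Normal and LogNormal) chosen by maximum log-likelihood. The function names `normal_loglik` and `simulate_like` are hypothetical.

```python
import math
import random
import statistics


def normal_loglik(values, mu, sigma):
    """Log-likelihood of values under Normal(mu, sigma)."""
    n = len(values)
    sq = sum((v - mu) ** 2 for v in values)
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - sq / (2 * sigma ** 2)


def simulate_like(values, rng):
    """Return (label, simulated column) mimicking the input column.

    Assumption: only Normal and LogNormal are compared, by log-likelihood.
    The real add-in may consider other distributions and criteria.
    """
    mu, sigma = statistics.fmean(values), statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("constant column: nothing to fit")

    best_label, best_ll = "Normal", normal_loglik(values, mu, sigma)
    draw = lambda: rng.gauss(mu, sigma)

    if min(values) > 0:  # LogNormal is only defined for positive data
        logs = [math.log(v) for v in values]
        lmu, lsigma = statistics.fmean(logs), statistics.pstdev(logs)
        # LogNormal log-likelihood = Normal log-likelihood of log(x) - sum(log(x))
        ll = normal_loglik(logs, lmu, lsigma) - sum(logs)
        if ll > best_ll:
            best_label = "LogNormal"
            draw = lambda: rng.lognormvariate(lmu, lsigma)

    # One simulated value per original row, so the column lengths match
    return best_label, [draw() for _ in values]
```

Passing a seeded `random.Random` instance as `rng` keeps the simulated column reproducible, which is useful when comparing screening runs.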

 

The add-in is aimed at reducing the work/time required to generate suitable random variables. The process is as follows:

1) Launch the add-in. This opens a column dialog showing all numeric continuous columns.

Ben_BarrIngh_1-1727700115936.png

 

2) Select and add the variables you would like to test for distribution type and generate random columns from.

Ben_BarrIngh_2-1727700187084.png

 

3) New random data columns will be created (entitled 'Simulated', with distribution type marked)

Ben_BarrIngh_3-1727700233015.png

4) Place these random columns alongside your 'real' variables in the Predictor Screening platform to see how they compare (as per the first image).
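The cut-off rule behind step 4 can be written down in a few lines. The Python sketch below assumes you have copied the ranked column names out of the Predictor Screening report by hand. The column names are made up for illustration, and identifying random columns by a "Simulated" prefix is an assumption about how the add-in names them.

```python
def predictors_above_cutoff(ranked_names, is_random):
    """Keep predictors ranked above the first random column.

    ranked_names: column names ordered from most to least important.
    is_random:    function returning True for a simulated/random column.
    """
    kept = []
    for name in ranked_names:
        if is_random(name):
            break  # the first random column marks the cut-off
        kept.append(name)
    return kept


# Hypothetical ranking, most important first
ranking = ["press speed", "ink pct", "Simulated viscosity (Normal)", "humidity"]
kept = predictors_above_cutoff(ranking, lambda n: n.startswith("Simulated"))
print(kept)  # ['press speed', 'ink pct']
```

Anything after the first random column, "humidity" in this made-up ranking, becomes a candidate for removal rather than being dropped automatically.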

 

 

Any questions, comments or improvements, let me know.

 

Thanks,

Ben

 

 

Note: this add-in was inspired by this LinkedIn post from Micha Gal (Engineering Manager at Lusix), where he showed how applying columns with random values can be an effective way to mark a 'cut-off' point in the Predictor Screening platform. Thanks for the great post!

Comments
Victor_G

Hi @Ben_BarrIngh,

 

Very interesting add-in, thanks for sharing!

I might have missed your post on LinkedIn and you might have missed my comment on Mark Zwald's post (available here), but if you have some time, you can read the comment and my previous response on this topic: When should you not use predictor screening?

I have described some cases where Predictor Screening/Random Forests can be misled during feature selection with random variables, because of differences in distribution or in the cardinality of categorical features.

 

The best option would be to use randomly shuffled versions of the features (which keeps the distribution and cardinality properties the same, as well as the min, median, mean, and max values) and make each feature compete with its "shuffled counterpart".

This process is very similar to Boruta feature selection, and an add-in that could automate this process (and the analysis of the results) would be highly valuable.
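As a rough illustration of the "shuffled counterpart" idea, here is a minimal Python sketch. It is not Boruta itself and not a JMP feature: to stay self-contained it uses absolute Pearson correlation as a stand-in for the random forest importance that Predictor Screening and Boruta rely on. The function names are hypothetical.

```python
import math
import random
import statistics


def abs_pearson(x, y):
    """Stand-in importance: |Pearson r|. Assumes neither column is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)


def shuffled_counterpart_hits(features, y, n_rounds=50, seed=0):
    """Count, per feature, how often it beats a shuffled copy of itself.

    features: dict mapping column name -> list of numeric values.
    y:        list of response values, same length as each column.
    """
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for name, x in features.items():
        real = abs_pearson(x, y)
        for _ in range(n_rounds):
            shadow = x[:]        # same values, so identical distribution
            rng.shuffle(shadow)  # but any link to y is broken
            if real > abs_pearson(shadow, y):
                hits[name] += 1
    return hits
```

A feature that wins nearly every round is likely carrying real signal, while one that wins about half the time is indistinguishable from its shuffled copy. Where to draw the line is a judgment call that this sketch leaves open.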

@Victor_G, funnily enough I was just reading your posts (and others) on predictor screening last week! I always appreciate the guidance you provide community members, and I always learn something. I'll look into adding a 'shuffling'/Boruta option to the add-in. Great idea.

 

Thanks!

Ben
