How to interpret Predictor Screening results, and is it a general feature selection tool?
Mar 13, 2019 10:13 AM
I'm a master's student in bio-engineering and I'm currently working on my thesis, which includes a lot of data processing. I have 66 features of a signal in about 248 windows (each 5 seconds long), which are the inputs to my classifier. The output is the level of a specific symptom, an ordinal variable. I want to apply feature selection before the classification because I want to know which physical parameters correlate well with the symptom.
I was thinking about using the Predictor Screening option in JMP. But to do this, I need to understand two things.
- Can Predictor Screening be used as a general feature selection tool? After feature selection, I want to try different classifiers, not just Bootstrap Forest (which is what the Predictor Screening tool uses).
- How exactly do I interpret the 'contribution' level in the Predictor Screening output? I've found a lot of confusing information about this.
Extra remark: if you have any other ideas for feature selection, let me know! :-)
The Predictor Screening platform is one of many that can be used for variable selection. In fact, just about any modeling platform in JMP/JMP Pro can be used for variable selection, and each has its place in the sun. I suggest reading the JMP documentation for platforms such as Stepwise, Partition, and Partial Least Squares and, if you have JMP Pro (and it sounds like you do), Bootstrap Forest and Boosted Tree, or the Elastic Net and Adaptive Lasso options in the Generalized Regression personality.
But first...how much time and effort have you invested in understanding the process behavior AND the genealogy of the data itself? The reason I ask: suppose the results of the Predictor Screening platform run counter to the known laws of medicine (it sounds like you have a medical application?). For example, if the platform report suggests that increased smoking leads to a decreased probability of lung cancer, something is amiss. And what about data quality? Missingness, outliers, nonsense values, and hidden or less-than-obvious multicollinear structure in the data can all wreak havoc with modeling. So I recommend FIRST spending time in this space, with platforms ranging from the good old-fashioned Distribution platform (for a univariate view of the world) to the Explore Outliers, Explore Missing Values, and Principal Components platforms, to help ensure you have high-quality, reasonable data for both predictors AND responses. The last step here is to think about your model validation strategy. For variable selection problems I recommend the Train/Validation/Test construct available through the Make Validation Column utility in JMP Pro.
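For readers who want to see what a Train/Validation/Test construct amounts to outside JMP, here is a minimal Python sketch. It is not JMP's implementation. The function name make_validation_column, the 60/20/20 fractions, and the synthetic four-level labels are all assumptions made for illustration. It assigns each row (window) to one of three sets, stratified by the ordinal symptom level so that every level appears in each set.

```python
import random
from collections import Counter, defaultdict


def make_validation_column(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign each row to 'train', 'validation' or 'test', stratified by label.

    Hypothetical helper for illustration only; the fractions are an assumption.
    """
    rng = random.Random(seed)

    # Group row indices by response level so each level is split separately.
    rows_by_level = defaultdict(list)
    for row, level in enumerate(labels):
        rows_by_level[level].append(row)

    assignment = [None] * len(labels)
    for rows in rows_by_level.values():
        rng.shuffle(rows)
        n = len(rows)
        cut_train = round(fractions[0] * n)
        cut_valid = min(n, cut_train + round(fractions[1] * n))
        for position, row in enumerate(rows):
            if position < cut_train:
                assignment[row] = "train"
            elif position < cut_valid:
                assignment[row] = "validation"
            else:
                assignment[row] = "test"  # whatever remains goes to the test set
    return assignment


# Synthetic stand-in for 248 windows with a 4-level ordinal symptom score.
labels = [i % 4 for i in range(248)]
split = make_validation_column(labels)
counts = Counter(split)
print(counts)
```

The idea is that variable selection and model fitting use only the training rows, the validation rows guide tuning and model choice, and the test rows are held back for a final check of the selected predictors.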