User5555
Level I

Wide confidence intervals in Profiler for Binomial Generalized Regressions

Hello all,

 

I am doing predictive modeling and would appreciate feedback on how well my model fits. I am new to modeling and don't want to make any mistakes in model selection. I have a binomial (two-level) categorical response and continuous predictors (n=54, p=15). I have chosen a penalized model, Elastic Net, and the fit seems okay (RSquare = .28, ROC AUC = .7888, pdf file attached).

 

I am concerned about the wide confidence intervals shown in the profiler graphs and want to make sure my model isn't overfit. That said, the example binomial generalized regression shown in the JMP documentation also has wide confidence intervals, so I'm wondering what is acceptable or expected with a binomial generalized regression (https://www.jmp.com/support/help/en/17.1/index.shtml#page/jmp/example-of-binomial-generalized-regres... ).

 

Do you have any feedback on whether my model is showing signs of being overfit, or does it look safe to proceed? Are there any other metrics I should consider in evaluating my fit?

 

Thank you in advance for taking a look at this! 

statman
Super User

Re: Wide confidence intervals in Profiler for Binomial Generalized Regressions

Welcome to the community. I'm not sure I can provide all of the advice you need regarding modeling on this forum. What version of JMP are you using? It is also very difficult to provide advice with so little context. What is the response? How confident are you in the measurement of the response? How did you get the data set?

There are a number of statistics for evaluating over-fitting. For example, compare the RSquare with the RSquare Adj: as this delta increases, it is evidence that you have unimportant terms in the model. Negative RSquares from a validation data set are another sign. Looking at the pdf you sent, it is difficult to assess (it is much easier for us to look at what you have when you attach the data table). The RSquare indicates your model explains <30% of the variation in the data. Gene 4 appears to be the only interesting gene (relatively significant p-value). The standard errors seem large.
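
As a rough illustration of the RSquare versus RSquare Adj comparison, here is a minimal Python sketch. It assumes the classical least-squares adjustment formula, which is not necessarily what JMP reports for a penalized binomial fit, and it plugs in the n = 54, p = 15, and RSquare = .28 quoted in the question. The function name is hypothetical.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Classical adjusted R-square: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Values quoted in the question; p counts all 15 candidate predictors,
# although Elastic Net may have dropped some of them (an assumption here).
r2, n, p = 0.28, 54, 15
adj = adjusted_r2(r2, n, p)
print(f"RSquare = {r2:.3f}, RSquare Adj = {adj:.3f}, delta = {r2 - adj:.3f}")
```

Under these assumptions the adjusted value lands close to zero, so nearly all of the raw RSquare would be attributable to the number of terms rather than to explained variation.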

 

You may find these links helpful

 

https://www.jmp.com/support/help/en/17.0/?os=mac&source=application#page/jmp/estimation-method-optio...

https://www.jmp.com/support/help/en/17.1/index.shtml#page/jmp/example-of-the-model-comparison-table-...

 

 

"All models are wrong, some are useful" G.E.P. Box
User5555
Level I

Re: Wide confidence intervals in Profiler for Binomial Generalized Regressions

Hi Statman,

 

Thank you for your response!! I have attached the data table to my original question. I have to protect some level of confidentiality, but I can say that the response is a medical diagnosis and it is determined according to medical guidelines. I am certain that the diagnoses are correct according to those guidelines. The predictors are derived from deep sequencing data that we generated. We started with 6,000 measured responses, and I have done data reduction (blind to these samples); after reducing as much as possible, I ended up with 15 predictors. 

 

It's possible that JMP's predictor finder would highlight better predictors than the 15 I ended up with; however, the literature seems to be moving toward blinded data reduction methods for clinical pursuits. I agree that Gene 4 looks like the only predictor from this batch that is significant on its own.

 

Thanks for the tip about looking for over-fitting indications in the R^2 adj. value. That value doesn't seem to be included in my output. Do you know how I can produce it? Thanks!!

statman
Super User

Re: Wide confidence intervals in Profiler for Binomial Generalized Regressions

I understand the sensitive nature of the data, but the response is missing from the table, so it is impossible to model. About all I can do is a multivariate analysis of the Genes (and there are some outliers).

Sorry, I couldn't resist (not meant to be antagonistic, so please humor me):

I had to laugh at this comment: "the response is a medical diagnosis and it is determined according to medical guidelines. I am certain that the diagnoses are correct according to those guidelines." Not laugh as in "haha", but laugh at the idea of certainty based on guidelines. Guidelines are, by definition, not definitive. I have a rare autoimmune disease that was misdiagnosed multiple times by multiple doctors using medical (clinical) guidelines...

 

R-Squares:

https://www.jmp.com/support/help/en/17.0/?os=mac&source=application#page/jmp/statistical-details-for...

 


Re: Wide confidence intervals in Profiler for Binomial Generalized Regressions

You might use some form of cross-validation to assess bias and variance from under- or over-fitting the data.
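
To make the idea concrete, here is a minimal pure-Python sketch of stratified k-fold splitting, where each fold keeps roughly the same share of each response level. It is an illustration only, not how JMP implements cross-validation; the function name and the 20/34 class split are made up for the example.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each class is spread evenly over k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        # Deal the shuffled rows of this class round-robin into the folds.
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Hypothetical labels: 54 rows, 20 positive diagnoses (made-up split).
labels = [1] * 20 + [0] * 34
for train, test in stratified_kfold(labels, k=5):
    # Fit the model on `train` only, then score it on the held-out `test` rows.
    print(len(train), len(test), sum(labels[i] for i in test))
```

A large gap between the fit on the training rows and the fit on the held-out rows, averaged over the folds, is the usual symptom of over-fitting.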

 

Binomial or dichotomous responses generally display a higher degree of uncertainty in predictions compared to a quantitative response.
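
One way to see why: a single 0/1 outcome has variance p(1 - p), which peaks at p = 0.5. The sketch below uses the simple Wald interval for a single proportion, a simplification chosen for illustration and not the interval the Profiler computes, with the n = 54 from the question. The function name is hypothetical.

```python
import math


def wald_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the approximate 95% Wald interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)


n = 54  # sample size quoted in the question
for p in (0.1, 0.3, 0.5):
    print(f"p = {p:.1f}: +/- {wald_half_width(p, n):.3f}")
```

Even in this best case with no predictors at all, the half-width near p = 0.5 is roughly 0.13 at n = 54, so wide intervals around predicted probabilities are not surprising at this sample size.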

 

The question of what is expected or acceptable is subjective. There are no external rules or guidance.

User5555
Level I

Re: Wide confidence intervals in Profiler for Binomial Generalized Regressions

Hi Mark,

Thank you for the response!! I am using AICc as the validation method, which I thought was the best option given my small sample size (n=54). Would you recommend using k-fold or leave-one-out instead?

As for the uncertainty: one of my goals is to be able to point out which genes have a positive correlation with the response (i.e., the probability of the response is higher when the gene counts are higher). Would I be unable to do so with these large confidence intervals? Is penalized regression the best way to go, or is there another method you would recommend that could provide more certainty with small sample sizes? I've looked into a lot of methods, and it seems a penalized model is favored when the number of predictors is greater than n/10 (according to Dr. Frank Harrell's book on Generalized Regressions and other sources). But if it's not fitting well, I suppose I should consider other methods. Also, I just attached my data table to my question.

 

I'm not sure what the scope of this forum is, so my apologies if my questions are outside it! No worries if you can't answer them:) Thanks! 

Re: Wide confidence intervals in Profiler for Binomial Generalized Regressions

AICc is about model selection among candidates. It does not address validation. JMP offers a variety of methods for cross-validation, depending on whether you use JMP or JMP Pro. K-fold cross-validation is generally a good choice for small data sets. Compare AICc between the training and the validation holdout sets. They should not be very different.
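
For reference, the standard small-sample correction is AICc = -2 logL + 2k + 2k(k + 1)/(n - k - 1), where k is the number of estimated parameters. The sketch below is a minimal Python illustration of how that penalty grows at n = 54; the log-likelihood value is hypothetical and held fixed only to isolate the penalty, and the function name is an assumption.

```python
def aicc(log_likelihood: float, k: int, n: int) -> float:
    """AICc = -2*logL + 2k + 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    aic = -2.0 * log_likelihood + 2.0 * k
    return aic + 2.0 * k * (k + 1) / (n - k - 1)


n = 54
log_likelihood = -30.0  # hypothetical value, for illustration only
for k in (3, 8, 16):
    print(k, round(aicc(log_likelihood, k, n), 2))
```

Because the score is computed entirely from the data used to fit the model, it ranks candidate models but says nothing directly about performance on rows the model has not seen.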