gene

Partial least squares parameter variance

Hopefully this question makes sense.  I'm doing logistic regression with N=2000 observations and p=160 covariates.  There is obvious correlation structure in the design matrix, and I can see that simply by looking at the VIFs when I fit an ordinary logistic regression.  I then did PLS and got a different set of coefficients, though there are still 160 of them, since PLS does not do variable reduction.  Is there a place in the PLS report to find the variances of the new set of coefficients, to see that they are reduced?  Does it make sense that I'd want to look at these, since I might be trading some bias for lower variance?
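To make the question concrete, here is the kind of comparison I have in mind, sketched outside JMP in Python with scikit-learn on a purely synthetic stand-in for my data (the latent structure, component count, and bootstrap size are all placeholders): bootstrap both fits and compare the per-coefficient variances.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 160 correlated predictors driven by 20 latent signals
n, p = 2000, 160
latent = rng.normal(size=(n, 20))
X = latent @ rng.normal(size=(20, p)) + 0.5 * rng.normal(size=(n, p))
beta = rng.normal(size=p) / (4 * np.sqrt(p))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta)))).astype(int)

def bootstrap_variance(fit_fn, B=50):
    # Refit on B bootstrap resamples; return the variance of each coefficient
    coefs = []
    for _ in range(B):
        idx = rng.integers(0, n, n)
        coefs.append(fit_fn(X[idx], y[idx]))
    return np.var(np.array(coefs), axis=0)

def logistic_coefs(Xb, yb):
    # Plain, unpenalized logistic regression (use penalty='none' on older scikit-learn)
    return LogisticRegression(penalty=None, max_iter=2000).fit(Xb, yb).coef_.ravel()

def pls_coefs(Xb, yb, ncomp=10):
    # PLS treating the 0/1 response as continuous, as the JMP PLS platform does
    return PLSRegression(n_components=ncomp).fit(Xb, yb).coef_.ravel()

print("median bootstrap variance, logistic:", np.median(bootstrap_variance(logistic_coefs)))
print("median bootstrap variance, PLS:", np.median(bootstrap_variance(pls_coefs)))

The two sets of coefficients aren't on the same scale (the PLS fit is a linear model for the 0/1 response rather than a log-odds model), so the raw variances aren't directly comparable; the point is just how one could quantify the variance side of the bias-variance trade.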

4 REPLIES
Peter_Bartell


Re: Partial least squares parameter variance

Gene:

 

Your questions make sense. However, can I offer an alternative thought? If variable reduction is key to your practical problem, have you considered using either the Lasso or the Elastic Net option within the Fit Model -> Generalized Regression personality/platform? These techniques are suited to handling multicollinearity among the predictors, but they have some important characteristics to consider as well; these are explained in the JMP documentation. Both can be used with a categorical response, and the 'final' parameter estimate confidence intervals are reported right alongside the table of parameter estimates.
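If it helps to see the idea outside JMP, here is a minimal sketch of an elastic net logistic fit in Python/scikit-learn on placeholder data. Note that scikit-learn picks the penalty by cross-validation and does not report Gen Reg-style confidence intervals for the penalized estimates, so treat this purely as an illustration of the shrinkage and term selection.

import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data; substitute the real n x p design matrix and 0/1 response
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 40))
y = (rng.random(500) < 1 / (1 + np.exp(-(X[:, :5].sum(axis=1) - 1)))).astype(int)

# Elastic net logistic regression with the penalty strength chosen by 5-fold CV;
# l1_ratio = 1.0 is the lasso, 0.0 is pure ridge, values in between mix the two
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        penalty="elasticnet", solver="saga",
        l1_ratios=[0.5, 1.0], Cs=10, cv=5, max_iter=5000,
    ),
)
model.fit(X, y)

coefs = model[-1].coef_.ravel()
print("nonzero coefficients:", int((coefs != 0).sum()), "of", coefs.size)

The zeroed-out terms are the ones the penalty drops, which is the variable-reduction behavior I was referring to.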

gene


Re: Partial least squares parameter variance

Thanks Peter,

 

In my case it is really all about the correlation structure of the predictors.  I'm not convinced that variable reduction is warranted in my application.  In fact, that is a question I haven't answered yet, and I'm searching for a definitive way to test whether overfitting is occurring even when I use all my predictors.  I have N=1839 observations with p=148 covariates; I've included the table.  If we just use ordinary regression (treating the nominal output as continuous), we can see that there are plenty of very large VIFs, so we know we will also have a problem when we do logistic regression.  We have a correlation problem for sure, but maybe not an overfitting problem.  That's why I went with PLS.  If you cluster the variables you get only 35 clusters, suggesting that there are only about 35 informative features among the 148 covariates.  So why focus on eliminating any of them instead of just using a method that is robust to correlation?
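For reference, the two checks I mentioned (VIFs and clustering the variables) look roughly like this outside JMP in Python; the data below are a synthetic stand-in for my table, built so that a few dozen underlying signals drive all 148 columns.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-in: 37 latent signals, each copied 4 times with a little noise
rng = np.random.default_rng(2)
base = rng.normal(size=(1839, 37))
X = np.repeat(base, 4, axis=1) + 0.2 * rng.normal(size=(1839, 148))

# VIF for each column; large values flag collinearity
vifs = np.array([variance_inflation_factor(X, j) for j in range(X.shape[1])])
print("max VIF:", round(float(vifs.max()), 1), " columns with VIF > 10:", int((vifs > 10).sum()))

# Cluster the columns on 1 - |correlation|; the number of flat clusters at a chosen
# cutoff is one rough gauge of how many distinct signals the 148 columns carry
dist = np.clip(1 - np.abs(np.corrcoef(X, rowvar=False)), 0, None)
np.fill_diagonal(dist, 0.0)
clusters = fcluster(linkage(squareform(dist, checks=False), method="average"),
                    t=0.5, criterion="distance")
print("number of variable clusters:", len(np.unique(clusters)))

JMP's variable clustering works differently under the hood, so the cluster count here won't match exactly; the sketch is just to show the kind of check I ran.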

 

When I run logistic regression I get an AUC of only .815, so even using all the covariates doesn't give me a perfect fit.  Is AUC < 1 a legitimate way to assert that overfitting is not the issue?

 

I welcome any thoughts you have.

Peter_Bartell


Solution

Re: Partial least squares parameter variance

One commonly employed technique for checking for overfitting, when you have enough observations, is to divide the original data set into training, validation, and (optionally) test groups using the Make Validation subplatform in the Predictive Modeling platform collection. For classification problems it's best practice to stratify the sample on the Y so that each of the training, validation, and optional test groups has the same proportions of the response levels. You have a fairly sizable imbalance between the 1 and 0 levels of your response, so you may want to try a few scenarios where the balance is closer to 50/50. Once you fit the model, examine the ROC curves for each grouping: if the AUCs for the groupings are reasonably close, there is a lack of evidence of overfitting. Also check the confusion matrices. I'm a little alarmed at the results I got with the training/validation sets I created in your original data table: the model does a pretty poor job, misclassifying actual 1's as 0's when the target = 0, no matter which modeling approach I used. See the three analysis scripts (logistic, Gen Reg/Lasso, and Neural Net) in the attached data table. If this is problematic from a practical point of view, then your model is not fitting well at all; forget about overfitting, it's just a poor classifier.
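If you want to reproduce that check outside JMP, here is a minimal sketch in Python/scikit-learn of the stratified split plus the training-versus-validation AUC and confusion matrix comparison, on placeholder data:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data with an imbalanced 0/1 response, standing in for the real table
rng = np.random.default_rng(3)
X = rng.normal(size=(1839, 148))
y = (rng.random(1839) < 0.2).astype(int)

# Stratified split so both groups keep the same proportion of 1s and 0s
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)

# Close training and validation AUCs argue against overfitting; a training AUC
# well above the validation AUC is the classic overfitting signature
for name, Xs, ys in [("train", X_tr, y_tr), ("validation", X_va, y_va)]:
    prob = model.predict_proba(Xs)[:, 1]
    print(name, "AUC:", round(roc_auc_score(ys, prob), 3))
    print(confusion_matrix(ys, (prob >= 0.5).astype(int)))

On this pure-noise stand-in the training AUC will sit well above the validation AUC, which is exactly the gap to look for.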

gene


Re: Partial least squares parameter variance

Thanks Peter,

 

One thing that troubles me is the large difference in the models when the adaptive lasso is selected.  You can see in the adaptive model that the Scaled-LogLikelihood value doesn't have a well-defined minimum.  Another issue is that you modeled Y=0 and got 59 terms in the model, while if you target Y=1 you get a completely different model with 71 terms.  I don't get this.  Without the adaptive option it doesn't matter whether I target 0 or 1; the models are the same except that the terms have opposite signs.  Since I find the adaptive results difficult to understand, let alone explain to my customers, I'll stick with vanilla lasso.
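As a sanity check on the 'same terms, opposite signs' behavior of vanilla lasso, here is a small sketch outside JMP in Python/scikit-learn on placeholder data. With a plain L1 penalty the penalty term only depends on |beta|, so flipping which level is targeted should just flip the signs of the solution:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data with a little real signal in the first five columns
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 30))
y = (rng.random(500) < 1 / (1 + np.exp(-X[:, :5].sum(axis=1)))).astype(int)

lasso = dict(penalty="l1", solver="saga", C=0.5, max_iter=5000)

b_target1 = LogisticRegression(**lasso).fit(X, y).coef_.ravel()       # model for Y=1
b_target0 = LogisticRegression(**lasso).fit(X, 1 - y).coef_.ravel()   # model for Y=0

# Up to solver tolerance the two fits agree except for a sign flip, so the same
# terms enter or leave the model regardless of which response level is targeted
print("max |b_target1 + b_target0|:", float(np.abs(b_target1 + b_target0).max()))
print("nonzero terms:", int((b_target1 != 0).sum()), "vs", int((b_target0 != 0).sum()))

(The saga solver gives exact zeros for the L1 penalty and does not penalize the intercept, so the sign-flip symmetry holds up to solver tolerance.)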

 

With vanilla lasso we can get reasonable performance in the confusion matrices with a threshold of 0.1522.  I say reasonable in the sense that the required sensitivity and specificity are perhaps met.
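For completeness, applying a cutoff like 0.1522 to the predicted probabilities and reading off sensitivity and specificity looks like this outside JMP in Python/scikit-learn (placeholder data and a placeholder model again):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Placeholder data with an imbalanced response and a little signal
rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 20))
y = (rng.random(1000) < 1 / (1 + np.exp(-(X[:, :3].sum(axis=1) - 1.5)))).astype(int)

prob = LogisticRegression(max_iter=2000).fit(X, y).predict_proba(X)[:, 1]

threshold = 0.1522   # the cutoff discussed above; any value in (0, 1) works the same way
pred = (prob >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
print("sensitivity:", round(tp / (tp + fn), 3), " specificity:", round(tn / (tn + fp), 3))

Lowering the threshold trades specificity for sensitivity, which is why a cutoff well below 0.5 can make sense with an imbalanced response.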

 

That said, we still have the question of whether we need to do regularization at all.  Even vanilla logistic regression gives two AUCs (training and validation) that are somewhat close.  But we wouldn't want to use plain logistic regression because we know there is serious correlation.  So, without significant evidence of overfitting, I'd move to PLS in order to address only the correlation issue.  But since JMP models the PLS target as continuous, we don't get the ROC curves, AUCs, and confusion matrices in the report.  I can get them by applying the inverse logit (the logistic function) to the linear predictor, and I think that is legitimate.  And that circles back to my original question: how should PLS results be treated for model comparison when the other models are based strictly on the logistic transformation of their linear predictors?
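Here is what I mean, sketched outside JMP in Python/scikit-learn on placeholder data: fit PLS to the 0/1 target, squash the linear predictor through the logistic function, and compute the ROC/AUC from the result. Since the squashing is a monotone transform, the ROC curve and AUC come out the same either way; the transform mainly puts the score on a 0-to-1 scale so it can be read side by side with the logistic models.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Placeholder data standing in for the real table
rng = np.random.default_rng(6)
X = rng.normal(size=(1839, 148))
y = (rng.random(1839) < 1 / (1 + np.exp(-(X[:, :10].sum(axis=1) - 1)))).astype(int)

# PLS treating the 0/1 response as continuous, as the JMP PLS platform does
pls = PLSRegression(n_components=10).fit(X, y)
score = pls.predict(X).ravel()          # the continuous PLS prediction (linear predictor)

# Monotone squashing onto (0, 1); note these are not calibrated probabilities,
# so a cutoff should still be chosen from the ROC rather than carried over as-is
prob = 1.0 / (1.0 + np.exp(-score))

print("AUC from raw PLS score:    ", round(roc_auc_score(y, score), 3))
print("AUC from logistic-squashed:", round(roc_auc_score(y, prob), 3))
fpr, tpr, thresholds = roc_curve(y, prob)   # cutoffs for a confusion matrix, if needed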

 

By the way, I greatly appreciate your diving into these questions since my own thorough understanding of what the heck I'm doing is very important to me.

 

I've included the same data table with a few more models.