<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Difference in results between Generalized RSquare and RSquare (JMP Discussions)</title>
    <link>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/336638#M58404</link>
    <description>&lt;P&gt;Hi everyone,&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried to find an answer to my question in the community or in JMP help, but I couldn't. So, I would really appreciate it if anyone could point me in the right direction here or let me know what basic concept I am missing.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am fitting a model to my response (dependent variable - DV), which is continuous, using 3 independent variables (IVs) [2 categorical and 1 continuous] and the interaction between 2 of them (for the third I use only the main effect). I first used Least Squares for that, as you can see in the screenshot below.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="AnnaPaula_0-1606154145936.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/28557i207D9A1000D031B9/image-size/medium?v=v2&amp;amp;px=400" role="button" title="AnnaPaula_0-1606154145936.png" alt="AnnaPaula_0-1606154145936.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;However, my residual by predicted plot does not look very good (I have some outliers).&amp;nbsp;No matter which other factors I tried to add to the model - or which transformation I applied - I was not able to improve my residuals plot. The best I could get was with a transformation of my response, which is the one I am already showing above.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Next, I thought about using a penalized regression method (more specifically, Ridge) through Generalized Regression. 
I read it can be more robust to outliers (though it will introduce some bias).&lt;/P&gt;&lt;P&gt;I started by running Generalized Regression with Standard Least Squares.&lt;/P&gt;&lt;P&gt;According to some videos I have seen, this should produce the same results as Least Squares regression.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Also, according to JMP help (&lt;A href="https://www.jmp.com/support/help/en/15.2/index.shtml#page/jmp/training-and-validation-measures-of-fit.shtml" target="_blank" rel="noopener"&gt;https://www.jmp.com/support/help/en/15.2/index.shtml#page/jmp/training-and-validation-measures-of-fit.shtml&lt;/A&gt;), the Generalized RSquare provided should be between 0 and 1 and should&amp;nbsp;simplify to the traditional RSquare for continuous normal responses in the standard least-squares setting. However, as you can see from the screenshot below, although the Generalized RSquare matches the RSquare for the Training set, it does not match for the Validation set (and it is negative). The other statistics (e.g. RASE) do match for both sets (Training and Validation).&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Any ideas about what may be happening here?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image.png" style="width: 429px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/28561i36B54CA2865DA9BA/image-dimensions/429x387?v=v2" width="429" height="387" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Thank you in advance for any ideas and hints!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;PS:&amp;nbsp;For everything I am doing in this post, I used the Fit Model platform in JMP.&lt;/P&gt;</description>
    <pubDate>Fri, 09 Jun 2023 00:25:14 GMT</pubDate>
    <dc:creator>AnnaPaula</dc:creator>
    <dc:date>2023-06-09T00:25:14Z</dc:date>
    <item>
      <title>Difference in results between Generalized RSquare and RSquare</title>
      <link>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/336638#M58404</link>
      <description>&lt;P&gt;Hi everyone,&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried to find an answer to my question in the community or in JMP help, but I couldn't. So, I would really appreciate it if anyone could point me in the right direction here or let me know what basic concept I am missing.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am fitting a model to my response (dependent variable - DV), which is continuous, using 3 independent variables (IVs) [2 categorical and 1 continuous] and the interaction between 2 of them (for the third I use only the main effect). I first used Least Squares for that, as you can see in the screenshot below.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="AnnaPaula_0-1606154145936.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/28557i207D9A1000D031B9/image-size/medium?v=v2&amp;amp;px=400" role="button" title="AnnaPaula_0-1606154145936.png" alt="AnnaPaula_0-1606154145936.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;However, my residual by predicted plot does not look very good (I have some outliers).&amp;nbsp;No matter which other factors I tried to add to the model - or which transformation I applied - I was not able to improve my residuals plot. The best I could get was with a transformation of my response, which is the one I am already showing above.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Next, I thought about using a penalized regression method (more specifically, Ridge) through Generalized Regression. 
I read it can be more robust to outliers (though it will introduce some bias).&lt;/P&gt;&lt;P&gt;I started by running Generalized Regression with Standard Least Squares.&lt;/P&gt;&lt;P&gt;According to some videos I have seen, this should produce the same results as Least Squares regression.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Also, according to JMP help (&lt;A href="https://www.jmp.com/support/help/en/15.2/index.shtml#page/jmp/training-and-validation-measures-of-fit.shtml" target="_blank" rel="noopener"&gt;https://www.jmp.com/support/help/en/15.2/index.shtml#page/jmp/training-and-validation-measures-of-fit.shtml&lt;/A&gt;), the Generalized RSquare provided should be between 0 and 1 and should&amp;nbsp;simplify to the traditional RSquare for continuous normal responses in the standard least-squares setting. However, as you can see from the screenshot below, although the Generalized RSquare matches the RSquare for the Training set, it does not match for the Validation set (and it is negative). The other statistics (e.g. RASE) do match for both sets (Training and Validation).&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Any ideas about what may be happening here?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image.png" style="width: 429px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/28561i36B54CA2865DA9BA/image-dimensions/429x387?v=v2" width="429" height="387" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Thank you in advance for any ideas and hints!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;PS:&amp;nbsp;For everything I am doing in this post, I used the Fit Model platform in JMP.&lt;/P&gt;</description>
      <pubDate>Fri, 09 Jun 2023 00:25:14 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/336638#M58404</guid>
      <dc:creator>AnnaPaula</dc:creator>
      <dc:date>2023-06-09T00:25:14Z</dc:date>
    </item>
    <item>
      <title>Re: Difference in results between Generalized RSquare and RSquare</title>
      <link>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/336681#M58407</link>
      <description>&lt;P&gt;Is the continuous predictor involved in the interaction term? Did you use the estimate for the omitted interaction to determine that it should be removed or did you leave it out to begin with?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What transformations of the response did you try?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Penalized regression and least squares regression do not usually lead to the same estimates of the model parameters, so the R square would not be the same. Your picture shows the OLS result, not the Ridge Regression.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The training and validation results should be similar with the best unbiased model, but not the same. How did you determine the hold out set for validation?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;A negative R square is possible.&lt;/P&gt;</description>
      <pubDate>Mon, 23 Nov 2020 18:26:44 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/336681#M58407</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2020-11-23T18:26:44Z</dc:date>
    </item>
    <item>
      <title>Re: Difference in results between Generalized RSquare and RSquare</title>
      <link>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/336689#M58409</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/21986"&gt;@AnnaPaula&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; If I understand correctly, you have a data set that you have split according to a validation column of some kind. One thing to consider is whether or not you stratified the validation column on your response to make sure both the training and validation sets have similar distributions.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; Since the R^2 is negative, to me that indicates you are overfitting your data. Have you done some bootstrapping on the estimates to see if the terms you have in your model are really important? Do you have a theoretical basis for including mixed terms?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; Also, you might want to try Elastic Net, since it's a combination of Ridge and Lasso. Another thing to consider is whether your response distribution is really normal, or whether a different distribution is better suited to your data.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;DS&lt;/P&gt;</description>
      <pubDate>Mon, 23 Nov 2020 18:37:10 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/336689#M58409</guid>
      <dc:creator>SDF1</dc:creator>
      <dc:date>2020-11-23T18:37:10Z</dc:date>
    </item>
    <item>
      <title>Re: Difference in results between Generalized RSquare and RSquare</title>
      <link>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/336740#M58418</link>
      <description>&lt;P&gt;Sorry&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/5358"&gt;@Mark_Bailey&lt;/a&gt;, I think I was not very clear in my question.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As you stated, I did not use Ridge (that was going to be my next step). First, I used the Least Squares method again (through the Generalized Regression model) to check if the results would match the results from Least Squares run directly from the Fit Model window. Because my results did not match, I did not even end up trying Ridge.&lt;/P&gt;&lt;P&gt;But I am curious &lt;STRONG&gt;why the results from Least Squares through Generalized Regression do not match the results from Least Squares directly in Fit Model&lt;/STRONG&gt;. Do you have any ideas?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Answering your questions:&lt;/P&gt;&lt;P&gt;1. The continuous predictor is not involved in the interaction term.&lt;/P&gt;&lt;P&gt;2. I used stepwise and also used the estimates myself to determine whether the interaction effect should be removed.&lt;/P&gt;&lt;P&gt;3. I tried Log, Logistic, Logit, and Reciprocal. With Reciprocal, the model deteriorates a lot. With Logistic, there is a slight improvement in the statistics (R-square, RASE), but not in the residual plots. So nothing to justify the use of the transformation, in my opinion.&amp;nbsp;&lt;/P&gt;&lt;P&gt;4. The training and validation sets were defined based on groups I have (these are computer experiments with different random seeds). I used the seeds to separate the data into groups and used that column as my holdout validation column.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;5. My question about the negative value was for the &lt;STRONG&gt;Generalized R-square&lt;/STRONG&gt;, because when I first read the JMP help, I thought it could only range between 0 and 1. 
But yes, I found the formula for it, and as we can see it can indeed go below 0:&amp;nbsp;&lt;/P&gt;&lt;TABLE border="1"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;Nagelkerke / Cragg &amp;amp; Uhler’s (Generalized RSquare)&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image.png" style="width: 237px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/28567iE891AFBCC45888B8/image-size/large?v=v2&amp;amp;px=999" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Mon, 23 Nov 2020 21:59:11 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/336740#M58418</guid>
      <dc:creator>AnnaPaula</dc:creator>
      <dc:date>2020-11-23T21:59:11Z</dc:date>
    </item>
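The Nagelkerke / Cragg-Uhler formula quoted in the table above can be checked numerically. Below is a minimal sketch (assuming a normal response; the data and the deliberately poor holdout predictions are invented for illustration, not taken from the thread) of how a generalized R-square computed from log-likelihoods can fall far below 0 on a validation set:

```python
import numpy as np

def normal_loglik(y, mu, sigma):
    """Normal log-likelihood of observations y given mean(s) mu and sd sigma."""
    n = len(y)
    return -0.5 * n * np.log(2.0 * np.pi * sigma**2) - np.sum((y - mu) ** 2) / (2.0 * sigma**2)

def generalized_r2(y, yhat, sigma_model):
    """Nagelkerke / Cragg-Uhler generalized R^2 on the set (y, yhat):
       R^2 = (1 - (L0 / LM)**(2/n)) / (1 - L0**(2/n)),
       where L0 is the likelihood of the intercept-only (null) model
       and LM the likelihood of the fitted model."""
    n = len(y)
    ll_model = normal_loglik(y, yhat, sigma_model)
    ll_null = normal_loglik(y, np.mean(y), np.std(y))
    num = 1.0 - np.exp((2.0 / n) * (ll_null - ll_model))
    den = 1.0 - np.exp((2.0 / n) * ll_null)
    return num / den

rng = np.random.default_rng(1)
y_valid = rng.normal(10.0, 1.0, 20)
yhat_bad = y_valid + 5.0   # hypothetical predictions that miss badly on the holdout
r2_gen = generalized_r2(y_valid, yhat_bad, sigma_model=1.0)
# When the model's holdout likelihood drops below the null model's likelihood,
# the numerator turns negative and its magnitude is unbounded, much like the
# -1.24e+12 reported in this thread.
```

The key point: the ratio is only guaranteed to lie in [0, 1] on the data the model was fit to; on a holdout set the fitted model can be less likely than the null model, and the statistic then goes negative without bound.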
    <item>
      <title>Re: Difference in results between Generalized RSquare and RSquare</title>
      <link>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/336741#M58419</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/12549"&gt;@SDF1&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you so much for your reply and your insights. They are very helpful and they align with a few things I was thinking.&amp;nbsp;&lt;/P&gt;&lt;P&gt;I will provide some extra information here in case you have more insights to give (I would really appreciate it).&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1. The training and validation sets do have (very) similar response distributions. As I explained to Mark, the response was calculated based on computer experiments generated using different seeds (and I used the seeds to stratify into training and validation sets).&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;2. Let's assume I am overfitting; my main question is:&lt;/P&gt;&lt;P&gt;Why is my RSquare for the training set 0.9995 using Least Squares directly from Fit Model or through Generalized Regression, while my RSquare for the validation set is &lt;STRONG&gt;&lt;U&gt;0.9791&lt;/U&gt; using Least Squares directly from Fit Model and &lt;U&gt;-1.24e+12&lt;/U&gt; using Least Squares through Generalized Regression&lt;/STRONG&gt;?&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;I do not understand the differences in the RSquare values for the validation set.&lt;/STRONG&gt; &lt;STRONG&gt;Shouldn't the values be the same (even if there is overfitting)?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;3. I do think I have reason to include the terms in my model. First, my response was calculated with different data based on the different number of replications (one categorical IV) and a different "formula/data" based on the different number of clusters (second categorical IV). 
The known theoretical formula to calculate the response takes into consideration the arrival rate (continuous IV), because that value affects the data generated.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am not sure how I would do bootstrapping in JMP. But I did try different models manually (with and without the terms) and I also tried using Forward Stepwise. In both cases, the results showed that the two categorical IVs + their interaction + the continuous IV was the best model.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;4. Your last point is actually something that I was hoping someone would bring to this discussion.&amp;nbsp;&lt;/P&gt;&lt;P&gt;My response is not actually Normal (it is LogNormal, in fact).&amp;nbsp;&lt;/P&gt;&lt;P&gt;So, I am going to adjust that when I try Ridge or Elastic Net.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;But even if my response is not Normal, shouldn't the RSquare values for the Validation sets in both images I uploaded match?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you so much for your insights!&lt;/P&gt;</description>
      <pubDate>Mon, 23 Nov 2020 22:33:53 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/336741#M58419</guid>
      <dc:creator>AnnaPaula</dc:creator>
      <dc:date>2020-11-23T22:33:53Z</dc:date>
    </item>
    <item>
      <title>Re: Difference in results between Generalized RSquare and RSquare</title>
      <link>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/337137#M58459</link>
      <description>&lt;P&gt;I noticed in your original residual plot that the outliers all seem to be in the validation hold-out set.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Generalized Regression uses a different parameterization than Fit Least Squares. See this page of &lt;A href="https://www.jmp.com/support/help/en/15.2/?os=mac&amp;amp;source=application&amp;amp;utm_source=helpmenu&amp;amp;utm_medium=application#page/jmp/nominal-factors.shtml#" target="_self"&gt;statistical details&lt;/A&gt;. It also uses a different estimation method (MLE vs OLS).&lt;/P&gt;</description>
      <pubDate>Tue, 24 Nov 2020 17:58:05 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/337137#M58459</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2020-11-24T17:58:05Z</dc:date>
    </item>
    <item>
      <title>Re: Difference in results between Generalized RSquare and RSquare</title>
      <link>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/337162#M58462</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/21986"&gt;@AnnaPaula&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; I'm not sure I quite follow how you used the seeds to generate stratification of the validation column. But if you use the validation column platform in JMP Pro (which you have if you can access GenReg), then you can just stratify on the response (Y variable) you're fitting. If all your outliers are in the validation set, you might want to re-think how you generated the two sets.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; Another thing you might consider is splitting off a test data set entirely -- into another data table -- that is not used to fit the data. If you run multiple models on the training/validation set, you can then use the test data set to see which model is really the best at prediction; you can do this using the Model Comparison platform. Presumably, your goal is to have the best predictive capability, and you need to test that somehow. It looks like you have a large enough set that you could do something like that. For example, make a validation column with, e.g., 60% train, 20% validation, 20% test. 
Subset the test data set into an entirely different table, and then use the remaining train/validate subset to build models.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; For your other questions about R^2 being negative or not quite matching up, I'd refer to what Mark said.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; If your data is log-normal distributed, you'll want to select that option in the model specification, as that changes some of the underlying processes behind how the fitting is done.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; When you have created a model in SLS or GenReg or PLS, you can bootstrap the Estimates by right-clicking the Estimate column and selecting Bootstrap. You'll want to run several thousand samples and then look at the distributions to see if the original Estimate for the coefficient of that term is close to the global mean of the many estimations. JMP will refit the model to resampled versions of your data and therefore generate many different estimates; you can then look at that distribution and determine 1) whether the estimate from the first fit was accurate, and 2) whether the coefficient for the effect is really contributing a lot or not. For example, in the SLS platform, the effects are given an FDR LogWorth value to estimate the false discovery rate for that effect. If the value is &amp;gt;2 (the blue line), then there's significant evidence that the effect really does contribute to the model. On the other hand, it could be smaller, or near the value 2. If you bootstrap those FDR values, you can find the mean and range to determine whether the effect is meaningful or not. 
I've had times where an effect looked like it could be borderline, and after bootstrapping the FDR, most of the time it was actually not crossing above 2 and was therefore not really a globally important factor in the model.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp; As a last note, I highly recommend trying multiple different kinds of modeling platforms: boosted tree, bootstrap forest, NN, etc., to see if another platform might work better for your data/situation.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hope this helps!&lt;/P&gt;&lt;P&gt;DS&lt;/P&gt;</description>
      <pubDate>Tue, 24 Nov 2020 18:47:13 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/337162#M58462</guid>
      <dc:creator>SDF1</dc:creator>
      <dc:date>2020-11-24T18:47:13Z</dc:date>
    </item>
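The right-click Bootstrap workflow described in the post above can be mimicked outside JMP to build intuition: refit the model on rows resampled with replacement and inspect the spread of the coefficient estimates. A minimal sketch with simulated data (all names and numbers are illustrative, not from the thread; JMP does the resampling for you):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data: one continuous predictor with a known slope of 0.5
n = 200
x = rng.uniform(0.0, 10.0, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])

def ols_fit(X, y):
    # Ordinary least squares (lstsq for numerical stability)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Bootstrap: resample rows with replacement and refit
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    boot.append(ols_fit(X[idx], y[idx]))
boot = np.array(boot)

# Compare the original estimate with the bootstrap distribution of the slope
beta_hat = ols_fit(X, y)
lo, hi = np.percentile(boot[:, 1], [2.5, 97.5])   # 95% percentile interval
```

If the original estimate sits near the center of the bootstrap distribution and the interval excludes zero, that is evidence the term is genuinely contributing, which is the same judgment the post describes making from JMP's bootstrap report.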
    <item>
      <title>Re: Difference in results between Generalized RSquare and RSquare</title>
      <link>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/337195#M58465</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/5358"&gt;@Mark_Bailey&lt;/a&gt;&amp;nbsp;, thank you so much for getting back to me again. I really appreciate all your help and patience with my questions.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Yes, for this specific response all my outliers are in the validation set. But for other responses I am investigating, this is not true (the outliers fall in both the validation and training sets).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In terms of what I am modeling: I had run a few experiments using discrete event simulation (with different seeds so that I would have different data). Based on that I calculated my four responses of interest (here I showed only one).&lt;/P&gt;&lt;P&gt;Now, I am trying to fit a regression model to those responses, because my next step will be to use an optimization model to find the best setting for my responses of interest.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As far as I know, the&amp;nbsp;parameterization/coding&amp;nbsp;should not impact the model statistics. Am I missing something here? (&lt;A href="http://faculty.cas.usf.edu/mbrannick/regression/anova1.html" target="_self"&gt;Coding&lt;/A&gt;)&lt;/P&gt;&lt;P&gt;Also, for linear regression and when residuals are normally distributed, OLS and MLE lead to the same coefficients. Don't they? (&lt;A href="https://openclassrooms.com/en/courses/5873596-design-effective-statistical-models-to-understand-your-data/6233001-appreciate-ordinary-least-square-and-maximum-likelihood-estimation#:~:text=The%20ordinary%20least%20square%20(OLS)%20method%20is%20tailored%20to%20the,always%20give%20a%20decent%20result.&amp;amp;text=The%20maximum%20likelihood%20estimation%20(MLE)%20method%20is%20a%20more%20general,limited%20to%20linear%20regression%20models." 
target="_self"&gt;OLS and MLE&lt;/A&gt;)&lt;/P&gt;&lt;P&gt;According to this &lt;A href="https://community.jmp.com/t5/Tutorials/Using-Generalized-Regression-in-JMP-Pro-to-Create-Robust-Linear/ta-p/311091" target="_self"&gt;Generalized Regression tutorial&lt;/A&gt; posted by&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/2313"&gt;@gail_massari&lt;/a&gt;&amp;nbsp;, around time 8:35, my understanding is that&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/3552"&gt;@brady_brady&lt;/a&gt;&amp;nbsp;also says that the results between the MLE and the OLS should match.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Finally, would you mind pointing me to the JMP help page that indicates the Generalized Regression uses MLE instead of OLS when the distribution is normal?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Because of the image below (taken from JMP), I was under the wrong impression that the Generalized Regression would still use OLS if the distribution chosen was Normal:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image.png" style="width: 427px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/28587i524BDB1823CDECDF/image-dimensions/427x239?v=v2" width="427" height="239" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image.png" style="width: 441px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/28588i629058D29B121093/image-dimensions/441x224?v=v2" width="441" height="224" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thank you!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 24 Nov 2020 19:30:09 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/337195#M58465</guid>
      <dc:creator>AnnaPaula</dc:creator>
      <dc:date>2020-11-24T19:30:09Z</dc:date>
    </item>
    <item>
      <title>Re: Difference in results between Generalized RSquare and RSquare</title>
      <link>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/337200#M58466</link>
      <description>&lt;P&gt;Thank you!&lt;/P&gt;&lt;P&gt;I had no idea we were able to bootstrap the estimates in JMP. Thank you for teaching me that!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you also for all the other insights. Very helpful!&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 24 Nov 2020 19:57:58 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/337200#M58466</guid>
      <dc:creator>AnnaPaula</dc:creator>
      <dc:date>2020-11-24T19:57:58Z</dc:date>
    </item>
    <item>
      <title>Re: Difference in results between Generalized RSquare and RSquare</title>
      <link>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/337327#M58481</link>
      <description>&lt;P&gt;You are correct, the metrics for the whole model should not change with different parameterizations. Of course, the metrics for individual terms or parameters will change.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The coefficients are not the same under different parameterizations, so their estimates must be different, too.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The OLS solution and the MLE optimization should lead to the same estimates under the same parameterization with a normally distributed response.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I misled you: the standard least squares result in GenReg is OLS. See &lt;A href="https://www.jmp.com/support/help/en/15.2/?os=mac&amp;amp;source=application&amp;amp;utm_source=helpmenu&amp;amp;utm_medium=application#page/jmp/overview-of-the-generalized-regression-personality.shtml" target="_self"&gt;this page&lt;/A&gt; for more details.&lt;/P&gt;</description>
      <pubDate>Wed, 25 Nov 2020 11:56:26 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/337327#M58481</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2020-11-25T11:56:26Z</dc:date>
    </item>
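The equivalence described above can be verified directly: for a normal response, minimizing the negative log-likelihood recovers the closed-form least-squares coefficients. A minimal sketch with simulated data (illustrative only, not JMP's implementation):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 100
x = rng.normal(0.0, 1.0, n)
y = 1.0 + 3.0 * x + rng.normal(0.0, 0.5, n)
X = np.column_stack([np.ones(n), x])

# OLS: closed-form solution
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# MLE: minimize the negative normal log-likelihood over (beta0, beta1, log_sigma)
def negloglik(theta):
    beta, log_sigma = theta[:2], theta[2]
    sigma2 = np.exp(2.0 * log_sigma)
    resid = y - X @ beta
    return 0.5 * n * np.log(2.0 * np.pi * sigma2) + np.sum(resid ** 2) / (2.0 * sigma2)

res = minimize(negloglik, x0=np.zeros(3), method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-10, "maxiter": 20000})
beta_mle = res.x[:2]
# For a normal response, the MLE of the coefficients coincides with the OLS
# solution (the likelihood is maximized exactly where the SSE is minimized),
# so beta_mle matches beta_ols up to optimizer tolerance.
```

This is why, once the parameterization is accounted for, Standard Least Squares inside GenReg and Fit Least Squares should report the same coefficients for a normal response.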
    <item>
      <title>Re: Difference in results between Generalized RSquare and RSquare</title>
      <link>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/337391#M58491</link>
      <description>&lt;P&gt;Thank you for all the details and for pointing me to more information about GenReg.&lt;/P&gt;&lt;P&gt;But I am still confused: in this case (assuming a normal distribution for the response), what would explain the differences in the Generalized RSquare values for the validation sets between the two "platforms" (Fit Least Squares and GenReg)?&lt;/P&gt;&lt;P&gt;Sorry if I misunderstood something you said.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Wed, 25 Nov 2020 15:33:21 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/337391#M58491</guid>
      <dc:creator>AnnaPaula</dc:creator>
      <dc:date>2020-11-25T15:33:21Z</dc:date>
    </item>
    <item>
      <title>Re: Difference in results between Generalized RSquare and RSquare</title>
      <link>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/337528#M58505</link>
      <description>&lt;P&gt;I cannot say from this discussion what might cause a difference. I suggest that you contact JMP Technical Support (&lt;A href="mailto:support@jmp.com" target="_blank"&gt;support@jmp.com&lt;/A&gt;) about the matter.&lt;/P&gt;</description>
      <pubDate>Wed, 25 Nov 2020 18:44:48 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/337528#M58505</guid>
      <dc:creator>Mark_Bailey</dc:creator>
      <dc:date>2020-11-25T18:44:48Z</dc:date>
    </item>
    <item>
      <title>Re: Difference in results between Generalized RSquare and RSquare</title>
      <link>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/338333#M58615</link>
      <description>Thank you, Mark! I just opened a ticket about it.&lt;BR /&gt;I will post the answer here when I have it.</description>
      <pubDate>Mon, 30 Nov 2020 18:06:09 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/Difference-in-results-between-Generalized-RSquare-and-RSquare/m-p/338333#M58615</guid>
      <dc:creator>AnnaPaula</dc:creator>
      <dc:date>2020-11-30T18:06:09Z</dc:date>
    </item>
  </channel>
</rss>

