Re: Generalized Regression vs Standard Least Squares

avogadrosmole · Jun 10, 2023 1:41 PM

I have a 32 row data set of 6 factors and 1 response from a DOE I just performed. The response data is not normally distributed and should only contain values between 0 and 1. I saw the generalized regression platform in JMP Pro allows you to specify a variety of distributions including the beta distribution where the response is between 0 and 1. However, I also found a function I can use to transform my response data to be normally distributed and it also only allows values between 0 and 1 for the untransformed response.

My question is: what would be the difference between using generalized regression with the beta distribution specified vs applying the transformation to my data set and then using standard least squares (and stepwise) with the transformed data?

P_Bartell · Dec 10, 2020 09:56 AM

Why is it a big deal that your response data is not normally distributed? I'd suspect something was awry IF your response data is normally distributed. Remember, generally speaking in DOE we are examining a wide space of k factors hoping to elicit a response signal above the noise. It is one of the great misconceptions of statistics that response data should be normally distributed for the magic of modeling (OLS or otherwise) to work. For OLS, what is assumed is that the errors with respect to the predicted y's are normally distributed. That's very different from the raw response data itself being normally distributed. Now, having said all this, why don't you try ALL your ideas for modeling and, in the context of the practical problem at hand, decide which model is most 'useful'? That's what you're after, I suspect...
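To make the distinction between the raw response and the residuals concrete, here is a minimal sketch in Python (standard library only). Everything in it is an assumption made for illustration: the single two-level factor, the effect size of 5, the noise standard deviation of 1, and the 1000 simulated runs are invented and have nothing to do with the original 32-row data set. It uses kurtosis (about 3 for normal data) as a rough normality check.

```python
import random
import statistics


def kurtosis(values):
    """Return the sample kurtosis (roughly 3 for normally distributed data)."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical experiment: one factor at coded levels -1 and +1,
# a large factor effect, and normally distributed noise.
x = [-1.0, 1.0] * 500
y = [10.0 + 5.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Ordinary least squares for a single factor, computed by hand.
x_mean = statistics.fmean(x)
y_mean = statistics.fmean(y)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# Residuals: observed response minus predicted response.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

raw_kurtosis = kurtosis(y)
residual_kurtosis = kurtosis(residuals)

print(f"kurtosis of raw response: {raw_kurtosis:.2f}")
print(f"kurtosis of residuals:    {residual_kurtosis:.2f}")
```

In this simulated case the raw response is strongly bimodal because the factor effect is large relative to the noise, so it is far from normal, while the residuals from the fitted model look close to normal. That is the situation in which OLS is perfectly appropriate even though a histogram of the response looks alarming.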

HadleyMyers · Dec 11, 2020 3:13 AM

Hi,

I'm wondering about the cases where data isn't normally distributed because there is a hard stop (at 0 or 100%, for example). The residuals wouldn't be normally distributed either because they'd be skewed by the boundary. Does your answer still apply in these cases?

P_Bartell · Dec 11, 2020 10:07 AM

@HadleyMyers The key is not to assume a priori, before doing any modeling, that simply because the raw response data is not normally distributed the OLS modeling assumptions will not be met. At the very least try an OLS model...then if the residuals are NOT normally distributed, it's time to consider some alternative modeling approaches...transforming variables, alternative modeling techniques, etc.

People, especially newer practitioners of statistical methods, oftentimes get hung up on this: "the data has to be normally distributed to (fill in the blank statistical method)." I can't tell you how many times over the years people told me, "I can't use control charts because my data is not normally distributed."

Mark_Bailey · Dec 11, 2020 10:26 AM

You might also try another approach: use the logit transform on the response. Complete the Fit Model dialog as usual. Then select the Y column, click the red triangle next to Transform, and select Logit.

NOTE: Don't try this example; the response weight does not have the appropriate range of 0-1.
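For readers who want to see what the logit transform suggested above does to a 0-1 response, here is a minimal sketch in Python (standard library only). This is not JMP's implementation; the function names and the sample response values are hypothetical and chosen only for illustration.

```python
import math


def logit(p):
    """Map a proportion strictly between 0 and 1 onto the whole real line."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is only defined for values strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Map a value on the logit scale back to the 0-1 scale."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical responses, all strictly inside the (0, 1) interval.
responses = [0.05, 0.25, 0.5, 0.9]

# Transform, fit the model on this scale, then back-transform predictions.
transformed = [logit(p) for p in responses]
back_transformed = [inverse_logit(z) for z in transformed]

for p, z in zip(responses, transformed):
    print(f"response {p:.2f} -> logit {z:+.3f}")
```

Two practical points follow from the definition. Responses of exactly 0 or 1 have no finite logit, so they need separate handling before this transform can be applied. And because predictions made on the logit scale are back-transformed through `inverse_logit`, they always fall between 0 and 1, which is the property the original question was after.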