Solved: interpretation of parameter estimates for categorical variable

matthias_bruchh · Jun 8, 2023 5:23 PM

I have a model with only a few active effects, namely the three main effects in the table. epsilon_r^n and Rt^n are continuous variables, but E^n is a categorical variable that can take the values -1 and +1.

From the prediction profiler, it is clear that in all three cases, a higher value for the independent variable leads to a lower predicted N25 (which is also expected from a physical point of view).

Then why are the estimated coefficients for epsilon_r^n and Rt^n negative, but the coefficient for E^n is positive? How should I interpret the expression "[-1 -1]" in the parameter table?

Saving the model to the data table gives access to the following model formula:

So the higher value of E^n (i.e. 1) leads to a lower value in the formula (i.e. 0). I'm not sure how I should get this information from the parameter estimates table. Could someone explain this to me, please?

SDF1 · Oct 30, 2020 10:25 AM

Hi @matthias_bruchh ,

First off, do you have JMP Pro? If so, I highly recommend running a bootstrap calculation on the parameter estimates to get a better understanding of which parameters are actually 0 (or near 0) and which are really non-0 as well as their typical values (mean of thousands of bootstraps).

That being said, since you have made E^n nominal, JMP labels its parameter estimate with the levels being compared, in this case [-1 -1], meaning level -1 relative to level 1. To better understand how the E^n predictor fits into your model, you might want to change it to ordinal instead.

The reason why E^n gets a positive coefficient (or 0) has to do with the function you get back out for your prediction. Your formula is basically the following:

y = exp( a + b*x_1 + c*x_2 + d*x_3 + f )

To get a larger y response, the argument of the exponent should be as large as possible, hence if the coefficients for x_1 and x_2 are negative and "a" and "f" are positive, the only way to get a larger value for the argument is to make the coefficient for x_3 (i.e. E^n) be positive.
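To make the sign question concrete, here is a minimal Python sketch (not JMP output) with a single two-level factor and made-up numbers. It assumes the nominal term is coded as an indicator for level -1 with level +1 as the reference, which is what the saved formula described in the question suggests; the data and the names `predict_a` and `predict_b` are hypothetical. With only one factor, the least-squares fit on log(y) reduces to the group means, so no fitting library is needed.

```python
import math
from statistics import mean

# Hypothetical data: the response is lower at E = +1 than at E = -1.
E = [-1, -1, -1, 1, 1, 1]
y = [20.0, 22.0, 21.0, 10.0, 11.0, 9.5]
log_y = [math.log(v) for v in y]

mean_lo = mean(v for e, v in zip(E, log_y) if e == -1)
mean_hi = mean(v for e, v in zip(E, log_y) if e == 1)

# Coding A: coefficient applies at level -1, level +1 is the reference (term = 0).
intercept_a = mean_hi
coef_a = mean_lo - mean_hi  # positive, because level -1 has the larger response

# Coding B: coefficient is the step from level -1 up to level +1.
intercept_b = mean_lo
coef_b = mean_hi - mean_lo  # same magnitude as coef_a, opposite sign


def predict_a(e):
    return math.exp(intercept_a + (coef_a if e == -1 else 0.0))


def predict_b(e):
    return math.exp(intercept_b + (coef_b if e == 1 else 0.0))


for level in (-1, 1):
    print(level, predict_a(level), predict_b(level))
```

Both codings give the same predictions at each level; only the reference level, and therefore the sign of the reported coefficient, differs.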

You mentioned that the behavior of N25 makes physical sense when looking at how it responds to the predictors. If you are looking at a physical (or chemical) process, do you have a theoretical equation to start with? If so, you should use that as your predictive formula and not have JMP generate an empirical one for you. For example, if you're measuring the time it takes for a ball to fall x meters, there is a physical formula that predicts this, namely: y = y_0 + v_0*t + 1/2*g*t^2, the formula for motion in a gravitational field. When analyzing such data in JMP, I would fix the formula to the theoretical model and see if there are any deviations from theory.
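As a rough illustration of fixing the formula to theory and then looking at the deviations, here is a small Python sketch for the falling-ball case. It assumes the ball is released from rest at the origin (y_0 = 0, v_0 = 0), so g is the only free parameter; the measurements are made up for the example.

```python
# Hypothetical drop-test measurements: time in seconds, distance fallen in metres.
t = [0.1, 0.2, 0.3, 0.4, 0.5]
y = [0.050, 0.195, 0.443, 0.783, 1.228]

# Theory with y_0 = 0 and v_0 = 0: y = 0.5 * g * t^2.
# Minimising sum((y - 0.5*g*t^2)^2) over g gives the closed form below.
g_hat = 2.0 * sum(yi * ti**2 for ti, yi in zip(t, y)) / sum(ti**4 for ti in t)

# Deviations from theory: what is left after the physical model is removed.
residuals = [yi - 0.5 * g_hat * ti**2 for ti, yi in zip(t, y)]

print("estimated g:", g_hat)
print("residuals:", residuals)
```

If the residuals show a trend rather than scatter, that would hint at something the theoretical model leaves out (air resistance, for instance).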

But again, I think if you change E^n to ordinal, it might make more sense to you. Also, if you really do need to have JMP generate an empirical formula, then you'll want to bootstrap or simulate the estimates to see which are really 0 or close to zero to make sure you're not missing anything in your model.

Hope this helps.

Good luck!,

DS

matthias_bruchh · Oct 30, 2020 12:33 PM

Dear DS,

Changing E^n from nominal to ordinal indeed did the trick. The absolute value for the parameter is the same as before, but the sign changed to negative, which is much more intuitive. Thanks a lot!

I indeed have JMP Pro (but only got it recently).

I don't have a theoretical formula I could use in this case, but would be interested in knowing how that is done. If you could point me towards a tutorial or so, I'd be grateful. The same holds for the bootstrapping you mention.

Regards,

Matthias

SDF1 · Oct 30, 2020 01:28 PM

Hi @matthias_bruchh ,

Glad the change to ordinal worked and helped out.

As far as the theoretical model goes, if you have a function, then you can use the "formula" feature in the column properties to "program" a column based on the predictors. But, it sounds like you don't have a theoretical model.

For bootstrapping the estimates, right-click the estimates column in the JMP report and select Bootstrap. If the fit is fast, put in maybe 5000 samples, but if it's slow, leave it at 2500. Be sure to check the "Fractional Weights" box. JMP will re-run the fit many times and then give you a distribution of values for each predictor's coefficient (just run the "distribution" script in the output data table). Then, with the distributions and statistics that show up, you can right-click the summary statistics and select "make into combined data table". Be sure to also customize the summary statistics to show the proportion nonzero.

The combined data table will contain the mean and proportion nonzero for your estimates. There you can see what the mean (typical value) is as well as how often (percent-wise) that estimate is nonzero -- if that number is small, then the estimate is often zero. If it's big, then most of the time it is non-zero. This is a good way to determine what terms and mixed terms you need or don't need in your model.
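For anyone who wants to see the idea outside JMP, here is a rough Python sketch of a fractional-weight bootstrap that reports the mean and the proportion nonzero for each coefficient. Several things are assumptions: the data are simulated (a replicated two-level factorial where only the first two factors matter), the weights are drawn from an exponential distribution and rescaled, which may differ from JMP's scheme, and a small weighted lasso stands in for the actual estimation method, since exact zeros only arise with a selection or penalized fit. It uses 200 resamples rather than 2500 to keep it quick.

```python
import random


def soft_threshold(value, lam):
    """Shrink value toward zero by lam; exactly 0.0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def weighted_lasso(X, y, w, lam, sweeps=50):
    """Coordinate descent for 0.5*sum(w*(y - b0 - X.b)^2) + lam*sum(|b|)."""
    n, p = len(y), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(sweeps):
        fitted = [sum(X[i][k] * b[k] for k in range(p)) for i in range(n)]
        b0 = sum(w[i] * (y[i] - fitted[i]) for i in range(n)) / sum(w)
        for j in range(p):
            # Weighted correlation of predictor j with the partial residual
            # (current fit with predictor j left out).
            rho = sum(
                w[i] * X[i][j]
                * (y[i] - b0 - sum(X[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            )
            z = sum(w[i] * X[i][j] ** 2 for i in range(n))
            b[j] = soft_threshold(rho, lam) / z
    return b0, b


random.seed(0)
levels = (-1.0, 1.0)
# Hypothetical design: 2^3 factorial run twice; the third factor has no effect.
X = [[a, b_, c] for a in levels for b_ in levels for c in levels] * 2
y = [5.0 + 3.0 * r[0] - 2.0 * r[1] + random.gauss(0.0, 0.5) for r in X]

n_boot = 200
sums = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
for _ in range(n_boot):
    # Fractional weights: every run stays in the fit, with a random positive weight.
    raw = [random.expovariate(1.0) for _ in y]
    scale = len(y) / sum(raw)
    w = [scale * v for v in raw]
    _, coefs = weighted_lasso(X, y, w, lam=4.0)
    for j, c in enumerate(coefs):
        sums[j] += c
        counts[j] += c != 0.0

means = [s / n_boot for s in sums]
prop_nonzero = [c / n_boot for c in counts]
print("mean estimate:     ", means)
print("proportion nonzero:", prop_nonzero)
```

A term whose proportion nonzero is low across the resamples is a candidate for removal; a term that is nonzero almost every time is worth keeping.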

You can also choose to "simulate", but this requires another formula column in the data table to swap out with a formula column you used in your model. This is a bit more involved -- see Autovalidation from Gotwalt and Ramsey. There's a PPT and video to watch that explains it. It works slightly differently and will give slightly different results from the bootstrap approach, but does basically the same thing. I use these approaches all the time to eliminate factors in the model.

One very nice difference about the autovalidation approach is that it generates a null factor that is completely random and orthogonal to your data. Whenever that shows up more often in the simulations than any of your other factors, then you can definitely eliminate those that show up less frequently. Be careful here to make sure that you impose the hierarchy and heredity rules for factors and their combinations. In fact, if you have no theoretical model and no reason to include mixed terms, I'd leave them out altogether. It's just "voodoo", as a data scientist I know likes to say.

By the way, is the data you're analyzing coming from a DOE or just regular lab studies?

Good luck!,

DS

matthias_bruchh · Nov 5, 2020 02:04 AM

Thanks DS,

I will have a look at this. And yes, the data is coming from a DOE. (Actually from 3 DOEs. The project was structured in three successive phases. I made three independent DOEs for these, because there were many constraints and (at the time) it was not possible to implement them with the "augment design" feature from JMP.)

SDF1 · Nov 5, 2020 09:34 AM

Hi @matthias_bruchh ,

I hope it helps. It can sometimes be overwhelming trying to determine what one should really consider when developing an empirical model.

Since the data all come from DOEs, if you go with the autovalidation approach, you can do this in many different platforms to see how stable the leading factors are. For example, you can do standard least squares, GenReg, Gen Lin Model, PLS, etc. If you select GenReg, be aware that when choosing the estimation methods, the first five options are really the only valid ones for DOEs. The other options (Lasso, EN, etc.) are penalized regression methods that are really more for machine learning, which I don't think is what you're after. You can also do the simulation and bootstrapping of the column contributions SS (or Portion) when performing a boosted tree or bootstrap forest model. Again, if the decision tree decides to split more often on the null factor, then you know factors below that are meaningless in the model, since the null factor is not only completely random but also orthogonal to your data.

Good luck with the analysis!

DS

SDF1 · Nov 5, 2020 09:45 AM

I forgot to mention to keep an eye on the Std Error column for your parameter estimates. If the magnitude of this value is larger than the magnitude of your parameter estimate, then it's an indication that your parameter estimate is likely close to zero and not actually contributing to the model in a significant way.
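As a quick illustration of that rule of thumb, here is a minimal Python sketch using ordinary simple linear regression. The two data sets are made up (one with a real slope, one that is essentially noise), and the helper name `slope_and_std_error` is hypothetical; JMP reports the same two quantities in the Estimate and Std Error columns.

```python
import math


def slope_and_std_error(x, y):
    """Least-squares slope of y on x and its standard error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    # Residual variance uses n - 2 degrees of freedom (intercept and slope).
    std_error = math.sqrt(sse / (n - 2) / sxx)
    return slope, std_error


x = [1, 2, 3, 4, 5, 6, 7, 8]
y_strong = [2.1, 3.9, 6.1, 7.9, 10.1, 11.9, 14.1, 15.9]  # roughly y = 2x
y_weak = [1.0, -1.0, 1.2, -0.8, 0.9, -1.1, 1.1, -0.9]    # no real trend

for name, y in (("strong", y_strong), ("weak", y_weak)):
    slope, se = slope_and_std_error(x, y)
    flag = "likely near zero" if se > abs(slope) else "looks real"
    print(name, slope, se, flag)
```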
