
PhamBao
Level III

How to pick independence parameters to optimize regression model

Hi team,



While performing Fit Model, I found that picking different parameters generates different outcomes. After building the model, I look at the data points with a high studentized residual and check whether they are valid or not. I have to repeat the task of picking parameters many times to get a proper model. Let's look at the example below.

Model 1:

[Screenshot of Model 1 fit results: PhamBao_0-1770115901201.png]

Model 2:

[Screenshot of Model 2 fit results: PhamBao_1-1770115938423.png]

As you can see, Model 2 has many more data points with high studentized residuals than Model 1 does. When I validated those high-residual data points, they were the points categorized as Bad. It seems that Model 2 is more robust.

My question is: although both models have high RSquare values, why can Model 2 screen out more bad data points? I suspect it comes down to the parameters I picked when creating the model.
Questions:
-If I classify data points as Good/Bad in a new column of the data set, is there any method in JMP that could suggest which parameters I should pick, so that the model is more robust at screening out Bad data points?

-If I do not classify the data points, is there any such method in JMP?

Hopefully I can get some advice from the community.

Much appreciated.
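There is no single JMP button for this, but the idea behind the first question can be sketched outside JMP: given a Good/Bad column, candidate factor sets can be scored by how cleanly each fitted model's studentized residuals separate the Bad runs from the Good ones. Below is a minimal numpy sketch on entirely synthetic data — the factor names, coefficients, and "upset" mechanism are illustrative assumptions, not the poster's actual data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Synthetic data: weight truly depends on force and thickness;
# temperature is a distractor with no real effect.
temperature = rng.normal(150, 5, n)
force       = rng.normal(30, 2, n)
thickness   = rng.normal(2, 0.1, n)
weight = 4 + 0.3 * force + 2.0 * thickness + rng.normal(0, 0.2, n)

# "Bad" runs: an unrecorded process upset shifted the weight
bad = np.zeros(n, dtype=bool)
bad[:20] = True
weight[bad] += 1.5

def abs_studentized(cols):
    """|internally studentized residuals| for an OLS fit on the given columns."""
    X = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(X, weight, rcond=None)
    e = weight - X @ beta
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages
    s2 = e @ e / (n - X.shape[1])
    return np.abs(e) / np.sqrt(s2 * (1 - h))

# Score each candidate factor set by the Bad-vs-Good residual gap
t_weak   = abs_studentized([temperature])       # misses the real factors
t_strong = abs_studentized([force, thickness])  # the informative set
for name, t in (("temperature only", t_weak), ("force + thickness", t_strong)):
    print(name, round(t[bad].mean() - t[~bad].mean(), 2))
```

The factor set that best explains the Good runs leaves the smallest residual spread, so the Bad runs stand out most against it — which is essentially why Model 2 in the post screens out more bad points.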

3 REPLIES
dlehman1
Level VI

Re: How to pick independence parameters to optimize regression model

My advice is for you to provide more information.  Certainly your second model seems to fit the data better than the first, but I don't think that is sufficient to decide which model (if either) to use.  You say you picked different parameters but I think you mean you picked different variables as factors in your model.  Deciding what variables to use is a complicated question, only answered in part by how well the model fits the data.  It looks like both models are linear models, so there are a variety of nonlinear models you might want to try - especially if you are concerned about the outliers.  Regarding those large residuals, I have two thoughts.  First, with so much data I doubt that the large residuals influence the model much.  Second, you should explain why you are concerned with the large residuals (and one option to try would be a log transformation).

In general, I would need to know much more about what your response variable is, what kind of data you have, and what the purposes of your analysis are.  The "best fitting model" is only one part of the analysis.  And most measures of "best fit" look at averages rather than particular kinds of observations - your concern about the large residuals suggests that some points matter more to you than others.  If that is the case, then a linear regression model (in which those large residuals don't matter much) might not be the preferred model to use; for example, quantile regression might be more appropriate if you are concerned with the accuracy of your model in particular regions of observations rather than on average.
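The log-transformation suggestion above is easy to demonstrate outside JMP. A small numpy sketch on synthetic data (the exponential trend and noise level are assumptions chosen only to illustrate the effect): when the noise is multiplicative, fitting on the log scale straightens the trend and stabilizes the spread, which also tames the large residuals:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(1, 10, n)
# Multiplicative noise: the spread of y grows with its mean
y = 2.0 * np.exp(0.3 * x) * np.exp(rng.normal(0, 0.1, n))

def r_squared(resp):
    """R^2 for a straight-line OLS fit of resp on x."""
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, resp, rcond=None)
    e = resp - X @ beta
    return 1 - (e @ e) / ((resp - resp.mean()) ** 2).sum()

r2_raw = r_squared(y)          # line fit to curved, fanning-out data
r2_log = r_squared(np.log(y))  # log scale is linear with constant spread
print(round(r2_raw, 3), round(r2_log, 3))
```

On the raw scale the straight-line fit leaves systematic, heteroscedastic residuals; on the log scale the model assumptions hold and the fit improves.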

PhamBao
Level III

Re: How to pick independence parameters to optimize regression model

"what the purposes of your analysis are"
Typically, I have 5 factors (temperature, force, height, thickness, and time) in the recipe used to make the product (weight). My assumption is that if I have a sufficient historical data set of the factors (temperature, force, height, thickness, and time) and the dependent variable (weight), I can build a model and obtain its formula, like y = b0 + b1*x1 + b2*x2 + .... I then apply this formula in my online process system to compare the actual weight with the weight predicted by the model. I may not be correct, but I assumed that the higher the residual, the higher the chance that one or two of the factors are abnormal, causing the high residual. That is why I would like to use a regression model for this purpose.
Back to my example above: since I have the historical data set, I knew for certain which output was bad. I then built the models to see whether a regression model could screen out bad product. Of the two models above, Model 2 was the one I was looking for, since it pointed out the data points with high residuals. However, to get Model 2, I spent a lot of time on different trials before arriving at a proper model.
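The monitoring scheme described above — compare actual vs. predicted weight and flag large residuals — can be sketched outside JMP. Everything below is synthetic and hypothetical: the factor names, coefficients, and the 3-sigma threshold are illustrative assumptions, not the poster's process values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical history of the five recipe factors and the resulting weight
temperature = rng.normal(150, 5, n)
force       = rng.normal(30, 2, n)
height      = rng.normal(10, 0.5, n)
thickness   = rng.normal(2, 0.1, n)
run_time    = rng.normal(60, 3, n)
weight = (5 + 0.10 * temperature + 0.30 * force + 0.50 * height
          + 2.00 * thickness + 0.05 * run_time + rng.normal(0, 0.2, n))

# Fit y = b0 + b1*x1 + ... + b5*x5 by least squares
X = np.column_stack([np.ones(n), temperature, force, height, thickness, run_time])
beta, *_ = np.linalg.lstsq(X, weight, rcond=None)
e = weight - X @ beta
s = np.sqrt(e @ e / (n - X.shape[1]))   # residual standard error

# Online check: recipe factors read nominal, but an unrecorded upset
# shifts the actual weight well above the model's prediction
x_new = np.array([1.0, 150.0, 30.0, 10.0, 2.0, 60.0])
y_new = x_new @ beta + 2.0
resid_new = y_new - x_new @ beta
flagged = abs(resid_new) > 3 * s        # simple 3-sigma residual rule
print(flagged)
```

This is the logic of the poster's idea: a residual far beyond the model's usual scatter suggests something in the run was abnormal, even when the recorded factors look nominal.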

statman
Super User

Re: How to pick independence parameters to optimize regression model

Developing models based on historical data is both challenging and dangerous. I suggest using the historical data to develop hypotheses that can be explored through experimental design. Historical data lacks context. For example, do you know the measurement errors associated with the value for each x and the Y? Do you know what other factors were doing during the data collection? Are there any data distortion issues? What about potential lagged factor effects? What about variation in raw materials? While your thinking is not wrong, it is dangerous. If you get a large residual value, that is indicative of a model problem, not "bad" data as you suggest. If you want tools to assess consistency of weight through time, sampling is a more effective tool.

While there is no one right way, when building models from historical data I usually suggest an additive approach. Also, this cannot be done without subject-matter expertise (SME). Start with first-order terms and add higher-order terms as appropriate. There are a number of statistics that can help (e.g., R-sq Adj, the gap between R-sq and R-sq Adj, RMSE, p-values, residuals, AIC, BIC). Residuals are the actual minus predicted (from the model) values. They provide insight into the basic assumptions (errors NID(0, constant variance)). They are useful for identifying outliers, where the model does a poor job of predicting specific data points. They can also flag problems with the assumption of random errors with a mean of 0, which again suggests issues with the model. And they can often detect model inadequacies, where perhaps a nonlinear term should be introduced.
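The additive approach described above — start first order, add terms, and compare fit statistics — can be sketched outside JMP. A minimal numpy illustration on synthetic data (the curvature term, coefficients, and noise level are assumptions made up for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 150
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
# The true relationship carries a curvature term in x1
y = 1 + 2.0 * x1 + 1.5 * x1**2 + 0.5 * x2 + rng.normal(0, 0.5, n)

def fit_stats(X):
    """Return (adjusted R^2, AIC up to an additive constant) for an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = ((y - X @ beta) ** 2).sum()
    p = X.shape[1]
    tss = ((y - y.mean()) ** 2).sum()
    r2_adj = 1 - (rss / (n - p)) / (tss / (n - 1))
    aic = n * np.log(rss / n) + 2 * p
    return r2_adj, aic

ones = np.ones(n)
r2a_first, aic_first = fit_stats(np.column_stack([ones, x1, x2]))
r2a_quad,  aic_quad  = fit_stats(np.column_stack([ones, x1, x2, x1**2]))
print("first order:", round(r2a_first, 3), round(aic_first, 1))
print("+ x1^2 term:", round(r2a_quad, 3), round(aic_quad, 1))
```

Adding the term the data actually needs raises adjusted R-sq and lowers AIC; adding a useless term would do neither, which is how these statistics guard against overfitting during the additive build-up.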

"All models are wrong, some are useful" G.E.P. Box
