AAYH
Level II

Bad RMSE for Validation data

Hi

Using "Fit Model" I have created a model to identify moisture content in a product using different process parameters. The model is fine (Rsq=0.97) and a fairly low RMSE. The variables used all have Logworth higher than the limit.  The problem arises when I want to take in new data for validation. Here the prediction is very bad. The data that I want to validate is not very different from the data used in test-set. When I include the validation data as testset-data, they fit very well into the model the moisture prediction error is similar to the "orginal" data

I have a model consisting of both main effects and interactions. What am I doing wrong?


6 REPLIES
Victor_G
Super User

Re: Bad RMSE for Validation data

Hi @AAYH,

 

Welcome to the Community!

It seems that you're experiencing overfitting on your training data. This may be an indication that your model is too complex and doesn't generalize well to new data. It's a little hard to help without any example data (anonymized data, for example) to investigate in detail the steps you have taken and the type of data you have. Nevertheless, here are some questions to start:

  • Have you checked correlations between your inputs (menu Analyze -> Multivariate Methods -> Multivariate), or
  • Have you checked multicollinearity among your process factors (in the "Fit Model" platform, in the "Parameter Estimates" panel, right-click on the parameter columns and select Columns -> VIF to check their values)?

This could be a direction to investigate, as correlations or multicollinearity can severely degrade the performance of linear models if no precautions are taken.
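For readers who want to reproduce these two checks outside JMP, here is a minimal Python sketch (not JMP code; the file and column names are hypothetical stand-ins for your process parameters):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("process_data.csv")        # hypothetical file
X = df[["Temp", "Pressure", "FlowRate"]]    # hypothetical factors

# Check 1: pairwise correlations
# (JMP: Analyze -> Multivariate Methods -> Multivariate)
print(X.corr())

# Check 2: variance inflation factors
# (JMP: right-click Parameter Estimates -> Columns -> VIF)
Xc = sm.add_constant(X)
for i, name in enumerate(Xc.columns):
    if name != "const":
        print(name, variance_inflation_factor(Xc.values, i))
```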


If you have correlation or multicollinearity among your factors, it may be safer to:

  1. Choose a model able to handle multicollinearity (Partial Least Squares or Bootstrap Forest, for example), or
  2. Pre-process your inputs with a PCA and use the principal components in your linear model, or
  3. Use penalized methods (JMP Pro, Generalized Regression) like LASSO, Ridge or Elastic Net to create more robust linear regression models (see the sketch after this list).
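As a rough illustration of option 3 outside JMP (assumed file and column names; scikit-learn's cross-validated Ridge and Lasso stand in for JMP Pro's Generalized Regression):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

df = pd.read_csv("process_data.csv")               # hypothetical file
X, y = df.drop(columns="Moisture"), df["Moisture"]

# Compare plain least squares against penalized fits using
# cross-validated RMSE, the metric that exposed the problem here.
for name, est in [("OLS", LinearRegression()),
                  ("Ridge", RidgeCV(alphas=np.logspace(-3, 3, 25))),
                  ("Lasso", LassoCV(cv=5))]:
    pipe = make_pipeline(StandardScaler(), est)    # penalties need scaled inputs
    rmse = -cross_val_score(pipe, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: cross-validated RMSE = {rmse:.3f}")
```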

Let us know what this first investigation brings. If you can provide some example data, it will be easier for other members to take part in this discussion.

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics
AAYH
Level II

Re: Bad RMSE for Validation data

Hi Victor, thank you for your response; I think you are right. I will look into it ASAP.

statman
Super User

Re: Bad RMSE for Validation data

First, I'm not sure I understand the issue. Are you able to post your data table? What confuses me is the apparent conflict depending on how you evaluate the "new" data; could you describe how you did this? What is the delta between predicted and actual values (the residuals)? Have you analyzed the residuals?

 

In addition to Victor's excellent thoughts regarding multicollinearity, here are mine:

1. What is the source of the data you used to run Fit Model? Is it production data? Is it from an experiment?

2. How representative is your data of future conditions?

3. As Victor suggests, there are a number of questions about the data. There is no one statistic that provides all of the information about our model. R-squared by itself has very little utility (it will always increase as you add degrees of freedom to your model). You want to compare R-squared with adjusted R-squared; this provides insight into whether you are over-fitting the model.
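A quick simulated illustration of that point (not the poster's data): training R² keeps climbing as useless predictors are added, while adjusted R² applies the penalty R²_adj = 1 - (1 - R²)(n - 1)/(n - p - 1).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 30
X = rng.normal(size=(n, 10))           # 10 candidate predictors
y = 2 * X[:, 0] + rng.normal(size=n)   # only the first one matters

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R² never decreases as predictors are added; adjusted R² can.
for p in (1, 5, 10):
    fit = LinearRegression().fit(X[:, :p], y)
    r2 = r2_score(y, fit.predict(X[:, :p]))
    print(f"p={p:2d}  R2={r2:.3f}  adjusted R2={adjusted_r2(r2, n, p):.3f}")
```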

 

"All models are wrong, some are useful" G.E.P. Box
AAYH
Level II

Re: Bad RMSE for Validation data

Hi Statman.


Thank you for taking the time to answer my question, and thank you for your valuable input. I will look into it ASAP.

AAYH
Level II

Re: Bad RMSE for Validation data

Hi again.

 

I made a Definitive Screening Design in which I blocked on day. I now know that I might have made a mistake by not taking the uncontrolled variables into consideration from the beginning.

I have 3 variables (UB1-3) which I changed during two different production runs (Block 1 and Block 2). Uncontrolled variables 1 and 2 are measured, but I can't change them. My Quality variable is measured at the end of the process. Uncontrolled variable 2 is highly dependent on day/Block and will affect uncontrolled variable 1 as well as UB1 to UB3.

I am just curious to understand whether I can use the data I have now or whether I need to do another experiment. Thank you very much for your time.

Victor_G
Super User

Re: Bad RMSE for Validation data (Accepted Solution)

Hi @AAYH,

 

First, is this DSD linked to the problem of the RMSE changing with validation data, or is it a separate issue? I'm just trying to see whether this is an example dataset to illustrate your problem or something in addition to it. What is your objective with this DoE? Also note that there is a missing value in your data table for the second uncontrolled variable.

 

If I understand correctly, you have added a blocking factor (created as a categorical factor in the DSD creation?), plus two uncontrollable variables that you recorded but that are not randomized/controlled in a DoE way.

As you have already seen, there are strong correlations between your two uncontrollable variables, and between Block and uncontrolled variables 1/2, which may greatly impact the predictive performance of your model on new data (script "1. Multivariate" in the attached data table).

I have added some column properties to your random block factor and uncontrollable variables (so that they can be used correctly in JMP platforms), and then fit two models (a non-JMP sketch of the mixed-model idea follows the list below):

  • One with the Mixed model personality, specifying "Block" as a random effect (script "2.a. Fit Mixed"),
  • One with the Standard Least Squares personality with "Block" as a random effect (script "2.b. Fit Model (with random Block)").
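For reference, a minimal sketch of the same mixed-model idea outside JMP, using statsmodels (the column names Quality, UB1-UB3, U1, U2 and Block are assumptions mirroring the thread's description):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("dsd_data.csv")   # hypothetical export of the data table

# Fixed effects for the controlled and uncontrolled variables,
# random intercept for Block to absorb day-to-day variation.
m = smf.mixedlm("Quality ~ UB1 + UB2 + UB3 + U1 + U2",
                data=df, groups=df["Block"]).fit()
print(m.summary())
```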

 

Not surprisingly, the two models give similar results, and, as you expected, uncontrollable variables 1 and 2 seem to have an important effect on the response Quality. The random block factor does not appear to be significant.

The models themselves are quite reasonable (depending on the precision you expect from the model), with a high R² (0.87) and a fairly low RMSE (around 0.5). However, as already seen in the Multivariate platform (correlations), some of your factors are linearly dependent on each other, which creates high VIFs (around 20 for the two uncontrolled variables) and may inflate the variance of the predictions:

 

[Screenshot: Parameter Estimates panel showing VIFs around 20 for the two uncontrolled variables]

One option to reduce this collinearity is to create principal components from these two variables (script "3.a. PCA of uncontrolled variables 1&2") and fit the same type of model as before, with uncontrolled variables 1 and 2 replaced by their two principal components (script "3.b. Fit Model with PCs"):

[Screenshot: Fit Model report using the two principal components in place of uncontrolled variables 1 and 2]

This won't change the performance of the model in terms of R² and RMSE, but decreasing the VIFs helps reduce the variance of the parameter estimates (for the parameters involved in the collinearity). A rough non-JMP sketch of this approach is below.
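A minimal sketch of the PCA-then-refit step, under the same assumed column names as the mixed-model sketch above:

```python
import pandas as pd
from sklearn.decomposition import PCA
import statsmodels.formula.api as smf

df = pd.read_csv("dsd_data.csv")   # hypothetical export of the data table

# Replace the two correlated uncontrolled variables by their
# principal components, which are orthogonal by construction.
pcs = PCA(n_components=2).fit_transform(df[["U1", "U2"]])
df["PC1"], df["PC2"] = pcs[:, 0], pcs[:, 1]

# Same model form with the collinear pair swapped for the PCs.
# Keeping both PCs spans the same space as U1/U2, so R² and RMSE
# are unchanged, but the VIFs driven by the U1-U2 correlation
# drop sharply.
m = smf.ols("Quality ~ UB1 + UB2 + UB3 + PC1 + PC2", data=df).fit()
print(m.summary())
```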

Concerning your initial question of whether you can use the data you have now: it all depends on your objective (explanation and/or prediction) and on how you evaluate the "usefulness" of your data (prediction precision?).

From this first data table there are already some important things to notice, but without further information it is hard to interpret or to draw conclusions about the use of the data (for example, the representativeness of the ranges of uncontrolled variables 1 and 2, or the handling needed for the missing value). It is, however, an interesting first step from which you could augment your design, to better take your two uncontrolled variables into account by turning them into controlled factors in your experimentation.

 

Note that what I have done may not be the best approach depending on your objectives and needs, and other options may also be available for the same tasks.

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics