Here are my thoughts (though you may disagree and decide to ignore them):
I first want to address the repeat measures. Repeats (or within-treatment measurements) are used for two purposes:
1. To increase the precision of the design without restricting the inference space. Averaging reduces the short-term noise components (e.g., measurement error, within-part variation, within-batch variation) and therefore increases the precision of the design. Before you summarize the data, you should assess the variability: is an average an appropriate statistic? Graphical techniques may be very useful. Looking at the distribution may help, but it can also hide time-series effects. You might want to assess how consistent the variation in the repeats is across the treatments, perhaps with range charts (see the first sketch after this list).
2. If your problem is a variation problem, not a mean problem, then the appropriate response variable must be a measure of variability. Repeats enable you to estimate the variation within treatment (e.g., range, standard deviation, variance). Using a measure of dispersion as the response variable when analyzing the experiment will help you understand factor effects on variation (see the second sketch below). Again, the nature of the data should be evaluated before choosing the appropriate enumerative statistic.
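A minimal sketch of the kind of range chart I have in mind for point 1 (the data, column names, and subgroup size are all hypothetical; with 4 repeats per treatment the standard Shewhart constants are D3 = 0 and D4 = 2.282):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: 8 treatments with 4 repeats each (names and sizes are made up).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "treatment": np.repeat(np.arange(1, 9), 4),
    "y": rng.normal(loc=10.0, scale=0.5, size=32),
})

# Range of the repeats within each treatment.
r = df.groupby("treatment")["y"].agg(lambda g: g.max() - g.min())

# Shewhart range-chart limits for subgroups of n = 4: UCL = D4 * R-bar
# with D4 = 2.282; D3 = 0, so the lower limit is zero.
r_bar = r.mean()
ucl = 2.282 * r_bar

plt.plot(r.index, r.values, "o-")
plt.axhline(r_bar, linestyle="--", label="R-bar")
plt.axhline(ucl, color="red", label="UCL")
plt.xlabel("treatment")
plt.ylabel("range of repeats")
plt.legend()
plt.show()
```

A treatment whose range sits near or above the UCL suggests the repeats are not behaving consistently there, and an average may not be an appropriate summary.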
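And a sketch of point 2, using a dispersion measure as the response (again entirely hypothetical data; log(s) is a common choice because it tames the skewness of the standard deviation):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical 2^2 factorial with 4 repeats per run; factors A and B coded -1/+1.
rng = np.random.default_rng(2)
runs = pd.DataFrame({"A": [-1, 1, -1, 1], "B": [-1, -1, 1, 1]})
repeats = runs.loc[runs.index.repeat(4)].copy()

# Let factor A inflate the within-run spread, so there is a dispersion effect to find.
repeats["y"] = rng.normal(10.0, np.where(repeats["A"] > 0, 1.0, 0.3))

# Summarize each run by the log of its within-run standard deviation,
# then analyze that summary as the response variable.
summary = (repeats.groupby(["A", "B"])["y"].std()
           .rename("s").reset_index())
summary["log_s"] = np.log(summary["s"])
print(smf.ols("log_s ~ A + B", data=summary).fit().params)
```

A large coefficient on A here would indicate a factor that affects variation, which is exactly the effect a mean-only analysis would miss.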
With respect to prediction equations and their associated confidence intervals: extrapolation of the results of an experiment is an engineering or managerial decision, not a statistical one. The confidence intervals in your plot describe the data in hand and may have nothing to do with what you will get in the future. The usefulness and effectiveness of the model you create from the data in hand depend on how representative those data are of future conditions, and that is greatly affected by how noise was handled during the experiment (noise in the inference space, and noise that was changing during the experiment).
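To make the "data in hand" point concrete, here is a small synthetic sketch contrasting the confidence interval for the fitted mean with the wider prediction interval for a new observation; note that even the prediction interval assumes the future process behaves like the process that generated the data:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical straight-line fit, purely to illustrate the distinction.
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 20)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.size)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
pred = fit.get_prediction(X).summary_frame(alpha=0.05)

# mean_ci_* bounds the fitted mean for the data in hand;
# obs_ci_* (the prediction interval) is wider, and even it assumes
# the process generating future observations is unchanged.
print(pred[["mean", "mean_ci_lower", "mean_ci_upper",
            "obs_ci_lower", "obs_ci_upper"]].head())
```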
“Analysis of variance, t-test, confidence intervals, and other statistical techniques taught in the books, however interesting, are inappropriate because they provide no basis for prediction and because they bury the information contained in the order of production. Most if not all computer packages for analysis of data, as they are called, provide flagrant examples of inefficiency.”
Deming, W. Edwards (1975), "On Probability As a Basis for Action," The American Statistician, 29(4), 146-152.
There is no answer to your question as to which approaches are "correct." There are multiple ways to develop prediction equations that account for uncertainty in the future. Each of us has our own experiences and biases as to which method is most useful or effective, but I have found that the methodology I use is situation-dependent. No one knows the right one a priori. Try multiple methods, then run the process and assess which model is appropriate (a minimal sketch of this idea follows). Your focus should be on the data collection process rather than on a technique in the data analysis process.
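A sketch of that "try multiple methods, then judge them against the process" idea (all data synthetic; in practice the "future" set would come from actually running the process, not from simulation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Hypothetical: candidate prediction equations fit on the data in hand,
# then judged on data collected after running the process.
rng = np.random.default_rng(4)
x_old = rng.uniform(0, 10, size=(30, 1))
y_old = 1.0 + 0.8 * x_old[:, 0] + rng.normal(0, 1, 30)
x_new = rng.uniform(0, 10, size=(30, 1))   # stand-in for future runs
y_new = 1.0 + 0.8 * x_new[:, 0] + rng.normal(0, 1, 30)

candidates = {
    "linear": make_pipeline(PolynomialFeatures(1), LinearRegression()),
    "quadratic": make_pipeline(PolynomialFeatures(2), LinearRegression()),
    "cubic": make_pipeline(PolynomialFeatures(3), LinearRegression()),
}
for name, model in candidates.items():
    model.fit(x_old, y_old)
    rmse = mean_squared_error(y_new, model.predict(x_new)) ** 0.5
    print(f"{name}: RMSE on future data = {rmse:.2f}")
```

The model that predicts the future runs best is the useful one, whatever its in-sample statistics said.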
"All models are wrong, some are useful" G.E.P. Box