
Discussions
NominalGemsbok3
Level III

"Surprising" results in an DSD-Design

As part of a master’s thesis, a Definitive Screening Design with six continuous variables was conducted. We generated a standard design in JMP with 17 runs (+3 additional runs at the zero level).

The results surprised me in that a considerable number of interaction and quadratic effects are highly significant. My initial suspicion was overfitting, but I cannot find any indications supporting that. We have an excellent R-squared (which is expected), no elevated VIFs, an unremarkable Durbin–Watson test, etc. Furthermore, a PRESS value of 0.044 is achieved.
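(For context, the PRESS statistic is the leave-one-out prediction error sum of squares; outside JMP it can be reproduced from the hat matrix, as in this minimal Python sketch, where `X` is the model matrix including the intercept column and `y` the response vector:)

```python
import numpy as np

def press(X, y):
    """PRESS via the hat-matrix shortcut: for OLS, the leave-one-out
    residual equals e_i / (1 - h_ii), so no refitting is needed."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    H = X @ np.linalg.pinv(X.T @ X) @ X.T   # hat matrix
    e = y - H @ y                           # ordinary residuals
    loo = e / (1.0 - np.diag(H))            # leave-one-out residuals
    return float(np.sum(loo ** 2))
```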

These findings remain the same or very similar even when strict heredity is disabled in the DSD analysis. In that case, 2–3 additional interaction or quadratic effects appear, which are again highly significant.

Based on everything I know, I currently see no reason to doubt the validity of the results. Or am I overlooking something?

[Image: screenshot of the poster's DSD analysis results]

2 ACCEPTED SOLUTIONS
Victor_G
Super User

Re: "Surprising" results in an DSD-Design

Hi @frankderuyck and @NominalGemsbok3,

There is no single best model, for multiple reasons: choice of performance metric, thresholds (for p-values, for example), estimation method, etc. And there are not enough unique treatments (degrees of freedom) in a DSD to estimate all effects, so you can easily end up with different but competing models with good performance. You could see a DSD as a kind of supersaturated design for a response surface model.

As @frankderuyck mentioned, because the design structure creates partial aliases/correlations between interaction effects, and also between quadratic effects, you can't be 100% sure about the "real" impact of the detected interaction and quadratic effects on your target (by any modeling method) unless you add runs to better inform your model. You can, however, have more confidence in the main effects: the design structure avoids any correlation between main effects, and between main effects and higher-order effects, so you can estimate them without bias.
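If you want to see these partial correlations yourself, the quantity behind JMP's Color Map on Correlations can be reproduced from the coded design; here is a rough Python sketch, where `D` is an n x 6 array of coded factor levels (-1/0/+1) that you would export from the data table:

```python
import numpy as np
from itertools import combinations

def expanded_model_matrix(D):
    """Columns for 6 main effects, 15 two-factor interactions, 6 quadratics."""
    n, k = D.shape
    cols, names = [], []
    for i in range(k):
        cols.append(D[:, i]); names.append(f"X{i+1}")
    for i, j in combinations(range(k), 2):
        cols.append(D[:, i] * D[:, j]); names.append(f"X{i+1}*X{j+1}")
    for i in range(k):
        cols.append(D[:, i] ** 2); names.append(f"X{i+1}^2")
    return np.column_stack(cols), names

def abs_correlations(D):
    """Absolute pairwise correlations between effect columns, i.e. the
    values displayed in a color map on correlations."""
    X, names = expanded_model_matrix(np.asarray(D, dtype=float))
    Xc = X - X.mean(axis=0)
    norms = np.linalg.norm(Xc, axis=0)
    R = (Xc.T @ Xc) / np.outer(norms, norms)
    return np.abs(R), names
```

For a DSD you should see zeros between main effects and everything else, and nonzero off-diagonal entries among the interaction and quadratic columns.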

"All models are wrong, but some are useful"
I tried to create a specific visualization called a raster plot (see Raster plots or other visualization tools to help model evaluation and selection for DoEs for how it was created) on this example, to show this multiplicity of models caused by the combinatorial explosion of possible terms: besides the intercept, there are 27 candidate effect terms (6 main effects, 15 two-factor interactions, and 6 quadratic effects) to choose from. I used the Stepwise platform with the option "All Possible Models" (up to 10 terms in the model, with the strong heredity assumption). Here is the result, sorted by R-square value; it shows which terms (columns) are included in each model (rows):

[Image: raster plot, models sorted by R-square; columns are candidate terms, rows are models]
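To give a feel for the combinatorics behind that plot, here is a small Python sketch that counts the candidate models under strong heredity (my own illustration, not JMP's Stepwise code):

```python
from math import comb

K = 6           # factors
MAX_TERMS = 10  # model size cap used in "All Possible Models"

# Under strong heredity, once the set of main effects is fixed, the only
# allowed higher-order terms are the interactions and quadratics of those
# mains. So count: (main-effect sets) x (subsets of allowed higher terms).
total = 0
for m in range(K + 1):                     # number of main effects in the model
    n_mainsets = comb(K, m)
    n_allowed = comb(m, 2) + m             # 2FIs among the mains + their quadratics
    for h in range(MAX_TERMS - m + 1):     # number of higher-order terms
        total += n_mainsets * comb(n_allowed, h)   # comb() is 0 when h > n_allowed
print(total)   # count of candidate models (includes the intercept-only model)
```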

I prefer using an information criterion such as AICc for comparing multiple models (the lower the better), as it penalizes the use of too many terms and allows a fairer comparison between models of different complexity:

[Image: raster plot, models sorted by AICc]
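For reference, AICc for a least-squares fit can be computed by hand; a minimal Python sketch (dropping the additive Gaussian constant, which cancels when comparing models on the same data):

```python
import numpy as np

def aicc_ols(rss, n, n_params):
    """AICc for a Gaussian least-squares fit.
    n_params should count the intercept, the model terms, and the error
    variance; requires n - n_params - 1 > 0."""
    k = n_params
    aic = n * np.log(rss / n) + 2 * k              # up to an additive constant
    return aic + 2 * k * (k + 1) / (n - k - 1)     # small-sample correction
```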

As you can see, most of the models agree on the presence of the main effects of the first 5 factors. For factor 6, the results differ and there is no obvious pattern for the presence of its main effect. For interactions and quadratic effects it is also hard to see strong patterns, except that some higher-order effects are rarely included: the interactions factor 1 x factor 6, factor 2 x factor 4, factor 3 x factor 5, factor 3 x factor 6, factor 4 x factor 6 and factor 5 x factor 6. Among the quadratic effects, the factor 6 quadratic term is absent from most models. If we zoom in a little on the best models according to R-square, there are some interesting observations on the higher-order effects:

[Image: zoom on the best models by R-square in the raster plot]

The interactions Factor 1 x Factor 3 and Factor 4 x Factor 5 tend to be chosen often, and the quadratic effects for factor 2 and factor 4 are also frequently selected. These results agree with those I obtained from the Fit Definitive Screening platform, which detects the same main effects and higher-order effects:

[Image: Fit Definitive Screening report]

When limiting the comparison to three estimation methods, you can also see this situation of different but equivalent models and term combinations. For example, with the Fit Definitive Screening, GenReg Normal Pruned Forward and GenReg Two Stage Forward estimation methods, we can compare both the performance of the models and the terms included:

  • Performance: here with R-square and adjusted R-square, for explanatory purposes (how much of the variability in the response the model explains):
    [Image: R-square and adjusted R-square comparison for the three estimation methods]

We can see that the first two methods show similar performance.

  • Terms in the models:
    [Image: comparison of the terms included by each estimation method]

Even though the first two estimation methods provide models with similar performance, the higher-order terms they include are different: they only agree on the inclusion of the interaction effect Factor 1 x Factor 3.
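One simple way to tabulate this agreement outside JMP (Python; the term sets below are hypothetical stand-ins transcribed by hand from each platform's report):

```python
# Hypothetical term lists, one set per estimation method
models = {
    "Fit Definitive Screening": {"X1", "X2", "X3", "X4", "X5", "X1*X3"},
    "GenReg Normal Pruned Fwd": {"X1", "X2", "X3", "X4", "X5", "X1*X3", "X2^2"},
    "GenReg Two Stage Forward": {"X1", "X2", "X3", "X4", "X5", "X1*X3", "X4*X5"},
}
common = set.intersection(*models.values())
print("terms selected by every method:", sorted(common))
for name, terms in models.items():
    print(name, "only:", sorted(terms - common))
```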

 

So a reasonable follow-up would be to discuss with domain experts which model(s) are the most sensible, and to use the Augment Design platform to confirm and/or refine the most relevant model. You can, for example, augment the design and specify the model whose terms you want to estimate.
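For intuition only, here is a rough greedy D-optimal sketch of what augmentation does (Python; this is not JMP's actual Augment Design algorithm, which uses coordinate exchange, and the starting design `D0` and term list are whatever you export/specify yourself):

```python
import numpy as np
from itertools import product

def model_matrix(D, terms):
    """Terms as index tuples: (i,) main effect, (i, j) interaction,
    (i, i) quadratic. E.g. terms = [(0,), (1,), (2,), (0, 2), (1, 1)]."""
    cols = [np.ones(len(D))]                       # intercept
    for t in terms:
        col = D[:, t[0]] if len(t) == 1 else D[:, t[0]] * D[:, t[1]]
        cols.append(col)
    return np.column_stack(cols)

def greedy_augment(D0, terms, n_add):
    """Greedily add runs from the full 3-level grid, maximizing log|X'X|
    (D-optimality for the specified model). Assumes D0 already makes
    X'X nonsingular for that model."""
    D = np.asarray(D0, dtype=float)
    k = D.shape[1]
    candidates = np.array(list(product([-1, 0, 1], repeat=k)), dtype=float)
    for _ in range(n_add):
        logdets = []
        for c in candidates:
            X = model_matrix(np.vstack([D, c]), terms)
            sign, logdet = np.linalg.slogdet(X.T @ X)
            logdets.append(logdet if sign > 0 else -np.inf)
        D = np.vstack([D, candidates[int(np.argmax(logdets))]])
    return D[len(D0):]                             # just the added runs
```

Calling `greedy_augment(D0, terms, 4)` would propose four extra runs tailored to estimating the specified model's terms.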

Please find the table with all scripts used in my response.
Hope this answer will help you,

 
Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)


frankderuyck
Level VII

Re: "Surprising" results in an DSD-Design

A DSD is still a screening design, appropriate for screening out strong effects (particularly main effects) from a larger set of potential factors. It will also give an indication of potential quadratic effects and interactions, but be aware that the latter are correlated (!), cf. the color map on correlations. If you get three or fewer active factors, the DSD will yield an RSM. But if there are more than 3 significant effects, you will need to augment the DSD to determine the pure interaction and quadratic effects. Therefore, if after brainstorming with experts there are probably interaction effects, I always start with a minimal or low-run DSD that is sufficient to detect strong effects, so that there are enough runs left in the budget for augmentation and de-aliasing. Also, I would not spend too much effort on center-point replication.


29 REPLIES
MathStatChem
Level VII

Re: "Surprising" results in an DSD-Design

For 6 factors this is a relatively "large" design, and because there are only 27 estimable effects in a 6-factor response surface, you have a good chance of estimating many of those effects. Looking at the center-point runs, they range from 3.915 to 4.018, giving a residual standard deviation of about 0.02. So any effect much larger than 0.2 has a good chance of being significant. Looking at the fold-over points, I see differences of ~0.6.
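As a sketch of that arithmetic (Python; the function is generic since only the range of the center points is quoted here):

```python
import numpy as np

def pure_error_sd(center_runs):
    """Model-free noise estimate: sample sd of replicated center points."""
    return float(np.std(np.asarray(center_runs, dtype=float), ddof=1))

# With replicates spanning 3.915 to 4.018, the sd lands in the
# few-hundredths range (the poster estimates ~0.02), so a ~0.6 fold-over
# difference sits an order of magnitude above the noise.
```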

So yes, this is entirely possible.  Sharing the actual data table would help with exploring this further.  

NominalGemsbok3
Level III

Re: "Surprising" results in an DSD-Design

Thanks for your response. What do you mean by the "actual data table"?

statman
Super User

Re: "Surprising" results in an DSD-Design

There is essentially no way to respond, as there is too little insight into the actual experiment. Is this an actual physical experiment (or a simulation)? Was measurement error quantified a priori? How was your MSE estimated? Was the change in response of any practical value? Statistical significance is a conditional statement: it is not meaningful unless the estimate of the MSE is a reasonable estimate of reality (and you understand what constitutes the estimate).

"All models are wrong, some are useful" G.E.P. Box
NominalGemsbok3
Level III

Re: "Surprising" results in an DSD-Design

It is a real and actual physical experiment. The MSE was estimated from the residuals of a linear model. To validate this estimate, pure error was obtained from replicated center points (variance ≈ 0.002). A lack-of-fit test indicated no significant model inadequacy (p = 0.79), with the mean square for lack of fit being smaller than that of the pure error. Therefore, the MSE can be considered an appropriate estimate of the error variance.
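For readers following along, the lack-of-fit decomposition behind that p-value can be sketched like this (Python; `sse`/`df_e` are the residual sum of squares and degrees of freedom from the fitted model, and `ss_pe`/`df_pe` come from the replicated center points):

```python
from scipy import stats

def lack_of_fit_test(sse, df_e, ss_pe, df_pe):
    """Split the residual SS into lack-of-fit and pure error,
    then F-test MS_LOF against MS_PE."""
    ss_lof = sse - ss_pe
    df_lof = df_e - df_pe
    ms_lof = ss_lof / df_lof
    ms_pe = ss_pe / df_pe
    F = ms_lof / ms_pe
    p = stats.f.sf(F, df_lof, df_pe)   # upper-tail p-value
    return F, p
```

A large p-value (here 0.79), with MS_LOF below MS_PE, indicates no detectable model inadequacy.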

Victor_G
Super User

Re: "Surprising" results in an DSD-Design

Hi @NominalGemsbok3,

I may have complementary propositions and remarks regarding the results you have obtained.

  • On a practical note, there are better metrics for assessing a possible overfitting scenario: use adjusted R-square and minimize the delta between R-square and adjusted R-square (R-square tends to increase with the number of terms in the model, whereas adjusted R-square accounts for the number of terms and penalizes overly complex models), or use an information criterion (like AICc or BIC), which helps compare models by balancing accuracy against complexity; a short sketch of the adjusted R-square check follows this list. PRESS-based values (R-square and RMSE) are also good indicators of model performance: they are based on leave-one-out cross-validation, so they give a glimpse of performance on points not used in model training. All these metrics support model selection and help avoid the trap of the "Cult of Statistical Significance".
  • Also, statistical significance is different from the practical importance of the effects: with a very high signal-to-noise ratio, you might detect statistically significant effects easily even if they have no practical importance. Use domain expertise to filter the relevant and important effects and guide the model selection process.
  • Regarding the number of active effects, what is reassuring is to see that:
    • The design and modeling approach are appropriate for the number of active terms detected. From the JMP Help on Fit Definitive Screening: "A minimum run-size DSD is capable of correctly identifying active terms with high probability if the number of active effects is less than about half the number of runs and if the effect sizes exceed twice the standard deviation". With 9 active terms detected out of 17 original runs (+3 added center points) and a relatively low RMSE, you seem to be in that good situation.
    • They respect the Effect Hierarchy principle: the higher the order of an effect, the less likely it is to explain variation in the response. In your situation, 5/6 (83%) of the main effects are detected as significant, as well as 3/15 (20%) of the two-factor interactions and 1/6 (16.7%) of the quadratic effects. The proportion of active effects may seem high, but it follows the same regularity across effect orders as in the paper by Xiang Li, Nandan Sudarsanam, and Daniel D. Frey (March 2006), "Regularities in Data from Factorial Experiments", https://doi.org/10.1002/cplx.20123
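As promised above, a minimal illustration of the adjusted R-square check (Python, with made-up numbers purely to show how the delta behaves as the model grows):

```python
def r2_adjusted(r2, n, p):
    """Adjusted R^2 for a model with p terms (excluding intercept) on n runs."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Illustration: 20 runs, R^2 = 0.99, increasing model sizes
for p in (5, 9, 14):
    r2a = r2_adjusted(0.99, 20, p)
    print(p, round(r2a, 3), "delta =", round(0.99 - r2a, 3))
# The delta grows with model size; a large gap flags overfitting.
```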

It seems your results are the outcome of a careful selection of potentially relevant predictors, a low level of noise in the experimentation and response measurement, and an appropriate design choice and modeling approach. Congrats!

Hope this answer will help you trust these positive results,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)
NominalGemsbok3
Level III

Re: "Surprising" results in an DSD-Design

Thank you very much for this encouraging and detailed response.

frankderuyck
Level VII

Re: "Surprising" results in an DSD-Design

Is it possible to send a coded JMP file? I would like to try some GenReg models.

NominalGemsbok3
Level III

Re: "Surprising" results in an DSD-Design

Hello,

Here is a coded JMP file.

statman
Super User

Re: "Surprising" results in an DSD-Design

I did a quick analysis (Fit Model) and my results are completely different from yours: three main effects are the largest effects. I added the fit model to your data table (Tabelle1).

"All models are wrong, some are useful" G.E.P. Box
