Here are my thoughts in general (specific advice would require a more thorough understanding of the situation):
First, I will assume you have an un-replicated 2^4 factorial (16 treatment combinations, one run each), with the model containing all possible terms up to the 4th-order interaction.
You bring up an important concept with respect to experimentation and the F-test (or any significance test, for that matter). The typical significance test in experimentation compares the MS of a model term with the MSe (mean square error); this ratio is the F-ratio or F-value. The important questions are: How was the error term estimated? How representative of the true error is that estimate? Are the comparisons being made useful and representative?
If you remove insignificant terms from the model and pool them into error (as lack of fit), you are potentially biasing the MSe low, because by construction you are pooling the smallest sums of squares into the error term (small SS divided by their DF). When you then compare the MS of a model term with that smaller MSe, you get inflated F-values (and smaller p-values).
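This bias is easy to see numerically. Below is a minimal sketch (with simulated, purely made-up data, not your experiment): 15 effect sums of squares are drawn from pure noise, the 12 smallest are pooled as "error," and the largest effect is then tested against that pooled MSe. Even though nothing is truly active, the selection makes the MSe small and the F-ratio large.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical illustration: 15 effect SS from a pure-noise unreplicated
# 2^4 factorial. Each SS ~ sigma^2 * chi-square(1), with sigma^2 = 1,
# so an honest error mean square should be about 1.
ss = rng.chisquare(df=1, size=15)

# "Pooling" strategy: keep the 3 largest terms in the model and pool the
# 12 smallest SS into the error term (12 DF).
ss_sorted = np.sort(ss)
mse_pooled = ss_sorted[:12].sum() / 12   # biased low: built from the smallest SS
f_largest = ss_sorted[-1] / mse_pooled   # F-ratio for the biggest "effect"
p_value = stats.f.sf(f_largest, 1, 12)

print(f"pooled MSe = {mse_pooled:.3f}  (true sigma^2 = 1)")
print(f"F for largest effect = {f_largest:.2f}, p = {p_value:.4f}")
```

Because the error term is assembled from the smallest SS, the pooled MSe sits below the true error variance, and a pure-noise contrast can look "significant."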
A quick read:
https://www.additive-net.de/images/software/minitab/downloads/SCIApr2004MSE.pdf
I recommend first assessing the practical significance of the model terms using a Pareto chart of effects, AND using Daniel's method of evaluating statistical significance for un-replicated experiments (normal/half-normal plots), perhaps augmented with Bayes plots (Box & Meyer). This will give you both practical significance and statistical significance without the pooling bias.
Daniel, Cuthbert (1959), "Use of Half-Normal Plots in Interpreting Factorial Two-Level Experiments," Technometrics, Vol. 1, No. 4, pp. 311-341.
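The mechanics behind a half-normal plot can be sketched in a few lines: estimate all effect contrasts, sort the absolute effects, and pair them with half-normal plotting positions (the points you would graph; active effects fall off the straight line through the near-zero ones). The sketch below uses a made-up 2^3 example for brevity; the same recipe applies to a 2^4 with 15 effects.

```python
import itertools
import numpy as np
from scipy import stats

# Hypothetical unreplicated 2^3 factorial (8 runs), -1/+1 coding,
# standard order. The response values are invented for illustration.
design = np.array([[(run >> bit & 1) * 2 - 1 for bit in range(3)]
                   for run in range(8)])          # columns: A, B, C
y = np.array([60., 72., 54., 68., 52., 83., 45., 80.])

# All 7 effect contrasts: A, B, C, AB, AC, BC, ABC.
names, effects = [], []
for r in (1, 2, 3):
    for combo in itertools.combinations(range(3), r):
        names.append("".join("ABC"[i] for i in combo))
        contrast = np.prod(design[:, list(combo)], axis=1)
        effects.append(contrast @ y / 4.0)        # effect = contrast sum / (n/2)
effects = np.array(effects)

# Half-normal plotting positions (Daniel): quantiles at (i - 0.5) / m.
order = np.argsort(np.abs(effects))
m = len(effects)
quantiles = stats.halfnorm.ppf((np.arange(1, m + 1) - 0.5) / m)

for q, idx in zip(quantiles, order):
    print(f"{names[idx]:>3}: |effect| = {abs(effects[idx]):6.2f}  "
          f"half-normal quantile = {q:.2f}")
```

Plotting |effect| against the quantiles (by hand or with any plotting library) shows most effects hugging a line through the origin, with the active ones standing apart; no error estimate from pooling is required.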
Once you have determined the factors/interactions that are active in the experiment, simplify/reduce the model. The purpose of simplifying the model is twofold:
1. to get a more useful model for iteration and prediction
2. to get residuals to help assess model adequacy and whether any assumptions were violated.
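Both purposes can be sketched with a reduced-model refit. Continuing the made-up 2^3 example (and hypothetically supposing A and AC were judged active), fit only those terms and inspect the residuals; in a two-level factorial each regression coefficient is half the corresponding effect.

```python
import numpy as np

# Same hypothetical 2^3 design and invented response as before.
design = np.array([[(run >> bit & 1) * 2 - 1 for bit in range(3)]
                   for run in range(8)])
y = np.array([60., 72., 54., 68., 52., 83., 45., 80.])

# Reduced model with only the (hypothetically) active terms: A and AC.
A, C = design[:, 0], design[:, 2]
X = np.column_stack([np.ones(8), A, A * C])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# The reduced model is what you iterate and predict with (purpose 1);
# the residuals feed normal plots, residuals-vs-fitted, and
# residuals-vs-run-order checks of the assumptions (purpose 2).
print("coefficients (intercept, A, AC):", np.round(beta, 3))
print("residuals:", np.round(resid, 3))
```

Systematic patterns in the residuals (trends, funnels, outliers) flag an inadequate model or violated assumptions; the point of the refit is diagnosis and prediction, not fresh significance tests.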
Note: you do not re-assess statistical significance with the simplified model. Those terms were already selected using the same data, so re-testing them against the now-smaller MSe would overstate their significance.
"All models are wrong, some are useful" G.E.P. Box