I'm happy to share my thoughts, but I don't have a sufficient understanding of your situation to provide specific advice (e.g., whether you should augment, replicate, or something else), so I'll speak in generalities, which is naturally an oversimplification:
What you are doing in an experiment is trying to compare the effects of the terms in your predicted model (main effects, second-order terms, etc.) to random error (noise) and to each other. Since experiments are, by design, narrow inference space studies (you have essentially removed the time series, so the effects of noise that change over time are naturally excluded), you have to exaggerate the effects of noise to be representative of future conditions.
“Unfortunately, future experiments (future trials, tomorrow’s production) will be affected by environmental conditions (temperature, materials, people) different from those that affect this experiment…It is only by knowledge of the subject matter, possibly aided by further experiments to cover a wider range of conditions, that one may decide, with a risk of being wrong, whether the environmental conditions of the future will be near enough the same as those of today to permit use of results in hand.”
Dr. Deming
Unfortunately, as the experimental conditions become more representative of future conditions, the precision for detecting factor effects can be compromised. One of the strategies to improve precision lies in the level setting for your design factors (hence the advice in screening designs to set levels bold, but reasonable). Strategies to handle noise are critical to running experiments that are actually useful in the future. There are a number of such strategies (e.g., randomized complete block designs (RCBD), balanced incomplete blocks (BIB), and split-plots for long-term noise; repeats for short-term noise). When you have significant resource constraints, your options are limited.
The bottom line is that if you are using only statistical significance tests, the less representative your estimate of the random noise, the less useful the statistical test (it is less than useful to have something appear significant today and insignificant tomorrow). Stepwise regression uses statistical tests (of the data in hand) and, again, if that data is not representative, those tests may be worse than useless (committing both alpha and beta errors). It is interesting how the focus in screening situations is solely on the alpha error (p-values, for example) when, perhaps, the more dangerous error is the beta (failing to detect a factor that is actually active).
I'm not sure what you mean by "Could you elaborate on this and how to perform this in jmp?" Do you mean how to design appropriate experiments, or how to perform an appropriate analysis?
All analysis depends on how the data was acquired. For your analysis, since you have an unreplicated design, run Fit Model and assign ALL of the DFs to the model (i.e., saturate the model). Use the Standard Least Squares personality. Once you have the output, click the red triangle next to each response and select Effect Screening. There you will find the Normal Plot, Pareto Plot, and Bayes Plot options (see the additional script above, and the sketch below).
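If it helps, here is a minimal JSL sketch of that workflow, assuming a 2^3 unreplicated factorial with factors X1, X2, X3 and a response Y (the column names are placeholders, not your actual data; the plot options are the same Effect Screening items named above, and menu/message names can vary slightly by JMP version):

// Saturated model for an unreplicated 2^3 design: all 7 model DFs assigned to effects, none to error
dt = Current Data Table();
fit = dt << Fit Model(
	Y( :Y ),
	Effects( :X1, :X2, :X3, :X1*:X2, :X1*:X3, :X2*:X3, :X1*:X2*:X3 ),
	Personality( "Standard Least Squares" ),
	Emphasis( "Effect Screening" ),
	Run(
		// same options as the red triangle > Effect Screening menu
		:Y << {Normal Plot( 1 ), Pareto Plot( 1 ), Bayes Plot( 1 )}
	)
);

Because the model is saturated, there are no degrees of freedom left for a formal error term, so these plots judge the effects against each other (the inert effects stand in for the noise) rather than against a residual mean square.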
"Results of a well planned experiment are often evident using simple graphical analysis. However the world’s best statistical analysis cannot rescue a poorly planned experimental program."
Gerald Hahn
P.S. You might enjoy reading Cuthbert Daniel; he never took a class in statistics, but he was a profound experimenter.
"All models are wrong, some are useful" G.E.P. Box