Discussions

Solve problems, and share tips and tricks with other JMP users.
DualARIMACougar
Level III

Efficient RSM fitting and Stepwise Regression on large DoE datasets with minimal clicks in JMP Pro

Hello everyone,

I am working with a very large DoE dataset in JMP Pro and would like to evaluate it quickly with as few manual clicks as possible. I am still fairly new to JMP Pro, and until recently I have mainly analyzed DoE studies using standard Least Squares methods. Therefore, even after reading documentation and tutorials, I am still unsure about the most efficient workflow for my current task.

My design includes three factors and approximately 20 responses, grouped by different treatments. I would like to refit a Response Surface Model (RSM) for each response and analyze them consistently.

My main questions are:

  1. Is there a way to run Stepwise Regression or RSM fitting more efficiently without having to repeatedly click “Go” for each step?

  2. Are there tools or options in JMP that allow semi-automated or single-click workflows for multiple responses (e.g., Fit Definitive Screening / Fit Model presets, model dialogs, or other shortcuts)?

  3. As an additional question: if full automation is needed, would scripting (JSL) or batch processing be the best way to scale this up? If so, are there recommended starting points or templates for beginners?

My goal is to speed up the model evaluation and refitting steps as much as possible, ideally without heavy scripting.

Thank you in advance for your help and suggestions.

8 REPLIES
statman
Super User

Re: Efficient RSM fitting and Stepwise Regression on large DoE datasets with minimal clicks in JMP Pro

It is virtually impossible to provide specific advice based on the information you provided.  Can you attach the experiment?  Or at least the experiment, coded?

Sorry, I'm a bit confused by your query.  Your post suggests you have 1 large DOE data set.  Not sure I understand, as you only have 3 factors? Why would you be setting up a workflow?  In any case, the first step for any experiment is to determine if you created enough variation from a practical perspective.  Did the response variables vary enough to make the analysis of the experiment interesting?  If so, the next step is to do a multivariate correlation of the many Y's.  At the same time, check for any outliers using the Mahalanobis distance outlier test.  Depending on the number of correlated Y's (if they correlate, the analysis will be the same), you may have to fit some Y's separately.  Stepwise is not a procedure for analyzing a DOE.  It is for historical or observational data, when you are searching for a model.  In a DOE, you have a model in mind (or you should), so Fit Model is your platform.

"All models are wrong, some are useful" G.E.P. Box
DualARIMACougar
Level III

Re: Efficient RSM fitting and Stepwise Regression on large DoE datasets with minimal clicks in JMP Pro

Hi Statman,

thank you for the clarification. Let me explain my setup in more detail:

I am working with a three-factor RSM design, but for these three factors I have measured 25 different responses. These responses are grouped into five treatment groups (using the ‘By’ role). So while the factor space is small, the total number of response/treatment combinations is large. In addition, I have multiple datasets of this type (five DoEs in total), therefore the repetitive analysis steps add up significantly.

I fully understand that sharing the dataset would be helpful, and I would love to share it, but I cannot share the data here due to CDAs, and anonymization is unfortunately not an option either.

I generated the designs using JMP's Custom Designer and I am using the Fit Model platform for the analysis. My intention was to select the Stepwise personality rather than Standard Least Squares to introduce more objectivity, using AIC/BIC criteria rather than p-value-based selection, and also to reduce clicks, since I only have to click the Go button. The Stepwise personality still fits a polynomial model to my data, but in a more automated, less biased way that is not driven only by p-values.

However, the practical issue is that for each response in each treatment group, I still need to repeatedly click the “Go” button. With many responses and multiple datasets, this becomes highly time-consuming. My question is simply whether this part of the workflow can be automated or accelerated.

My end goal is to fit all responses, evaluate them, and generate prediction profilers using the DOE fitting platform. I prefer not to remove outliers beforehand because I want to evaluate them after fitting and avoid biasing the analysis upfront.

So the core question remains:
Is there a way to run Stepwise (or Fit Model) across multiple responses and treatment groups without manually clicking “Go” for every response—either through existing UI shortcuts or via scripting/batch execution?

This would substantially reduce the number of manual clicks required.

Does this additional context help to answer my workflow question?

Also, I have been analyzing DoEs with JMP for several years now, always the old-school way with Standard Least Squares. Now I have the JMP Pro version, and I am still hoping there are some options here that save time.

Victor_G
Super User

Re: Efficient RSM fitting and Stepwise Regression on large DoE datasets with minimal clicks in JMP Pro

Did you try to use CTRL+click on "GO" ?
Using CTRL before an action helps broadcast that action to every similar button/box. It can also be used to change the settings in the Fit Stepwise estimation for all responses at the same time.

However, Bill's remarks on Stepwise selection still remain. If you have built a DoE with the Custom Design platform, that means you have an a priori model for this experimental scenario. You should use it first, and use other modeling methods (with caution) as competitor models to compare the performance and the pros and cons of each.

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)
DualARIMACougar
Level III

Re: Efficient RSM fitting and Stepwise Regression on large DoE datasets with minimal clicks in JMP Pro

Hi Victor,

First of all thanks for your help as always!!

Thanks, this little key combination indeed helps here. 
Regarding your comment on restricting the analysis to Fit Model, I am a bit confused. 
From my understanding, both Standard Least Squares regression with manual p-value selection and Stepwise regression fit a polynomial model to my data; the terms of the polynomial model are specified in the model dialog (in my case an RSM). 
I found this very nice video: (Mastering JMP) Deploying Stepwise Regression Methods in JMP and JMP Pro - YouTube.
and this explanation: 11.2 - Stepwise Regression | STAT 462

Here the differences between Standard Least Squares and Stepwise regression are well explained, and my naive conclusion is that they are essentially the same, except that Stepwise does not rely only on p-value selection: the BIC (or AIC) criterion is additionally used. That makes sense, since it addresses the question of which variables are important for the model without increasing the model's complexity (simply put :-)), avoids subjective p-value bias, and automates the whole process of model building, giving more objectivity. Would you agree?

Still, my question is not answered, so I will repost it here and add some more:

1) Is there a practical and fast way to automate DoE model building in JMP Pro for huge datasets with many responses (simple designs made with the Custom Designer, e.g., RSM, custom design, etc.)?
2) Is Stepwise regression an option here?
3) Since I only need the final Prediction Profiler with all of my responses together, I face another problem: how do I save the final Prediction Profiler without redoing the whole analysis? If I save the analysis as a script now, I have to redo everything; the Prediction Profiler created after model building is not saved. 
4) In JMP Pro, my model dialog also offers the Generalized Regression personality, which I can use as well (the hover explanation states that this is the method of choice for all linear regression problems). What is the advantage here over Standard Least Squares or Stepwise regression? And how can I automate the analysis in the Generalized Regression model builder? The CTRL+click option does not work there.

Thanks a lot for your help in advance! I suppose that automating DoE analysis and model building for huge datasets is not only an issue I face, and a solution would help a lot of people!


statman
Super User

Re: Efficient RSM fitting and Stepwise Regression on large DoE datasets with minimal clicks in JMP Pro

The best way to analyze an RSM is to use graphical contour plots.  If you are at the stage of optimization, which is what RSM is intended for, then you should already know the significance of the first- and second-order model effects.

When doing Fit Model and assessing model adequacy, the p-value is only one of the statistics you should use (and only use it when you have some idea of what is being used to estimate the MSE and whether it is representative).  Again, before you do anything, assess the integrity of the data (there are multiple methods to do this).  Then assess the practical significance. You also need to assess adjusted R-square (larger is better), the delta between R-square and adjusted R-square (smaller is better; otherwise you have an over-specified model), and RMSE (smaller is better).  And always assess the practical implications.  Is the model usable?

 

TBH, your situation is not very complex.  Handling the analysis shouldn't take very long (I had an experiment with 3281 responses once).

You can create a workflow, but the steps may be different depending on what you find as you go.

https://www.jmp.com/support/help/en/19.0/?os=mac&source=application#page/jmp/workflow-builder.shtml

 

"All models are wrong, some are useful" G.E.P. Box
DualARIMACougar
Level III

Re: Efficient RSM fitting and Stepwise Regression on large DoE datasets with minimal clicks in JMP Pro

Hi Statman,

thanks again for your explanations. I do understand the classical DOE perspective: in a three-factor RSM the model is predefined, the focus is on optimization rather than on model hunting, and things like adjusted R-square, RMSE, and practical relevance are important for adequacy. I am aligned with that.

My problem is not the statistics. It is the practical workload. In my field (protein stability work), I often have ~25 responses per dataset—aggregation levels, charge variants, hydrodynamic size, Tm, etc.—grouped by treatments. The DOE setup itself stays the same (three factors), but the actual response behavior changes with each protein. So I cannot simply reuse a Workflow Builder template; each protein dataset has to be evaluated individually, even though the quadratic model structure is identical.

I also understand contour or surface plots for RSM. These are helpful for intuition. But they are essentially qualitative, because they show the predicted surface in 2D:

y_hat(x1, x2 | x3 = constant)

where

  • y_hat is the predicted response from the model,

  • x1, x2, x3 are the three factors,

  • and “| x3 = constant” just means x3 is held fixed.

This is fine for a single response, but with 20–25 responses it becomes hard to base decisions on contour plots alone.

In practice, I rely more on the Prediction Profiler (and desirability) because it gives me a quantitative way to evaluate and optimize multiple responses at once. For example, for k responses I have predicted values y1_hat, y2_hat, …, yk_hat, and each has its own desirability function d1(y1_hat), d2(y2_hat), …, dk(yk_hat). The overall desirability that I try to optimize is:

D(x) = (d1 * d2 * … * dk)^(1/k)

with x = (x1, x2, x3).
This allows me to numerically optimize the factor settings for multiple stability readouts at once. That’s something contour/error surfaces can’t easily do when the number of responses is high.
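As a tiny numeric illustration of that geometric mean (the desirability values here are hypothetical, just to show the arithmetic), in JSL:

```jsl
// Hypothetical desirabilities for k = 3 responses
d1 = 0.8; d2 = 0.6; d3 = 0.9;
D = (d1 * d2 * d3) ^ (1 / 3);  // overall desirability, about 0.76
```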

So the core issue is not methodology. It’s the time spent clicking. To get from raw data to profiler and desirability-based decisions, I currently need more than an hour per dataset just to run the same quadratic model for each response. Across several datasets, this is four or five hours of mostly repetitive clicking—even though the model definition doesn’t change.

Workflow Builder doesn’t solve this for me, because even if the factors stay identical, each protein dataset yields different responses and I still need to review model adequacy individually.

That’s why I’m looking for a more efficient way—ideally via JSL—to:

  • loop over a list of response columns,

  • apply the same quadratic RSM model in Fit Model and evaluate the fits, for example by minimizing p-values or maximizing adjusted R²,

  • and automatically generate profilers (and desirability)

without having to press “Exclude” again and again.

So my question is simply: is there a way in JMP Pro to batch this process? If you know a JSL pattern or example script that fits this scenario, it would help me a lot. I want to stay aligned with DOE best practice—I just want to reduce manual clicking in the process.
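To make this concrete, here is roughly the kind of loop I have in mind. This is an untested sketch: the factor names (X1–X3), the response names, and the Treatment column are placeholders for my real columns, and the exact Stepwise messages would need to be verified against the Scripting Index.

```jsl
// Untested sketch: run Stepwise over a list of responses (placeholder names)
dt = Current Data Table();
responses = {"Y1", "Y2", "Y3"};  // in reality ~25 response columns

For( i = 1, i <= N Items( responses ), i++,
	obj = dt << Fit Model(
		Y( Column( dt, responses[i] ) ),
		// Full quadratic (RSM) model in the three factors
		Effects(
			:X1 & RS, :X2 & RS, :X3 & RS,
			:X1 * :X2, :X1 * :X3, :X2 * :X3,
			:X1 * :X1, :X2 * :X2, :X3 * :X3
		),
		By( :Treatment ),
		Personality( "Stepwise" ),
		Run
	);
	obj << Set Stopping Rule( "Minimum BIC" );  // or "Minimum AICc"
	obj << Go;         // run stepwise to completion instead of clicking Go
	obj << Run Model;  // launch the selected model, where the profiler lives
);
```

With a By column, my understanding is that obj comes back as a list of Stepwise reports and the subsequent messages broadcast to all of them, but I would verify that too.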

statman
Super User

Re: Efficient RSM fitting and Stepwise Regression on large DoE datasets with minimal clicks in JMP Pro

Do any of the response variables correlate?  If so this may lessen the load.  If Y's don't correlate, you will have to make compromises regarding priorities.  The models will be different. I don't know how to write a script that integrates logical, rational thinking into decision making regarding model refinement.  There may be some coding that makes you more efficient, but I don't know it.

 

If it were me, I'd be investigating what makes each protein different and why/how should the factors be set differently for each protein (perhaps proteins can be categorized?), but I am not an SME.

"All models are wrong, some are useful" G.E.P. Box
Victor_G
Super User

Re: Efficient RSM fitting and Stepwise Regression on large DoE datasets with minimal clicks in JMP Pro

Hi @DualARIMACougar,

 

As @statman highlighted in his responses, there are many metrics to evaluate and compare models, depending on your objectives and the modeling emphasis (explanatory, predictive, a mix of both, etc.). You can read some previous discussions about this broad topic: Analysis of split plot design with full factorial vs RSM / best model / Statistical Significance / ... For optimization objectives, you might put more emphasis on predictive performance than on other metrics, so RMSE and information criteria (AICc, BIC, ..., which avoid overly complex models for the predictive task) might be sensible indicators.
I also think that instead of creating several designs, each for a different protein, the different proteins could be used as a categorical factor in the design, with the other factors nested within protein nature/type. If you have numerical characterisations/measurements of these proteins, or if you can use chemical structures to differentiate them and calculate molecular descriptors, you can cluster the proteins based on these values (to better understand the differences and similarities in their results), or use these values as covariates to create your design (to make sure you have a sample of proteins that covers and is representative of the diversity of the protein population under study). Using continuous factors instead of categorical factors helps increase the inference space. You can check these two presentations for inspiration: Increase Efficiency and Model Applicability Domain When Testing Options That Are at First Glance Mul...
Coding with Continuous and Mixture Variables to Explore More of the Input Space (2022-US-45MP-1103)  

The more "automation" you put into the modeling phase, the more error-prone your analysis may be (and the more scrutiny you may have to apply to the modeling/prediction results and their validation). For example, you may be able to fit several dozen models in JMP for your responses, but you still need to assess the validity of each model, particularly regarding the validity of the regression assumptions.

However, concerning your other points, I can suggest some other options:

  • If you have JMP Pro, I would recommend using the platform Generalized Regression, with the estimation methods Normal Pruned Forward and Two Stage Forward regression, with AICc criterion (you can use CTRL+click to fit the same type of model on all your responses). The models I obtained through this modeling type are often very satisfactory, and I need to dive deeper when some model metrics seem weak, or if residuals show a strange (non-random/non-normal) pattern. If you fit several responses, you can activate the Profiler for all responses in the red triangle once the models have been fitted, as well as other Profiler types. You can also save these Prediction formula in your table or Formula Depot, and in case of several competing models, use Model Comparison platform (directly from Formula Depot or in Predictive Modeling menu) to better evaluate and compare the competing models.
  • If several of your responses are correlated, you could also use Partial Least Squares regression modeling, that can leverage the correlations between your Y's to better model the link between the X's and the Y's.
  • As your emphasis may be on predictive performance, you can also try to fit a flexible model using different Machine Learning modeling options, like Gaussian Process, Support Vector Machines, or Random Forest. These Machine Learning models are easier to launch and rely on fewer or weaker assumptions, but they do require some effort on a proper validation strategy to avoid overfitting. The platform Model Screening can also be helpful to screen, compare and select several ML models quickly. However, if your designs are model-based for a 2nd-order response surface, these modeling options may not perform well.
  • Finally, as you're using an optimization design, you're interested in optimizing several responses through the identified factors. As a first attempt, you could try to reduce and visualize your response space with PCA, other (nonlinear) multivariate projections (like UMAP), or Multidimensional Scaling, to better understand the similarities and trade-offs of your experiments. If you have benchmarks, you can add them to the table before launching this type of analysis, to visualize and extract the distances between your experiments and these benchmarks. Once you have reduced your response space, you can fit models to these transformed "aggregate" responses.

Once you have models fitted to the responses, you can use the Simulator to generate synthetic data points in your experimental space and use the prediction formula of your models to predict the response values (to see how close they match your objectives), or even save and use the desirability formula columns for each response to calculate desirabilities on this synthetic dataset, and better highlight the Pareto-optimal front of the best candidates (either synthetic data or real experiments): data points that show the best overall desirability value for different responses trade-off/importances.
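If you do end up scripting the Generalized Regression part, a minimal launch could look like the sketch below. This is untested and the column names are placeholders; the exact Estimation Method and Validation Method arguments should be checked in the Scripting Index of your JMP Pro version (the easiest way is to fit one response interactively and save its script).

```jsl
// Untested sketch: Generalized Regression, Pruned Forward Selection + AICc
dt = Current Data Table();
dt << Fit Model(
	Y( :Y1 ),  // placeholder response; loop over your other responses
	Effects(
		:X1 & RS, :X2 & RS, :X3 & RS,
		:X1 * :X2, :X1 * :X3, :X2 * :X3,
		:X1 * :X1, :X2 * :X2, :X3 * :X3
	),
	Personality( "Generalized Regression" ),
	Generalized Distribution( "Normal" ),
	Run(
		Fit(
			Estimation Method( Pruned Forward Selection ),
			Validation Method( AICc )
		)
	)
);
```

From the fitted report you can then save the prediction formulas (or publish them to the Formula Depot) and activate the Profiler from the red triangle menu.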

 

Hope these suggestions may help you,
Enjoy end of year celebrations !

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)
