Good morning, I'm currently learning JMP along with DSDs and optimization. I ran my first experiment and would like feedback on the analysis and next steps. I'm using JMP Pro 17.
The experiment objective: Maximize the capture/recovery of bacteria onto magnetic particles.
The setup: I picked the 7 variables I thought could influence the outcome and tested them in 22 runs (2 blocks; it takes me 2 days to run that many samples). The model returned 3 significant variables, but it appears that 2 of those 3 could benefit from a wider testing range. I say this based on the prediction profiler graphs. Widening the ranges is possible and feasible (to an extent), so I can retest them.
1. Do I need to rerun the DSD with all 7 variables, or should I just run a central composite design with the significant variables that covers the new ranges of those 2 variables? I'm leaning towards the CCD, since I don't think the increased ranges will change the non-significance of the other variables, and I can then use the CCD to predict the optimal parameters.
2. I assume the DSD model could not predict the optimal parameters because I did not have the correct ranges; is this right? Or am I missing that output somewhere? I thought that since fewer than 50% of my variables were significant, the design could then be used for optimization.
3. Am I missing something in the output that I should be looking at? Again, I'm new, and the output has a lot of data to digest. My goal is ultimately to maximize the recovery of bacteria, so I focused on which variables are significant and not really on the rest of the output. Are there specific things I should focus on or pay special attention to besides the p-values? I did find some messages about the free online DOE class, so I will be taking that, but I would appreciate specific feedback on my output so that I can relate what that class teaches to my own data.
I tried to attach the fit output, but I keep getting an error message that the JMP file type is not supported, so I attached it as a PDF instead.
Thanks in advance for the help/feedback!
Hi @Feline614,
Welcome to the Community!
I'm afraid there might not be enough information to guide you precisely on your topic, but I will try my best to give you some help and feedback. For further guidance, could you share an anonymized dataset (Anonymize Data (jmp.com)) so I can evaluate and help better?
Your next steps are based on the outputs of a model and the detection of statistically significant effects. It seems you have tried only one analysis type, "Fit DSD" (which is the recommended analysis for a DSD), but have you tried other modeling approaches? If you have JMP Pro, you could try Generalized Regression with different estimation methods, like "Pruned Forward Selection" (with an AICc or BIC validation method), "Two-Stage Forward Selection", or possibly "Best Subset" (only if you have a limited number of factors, as it tests many models combining main effects and higher-order terms); or simply test Stepwise Regression as an alternative model.
Before moving forward, it could be interesting to see whether the different candidate models agree on the detection of statistically significant effects, and to check if/where they differ. That would give you a more reliable overview of the important factors and effects in your study.
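If you want to build intuition for how these selection methods work outside of JMP, here is a minimal Python sketch of plain forward selection scored by AICc on synthetic data. The data, factor names (X1-X7), and coefficients are all invented, and JMP's "Pruned Forward Selection" additionally prunes terms after each addition, which this sketch omits:

```python
# Forward selection scored by AICc on synthetic data (illustration only).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 22  # same run count as the DSD above
df = pd.DataFrame(rng.uniform(-1, 1, size=(n, 7)),
                  columns=[f"X{i}" for i in range(1, 8)])
df["Y"] = 2 * df["X2"] + 1.5 * df["X6"] + rng.normal(0, 0.2, n)

def aicc(fit):
    # Small-sample correction of AIC: AICc = AIC + 2k(k+1)/(n-k-1)
    k = fit.df_model + 1  # parameters including the intercept
    return fit.aic + 2 * k * (k + 1) / (fit.nobs - k - 1)

selected, remaining = [], [f"X{i}" for i in range(1, 8)]
best = aicc(smf.ols("Y ~ 1", df).fit())  # intercept-only baseline
while remaining:
    scores = {t: aicc(smf.ols("Y ~ " + " + ".join(selected + [t]), df).fit())
              for t in remaining}
    term, score = min(scores.items(), key=lambda kv: kv[1])
    if score >= best:
        break  # no candidate improves AICc; stop adding terms
    selected.append(term)
    remaining.remove(term)
    best = score

print("Selected terms:", selected, "| AICc:", round(best, 2))
```

The same loop works with BIC by swapping the scoring function, which typically selects fewer terms because BIC penalizes model size more heavily.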
Concerning your other questions:
Some resources to help you understand DSDs:
Introducing Definitive Screening Designs - JMP User Community
Definitive Screening Design - JMP User Community
The Fit Definitive Screening Platform (jmp.com)
Using Definitive Screening Designs to Get More Information from Fewer Trials - JMP User Community
And more generally on DoE:
DoE Resources - JMP User Community
Hope this answer will help you,
Hi @Victor_G, I really appreciate you taking the time to write me a thorough response!
I attached an anonymized dataset.
I ran the various models you mentioned (attached), except "Two-Stage Forward Selection" was not an option (it was grayed out). The stepwise approach added one variable at p = 0.12; it looked like it might influence the outcome per the prediction profiler, but biologically, in this specific experiment, I wouldn't expect it to be significant (I will be repeating this experiment with different bacteria and I expect it to be significant in those tests). "Pruned Forward Selection" with AICc and Best Subset with AICc bumped one variable to p = 0.0564, but it would make biological sense for it to have a significant effect on the system. "Pruned Forward Selection" with BIC and Best Subset with BIC were similar to stepwise, with the addition of the variable at p = 0.12. I also deleted the non-significant variables (p > 0.05), but that did not change the significance of the other factors.
Overall, I'm not surprised by the variables the models selected, except X_7; I thought for sure that one would have a significant effect.
Do you have a good resource that explains these models at a beginner level? I found some resources on JMP, but I only learned the stepwise approach in college. I don't actually know how to interpret the other models you mentioned, beyond a general understanding of a p-value.
1. I created an augmented design. I eliminated 2 variables (X_4 and X_5) because biologically they should not be significant in this system (they may be in a future test, though). I left all other factors the same, except the 2 significant factors that appear to benefit from testing a higher range (X_6 and X_8). I did not change the default number of runs, but I did block them. Should I change the number of runs? At ~35:56, the video you sent (Using Definitive Screening Designs to Get More Information from Fewer Trials - JMP User Community) states that a weakness of DSDs is that the "Factor range for screening may not include optimum so, follow on design will be over different ranges - really can't augment." But, like you suggested, I can augment it with different values... is this going to be an issue? Do I need to do anything differently when I go to analyze the data?
2. I can rerun those 2 points. When I enter the new results, should I just replace those points, or somehow create a new block to include the new values? If I need to create a new block, how do I do it? I can't explain why they appear to be erroneous; the point to the left could have variability due to the low concentration of bacteria (a Poisson distribution issue). The point to the right is a little more perplexing, but I am working with bacteria, so an outlier every now and then isn't abnormal.
Regarding RMSE, I'm a bit confused. I understand that a low RMSE should be good and would reflect that the model predicts outcomes well, but I can't find any literature on anyone doing something similar to what I'm doing, so I don't know what to compare it to. Based on what I know about the system, I would not expect a lot of noise for this particular experiment. This particular bacterium and system give me pretty consistent results (I have more variability with a different bacterium, but that is a problem for next week).
3. I plan to rerun the two "strange" points. I added the regression assumptions to the attachment "Fit Least Squares." The residual-by-predicted plot may have some clustering?
I agree, the blocking should be a random effect. I found in another discussion post that you can set it when you initially create the design, but I can't find the "box" to check (I assumed random was the default, but I guess not). I made it a random effect, and it did affect the even-order effects. The "Lack of Fit" box is grayed out, though... I'm not sure why, or what that means. In simple terms, when there is an "even-order effect," what does it mean when it is X_2*X_2? I understand something like X_2*X_3, where there is an interaction between the two variables, but I don't understand how a variable interacts with itself.
Thank you for the resources, they were extremely helpful, especially the help pages. I'm still overwhelmed by the level of statistics, but I appreciate your patience in teaching me and helping me become a better scientist. This will be the first of many DSDs; I'm glad I started with the "simplest" one!
Hi @Feline614,
Thanks for your response. Next time it might be easier to provide a JMP file instead of several Excel and PDF files: with a JMP file you can keep the analyses through saved scripts, and keep all column information and properties (particularly important for DoE, as each factor may have up to 3 column properties that I otherwise need to add manually when importing data from Excel).
The different models you created seem quite consistent, with a lot of similarities, so that's a good sign.
I'm surprised not to see any interactions or quadratic effects in any model, as DSDs are quite effective at detecting them (as long as there aren't too many). I relaunched the Generalized Regression platform including interaction terms and quadratic effects for X2 to X8 (keeping X1 as a fixed block effect), and the models seem to benefit from these added terms, both in explanatory power (R² increases up to 0.95) and predictive power (RASE decreases to 0.05). There may be an interaction between X6 and X8 and a quadratic effect for X2. I suspect the statistically significant Lack-of-Fit test seen in your initial model may be linked to these missing higher-order terms.
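For intuition about what these higher-order terms add, here is a minimal sketch (outside JMP, with invented data and coefficients, not your actual dataset) of a least squares fit that includes the X6*X8 interaction and the quadratic effect of X2:

```python
# Least squares fit with an interaction and a quadratic term (illustration only).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 22
df = pd.DataFrame(rng.uniform(-1, 1, size=(n, 3)), columns=["X2", "X6", "X8"])
df["Y"] = (0.5 * df["X6"] + 0.4 * df["X6"] * df["X8"]
           - 0.6 * df["X2"] ** 2 + rng.normal(0, 0.05, n))

# X6:X8 is the interaction; I(X2**2) is the quadratic effect (X2*X2 in JMP notation)
fit = smf.ols("Y ~ X2 + X6 + X8 + X6:X8 + I(X2**2)", df).fit()
print(fit.summary().tables[1])
print("R² =", round(fit.rsquared, 3))
```

Leaving those two terms out of the formula and refitting would show the same symptom you saw: a worse fit whose unexplained structure shows up in the lack-of-fit and residual diagnostics.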
The models also agree very closely with each other about the included terms:
Since you are interested in screening and optimization at this initial DSD stage, using an information criterion like AICc as the validation method in Generalized Regression makes sense: you want a model that is both explanatory (keeping only the most important variables) and predictive (keeping the variables that help improve predictions and lower the RASE/RMSE).
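For reference, AICc is AIC with a small-sample correction, where $n$ is the number of runs and $k$ the number of estimated parameters (lower is better; the penalty grows sharply as $k$ approaches $n$, which matters for small designs like DSDs):

$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1}$$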
If you're looking for resources about Generalized Regression, here are my first thoughts:
Fitting a Linear Model in Generalized Regression
About your other points:
The good part about using the Block as a random effect is that, through a mixed model, you can assess whether the blocking variable has a statistically significant effect on the response variance. In your case it doesn't seem to, so you may remove this blocking variable from your model. Running models with the blocking variable as a fixed effect also shows that the block has no statistically significant impact on the mean response, so it can be removed from future analyses.
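If it helps to see the idea outside of JMP, here is a minimal mixed-model sketch with Block as a random intercept; the data, names, and block shift are all invented, and with only two blocks the variance estimate is necessarily rough:

```python
# Mixed model with a random block intercept (illustration only).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 22
df = pd.DataFrame(rng.uniform(-1, 1, size=(n, 2)), columns=["X6", "X8"])
df["Block"] = ["Day1"] * 11 + ["Day2"] * 11
df["Y"] = (df["X6"] + 0.5 * df["X8"]
           + df["Block"].map({"Day1": 0.0, "Day2": 0.1})  # small invented day shift
           + rng.normal(0, 0.1, n))

# Random intercept per block; a block variance component near zero suggests
# the blocking variable can be dropped, as discussed above.
m = smf.mixedlm("Y ~ X6 + X8", df, groups=df["Block"]).fit()
print(m.summary())
```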
Concerning the term X2*X2, it's not an interaction but a quadratic effect of X2: it means the response is linked to X2 by a quadratic term, Y = X2² (+ other terms in the model), so the model can capture curvature in the response rather than only a straight-line trend.
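In equation form (a generic quadratic response, not your fitted model), the squared term adds curvature, and when $\beta_{22} < 0$ the response has an interior maximum, which is what lets the profiler show an optimum inside the tested range:

$$Y = \beta_0 + \beta_1 X_2 + \beta_{22} X_2^2 + \dots, \qquad X_2^{\ast} = -\frac{\beta_1}{2\,\beta_{22}}$$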
I attached the JMP file with all the models and scripts tested for this answer.
I hope this follow-up will help you,
I'm sorry you had to put in all the data by hand. Also, sorry for the delay, but I wanted to respond with the next batch of results. I believe I attached the correct format this time.
I ran the augmented design. I removed only 1 term based on my 'expertise' and left all other significant and non-significant factors from the original DSD. Runs 1-22 are the original DSD; runs 23-34 are the augmented design; runs 35 & 36 are independent replicates of row 5; runs 37 & 38 are independent replicates of row 10 (rows 5 and 10 were 'outliers' in the original DSD).
Can you check my logic so I can confirm I am running and analyzing the data correctly?
How did you create the graph you embedded? I found these directions, but they didn't result in the nice visual comparison you provided: https://www.jmp.com/support/help/en/16.1/?os=win&source=application#page/jmp/model-comparison.shtml#
Biologically, the variables marked as significant make sense. What does not make sense to me is X_6: the original DSD range was 1-5, and it was augmented to 1-20. Biologically, going above 20 shouldn't make a difference, which is what the least squares model shows, but not the others.
How would you know which model to choose? How do you account for the maximum output value being 1.0? How would you go about finding the optimal parameters for the variables?
Thank you!
Hi @Feline614,
So the models pretty much agree on the included terms, which is a good sign.
Concerning your remark about X6: you increased the range to 20, but I wouldn't extrapolate the model's predictions outside the tested range. It looks like there is an increasing linear trend of the response with X6, but maybe this trend stops after X6 reaches 20. Or maybe not, and that would be new knowledge for you; you can't be sure unless you test outside this range. You can create validation/test points outside the range and compare the measured values with the predicted values, to see if there is any deviation from the model that would indicate the increasing linear trend stops or changes outside the tested range for X6.
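As a sketch of that check (with placeholder coefficients and invented measurements, not your model or data), you could compare measured and predicted values at a few points beyond the tested X6 range and look for systematic residuals:

```python
# Checking extrapolation with a few validation points beyond the X6 range.
import numpy as np

# Hypothetical fitted linear trend for X6 over the tested range 1-20
# (placeholder intercept/slope, not your model's coefficients)
def predict(x6):
    return 0.02 + 0.048 * x6

x6_new = np.array([25.0, 30.0, 40.0])    # validation points beyond the range
measured = np.array([0.97, 0.98, 0.97])  # invented lab measurements
residuals = measured - predict(x6_new)
print(residuals)  # consistently negative residuals suggest the trend flattens
```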
Since there is indeed a quadratic effect for X6 included in the Least Squares model, supported by your domain knowledge, you can also force the inclusion of these terms (the main effect and quadratic effect for X6) in the model construction: Advanced Controls (jmp.com)
Regarding your general question, there is no single definitive model to choose: you can select several similar models, if that makes sense statistically and experimentally (domain expertise). What I would do is:
About X1, this is linked to your objective and what you want to do with the results: are you investigating the system only in a lab environment, or do you want your findings to be used in another environment? If it's the latter, you may run some confirmatory/validation runs in that new environment, to make sure the findings at lab scale can be transferred to a different environment. From an analytical point of view, it may not be a problem if X1 is handled differently between the lab and the other environment: in your lab environment, you can maximize desirability with no constraints added in the profiler; in the other environment, knowing the fixed value of X1, you can lock the X1 value and optimize the other factors to maximize desirability: Set or Lock Factor Values (jmp.com)
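As a rough illustration of the lock-and-optimize idea (using a placeholder response surface and invented bounds, not your fitted model), the equivalent outside JMP would be:

```python
# Hold X1 at its fixed value and maximize the predicted response over X6, X8.
from scipy.optimize import minimize

# Placeholder response surface standing in for the saved prediction formula
def predicted_y(x1, x6, x8):
    return 0.2 + 0.01 * x1 + 0.04 * x6 + 0.03 * x8 - 0.001 * x6 * x8

X1_LOCKED = 2.0  # fixed value imposed by the other environment

res = minimize(
    lambda v: -predicted_y(X1_LOCKED, v[0], v[1]),  # maximize = minimize negative
    x0=[10.0, 5.0],
    bounds=[(1.0, 20.0), (1.0, 10.0)],  # stay inside the tested factor ranges
)
print("Best X6, X8:", res.x, "| predicted Y:", -res.fun)
```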
Hope this follow-up will help you,