I need advice with setting up a DoE.
The goal of the DoE is to build a (logistic regression) model for a binary response (y) that depends on five continuous variables (x1, x2, x3, x4, x5). The five variables must add up to 1, so a mixture response surface design may be ideal. As mentioned, the response is pass/fail depending on the combination of variables. In addition, there are constraints on some of the continuous variables, for example 0.4 < x4 < 0.5.
I tried a mixture response surface design with these constraints. When I simulated responses and built a model to check the model diagnostics, I found that the VIFs are too high for the constrained variables, which suggests confounding effects. I think the high VIFs are caused by the high correlation between x5 and its cross terms with the other variables. This can be confirmed by looking at the correlation of estimates.
With that background, I have a couple of questions.
1. Is a mixture response surface design the best approach for my problem? Or should I try something like a sequential design approach? There is a bit of literature on sequential designs for binary response models.
2. How can I reduce the VIF or the confounding between a constrained variable and its cross terms?
I appreciate any help that will solve my problems. Please let me know if more information is required.
The VIFs in a mixture design are very large because the components are not independent: they must sum to 1. In other words, although you are investigating 5 factors, you only need 4 of them to determine the 5th.
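To make the point above concrete, here is a minimal Python/NumPy sketch (not JMP output). The design points are hypothetical random mixtures, used only for illustration. It shows that the fifth component can be recovered exactly from the other four, and that adding an intercept column to the five mixture columns gives a rank-deficient model matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5-component mixture points: Dirichlet draws sum to 1 by construction.
X = rng.dirichlet(np.ones(5), size=20)

# The fifth component is fully determined by the other four.
x5_from_others = 1.0 - X[:, :4].sum(axis=1)
max_gap = np.max(np.abs(X[:, 4] - x5_from_others))

# With an intercept, the six columns are linearly dependent (rank 5, not 6),
# which is why mixture models are usually fit without an intercept.
with_intercept = np.column_stack([np.ones(len(X)), X])
rank = np.linalg.matrix_rank(with_intercept)

print("largest gap between x5 and 1 - (x1+x2+x3+x4):", max_gap)
print("rank of [1, x1..x5]:", rank)
```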
This may have been answered in this JMP User community thread...
A great reference for this topic is Cornell's book on Mixture Design.
Thank you for the response LouV!!
I do understand how a mixture design makes the factors interdependent. However, in my case the VIFs skyrocket once one of the factors has a narrow range. Attached are two JMP files. The first is a custom design (CD_5var.jmp) with five mixture factors, each ranging from 0 to 0.5. The model in this case is decent, with all VIFs less than 5. The second design (CD_5var_narrow.jmp) is similar to the first except that the range of x5 is narrow (0.5 < x5 < 0.6). In this case the VIF for x5 is ~800, and the x5*xi terms also have higher VIFs. I also looked at the correlation of estimates, and it clearly shows high correlation of x5 with the cross terms.
So, the question is: does the high VIF in a mixture design (especially for the narrow range factor) matter in the final model? The high VIF seems to be due to correlation with its cross terms.
And to repeat the other question: is there an alternative to the custom mixture design that will reduce the issues with VIFs? For example, a sequential design approach?
LouV, your link on the impact of the coding property on the VIFs is interesting. I will follow up and see if that helps.
Note: The response 'y' is a simulated response. Eventually this will be a binary response.
There is good news and bad news.
The bad news is that the VIFs are a result of the relationships between your X's. As pointed out already, mixtures will ALWAYS have some relationship. That relationship is even stronger when you have narrow ranges for any of the components (as you have seen). It will not matter what approach you use to create the design, that collinearity will be there.
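The effect of a narrow range can be sketched outside JMP with a few lines of Python/NumPy. Everything here is an assumption made for illustration: the points are random draws from the two regions described earlier in the thread rather than an optimal custom design, and the "VIF" is an uncentered version suited to a no-intercept Scheffé model (the diagonal of the inverse Gram matrix after scaling each column to unit length), which is not necessarily what JMP reports. With a setup like this I would expect the terms involving x5 to come out far larger in the narrow case.

```python
import itertools
import numpy as np

def scheffe_quadratic(X):
    """Scheffe quadratic model matrix: linear terms plus all xi*xj cross terms."""
    pairs = itertools.combinations(range(X.shape[1]), 2)
    cross = [X[:, i] * X[:, j] for i, j in pairs]
    return np.column_stack([X] + cross)

def uncentered_vif(M):
    """VIF-like diagnostic for a no-intercept model (an assumption, not JMP's
    definition): diagonal of the inverse Gram matrix of unit-length columns."""
    Z = M / np.linalg.norm(M, axis=0)
    return np.diag(np.linalg.inv(Z.T @ Z))

rng = np.random.default_rng(1)

# Wide region: every component between 0 and 0.5 (simple rejection sampling).
draws = rng.dirichlet(np.ones(5), size=2000)
wide = draws[draws.max(axis=1) <= 0.5][:200]

# Narrow region: x5 between 0.5 and 0.6, the remainder shared by x1..x4.
x5 = rng.uniform(0.5, 0.6, size=200)
rest = rng.dirichlet(np.ones(4), size=200) * (1.0 - x5)[:, None]
narrow = np.column_stack([rest, x5])

vif_wide = uncentered_vif(scheffe_quadratic(wide))
vif_narrow = uncentered_vif(scheffe_quadratic(narrow))

# Column 4 is x5; columns 5 onward are the cross terms in pair order.
print("x5 VIF, wide region:  ", vif_wide[4])
print("x5 VIF, narrow region:", vif_narrow[4])
```

Changing how the points are generated changes the numbers but not the near-dependence itself, since x5 is almost constant over the narrow region and each x5*xi term is then almost a multiple of xi.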
The good news is that the VIFs can be reduced somewhat in the analysis by using pseudocomponent coding. JMP will turn this feature on for you automatically. Pseudocomponent coding will reduce the collinearity as much as possible by coding your mixture factors to "expand" the range as much as possible.
When you fit your Scheffé mixture model in JMP, you can see that pseudocomponent coding is being used in the Parameter Estimates table. A term that looks like (X-0.2)/0.7 indicates that the coding is in effect.
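For anyone who wants to see the arithmetic, here is a small Python sketch of the textbook L-pseudocomponent transform, x'_i = (x_i - L_i) / (1 - sum of all L_j), which has the same shape as the (X-0.2)/0.7 term mentioned above. The lower bounds and design points are hypothetical, chosen to mimic the narrow x5 case in this thread, and JMP's exact implementation may differ.

```python
import numpy as np

def l_pseudocomponents(X, lower):
    """Textbook L-pseudocomponent coding: subtract each lower bound and
    rescale by the amount of mixture left once all lower bounds are met."""
    lower = np.asarray(lower, dtype=float)
    return (X - lower) / (1.0 - lower.sum())

# Hypothetical lower bounds: only x5 has a nonzero minimum (0.5).
lower = np.array([0.0, 0.0, 0.0, 0.0, 0.5])

# Hypothetical design points in the original component scale (rows sum to 1).
X = np.array([
    [0.50, 0.00, 0.00, 0.00, 0.50],
    [0.00, 0.40, 0.00, 0.00, 0.60],
    [0.10, 0.10, 0.10, 0.15, 0.55],
])

P = l_pseudocomponents(X, lower)
print(P)              # coded components
print(P.sum(axis=1))  # each coded row still sums to 1
```

In this example x1 through x4 stretch from 0 to 1 in the coded scale, while x5 only spans 0 to 0.2 because it also has an upper bound of 0.6. That is consistent with the coding reducing the collinearity only "somewhat".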
JMP Education has a course on the design and analysis of mixture experiments that covers pseudocomponent coding and much more.
"Does the high VIF in a mixture design (especially for the narrow range factor) matter in the final model? The high VIF seems to be due to correlation with its cross terms."
One thing that comes to mind when I read the question above: perhaps you could run some validation trials in addition to the DOE trials to validate the final model, similar to the training, testing, and validation sets used with the Partition and Neural platforms.
VIFs aren't very important for mixture designs, and that's a good thing, because they are always high. If your goal is to make predictions, the VIFs are irrelevant. If your goal is screening, you need a different technique than comparing the effects. A good regression model can predict well even if it can't estimate effects well. A better test of design quality in this case would be the condition number, but I don't think JMP provides this (although you can get it if you write a script). If the log of the condition number is greater than 5, beware; otherwise, you should be fine.
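As a rough illustration of the condition number check (in Python/NumPy rather than a JMP script), the sketch below computes the log of the condition number of a model matrix. Several things here are assumptions on my part: the log is taken base 10, the columns are scaled to unit length first, the model contains only the linear Scheffé terms, and both three-component designs are made up for the example.

```python
import numpy as np

def log10_condition_number(M):
    """log10 of the 2-norm condition number of the column-scaled model matrix.
    Base 10 and the column scaling are assumptions, not a JMP definition."""
    Z = M / np.linalg.norm(M, axis=0)
    return np.log10(np.linalg.cond(Z))

# Hypothetical unconstrained design: the three pure components.
vertices = np.eye(3)

# Hypothetical constrained design: x3 restricted to the range 0.5 to 0.6.
narrow = np.array([
    [0.500, 0.000, 0.50],
    [0.000, 0.500, 0.50],
    [0.400, 0.000, 0.60],
    [0.000, 0.400, 0.60],
    [0.225, 0.225, 0.55],
])

log_cn_vertices = log10_condition_number(vertices)
log_cn_narrow = log10_condition_number(narrow)

print("log10 condition number, pure components:", log_cn_vertices)
print("log10 condition number, narrow x3 range:", log_cn_narrow)
```

The pure-component design is perfectly conditioned, so its log condition number is 0, and restricting x3 pushes the value up. For a real design you would build the full model matrix, including the cross terms, before applying the rule of thumb above.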