Data analysis on mixture design

PepinLB · Mar 26, 2024 4:47 AM

Hello everyone,

I've tried to create a sort of mixture design using a script.

The script allows me to have 10 components (a1, a2 - b1, b2, b3, b4 - c1, c2 - d1, d2) and my mixture would consist of only:

a1 or a2 or 0

b1 or b2 or b3 or b4 or 0

c1 and/or c2

d1 or d2 or 0

By selecting the steps, the script provided me with an experimental design table (which I enriched with new experiments and pure components because I had biased coefficients).

For model analysis, I used a second-degree model. I expected it not to work with so few experiments for so many coefficients, but for certain properties, it did work.

I'll illustrate with 2 properties.

For p2, the fit works well; I have a model that fits both observation and prediction.

For p4, the fit works well in observation, but I have a huge standard deviation and the prediction performs poorly, with a PRESS R-squared < 0.

Would you have any feedback or suggestions for improving the quality of my model?

Thank you in advance.

statman · Mar 26, 2024 10:19 AM

Sorry, but I'm not sure what you did. I'm not even sure this is a mixture design. I'm confused about the factors. Are there really 4 components a, b, c, d? Are a1 and a2 different levels of a? Factor b has 4 levels? If so, your data table is not set up correctly. I also don't understand what you mean by "which I enriched with new experiments and pure components because I had biased coefficients".

Analysis of mixture designs does not follow typical experimental design analysis. Mixture designs will have multicollinearity, so interpretation of the statistics is a bit more challenging. This is because the components of a mixture design are not independent. As you change the level of 1 factor, it impacts the levels of the other factors. The sum of the components of the formulation = 1.
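As a rough illustration of that dependence (a Python/NumPy sketch, not JMP output; the 20 blends of 4 components are made-up random data), the sum-to-1 constraint makes an intercept column an exact linear combination of the component columns, so adding it does not raise the rank of the design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 random blends of 4 components, each row sums to 1.
X = rng.dirichlet(alpha=np.ones(4), size=20)

# Add an intercept column, as a standard (non-mixture) regression would.
with_intercept = np.column_stack([np.ones(len(X)), X])

rank_no_intercept = np.linalg.matrix_rank(X)
rank_with_intercept = np.linalg.matrix_rank(with_intercept)

# Both ranks are 4: the fifth column (the intercept) adds no new information,
# because it equals the sum of the four component columns.
print(rank_no_intercept, rank_with_intercept)
```

This is why mixture models are usually fit without a separate intercept, and why the coefficients cannot be read as independent effects.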

https://www.jmp.com/support/help/en/17.2/?os=mac&source=application#page/jmp/mixture-designs.shtml

I primarily use the mixture surface profiler to analyze mixture designs.

"All models are wrong, some are useful" G.E.P. Box

Dan_Obermiller · Mar 26, 2024 10:25 AM

And to add to @statman's questions, how many overall runs were conducted? Why would you put a1*a2 in the model when you never had a1 and a2 in the same blend? Is JMP even analyzing this as a mixture? I'm not sure it is, but you need to check by looking at the Analysis of Variance table and seeing the message: Tested against reduced model Y=mean. The model for p2 might look good, but it is clearly an overspecified model, so you are likely overfitting. That means the model would not predict future observations very well. Anytime your parameter estimates contain the words "Zeroed" and "Biased", JMP is automatically removing terms from the model for you in order to get any results. There is no guarantee that the correct terms were removed. Therefore, you should never trust the results when you see those words in your parameter estimates table.
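To make the overspecification point concrete, here is a small Python sketch (not JMP, and not the original script) that counts model terms. It assumes the "second-degree model" is a full Scheffé quadratic (linear terms plus all two-way cross terms), and it takes the exclusivity rules from the first post: at most one a, one b, and one d subtype per blend, with c1 and c2 allowed together.

```python
from itertools import combinations

# Component groups as described in the original post.
groups = {
    "a": ["a1", "a2"],
    "b": ["b1", "b2", "b3", "b4"],
    "c": ["c1", "c2"],
    "d": ["d1", "d2"],
}
# Groups where at most one subtype can appear in a blend (c is not exclusive).
exclusive = {"a", "b", "d"}

components = [c for members in groups.values() for c in members]
group_of = {c: g for g, members in groups.items() for c in members}

all_pairs = list(combinations(components, 2))

# Cross terms whose two components never appear in the same blend:
# their product column is all zeros, so the data cannot estimate them.
never_together = [
    (x, y) for x, y in all_pairs
    if group_of[x] == group_of[y] and group_of[x] in exclusive
]

n_full = len(components) + len(all_pairs)   # 10 linear + 45 cross = 55
n_supported = n_full - len(never_together)  # 55 - 8 = 47

print(n_full, len(never_together), n_supported)
```

Under these assumptions the full quadratic has 55 terms, 8 of which can never be estimated, and the remaining 47 still need at least 47 well-chosen runs, plus more to estimate error.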

It may be of benefit for you to attach your data table so that it can be looked at more closely. You have a lot of very unusual things going on with this "design".

Dan Obermiller

PepinLB · Mar 27, 2024 04:21 AM

Hi,

Thanks for your reply. The script gave me 36 or 40 runs, and I added many runs because the coefficients for a1, a2... were not accurately defined. So I added runs with 100% of a1, a2... to get the actual values of those coefficients.

I did not put a1*a2 in my model (it is = 0).

I attached the file.

Dan_Obermiller · Mar 29, 2024 8:45 AM

Thank you for sending the data. The section of your output labeled "confusions" suggests that your model is overspecified, meaning you have terms in the model that the data does not support estimating. This can lead to biased estimates which can lead to an increase in the variance.

I analyzed both responses by just using a main effects model. No cross terms at all, and here are my results:

The model for p2 does look pretty good, even without seeing the residual plots. Similarly, the model for p4 looks good. Again, I did not include the residual plots. Notice that I do not see the high standard deviation or the negative PRESS that you found.

Without seeing similar results to you, I am guessing we used very different models. I would further guess that the high standard deviation and negative PRESS that you saw are caused by your model overfitting the data. By specifying a model with too many terms in it, the fit on the observed data will be great, but it might be too good. In other words, the model fits the signal AND some of the noise. By fitting some of the noise, that will lead to the increased variance and the model will not fit new observations very well. Another indication of overfitting is a fairly large discrepancy between RSquare and Adjusted RSquare (which are not part of your analysis pictures). Overfitting is one reason why just looking for high RSquare values is not always the best thing.

That is my best guess on what is going on: too many model terms and that has caused overfitting. Take a look at your RSquare and Adjusted RSquare. They should be fairly close to each other, maybe within 0.07 or so (just a rough idea on how close is close). To correct the problem, try removing unnecessary model terms, especially the ones that are not supported by the data that was collected.
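For anyone who wants to check these diagnostics outside JMP, here is a minimal Python/NumPy sketch of RSquare, Adjusted RSquare, and PRESS RSquare for a no-intercept mixture model, measured against the mean model. The 3-component data set and the `fit_summary` helper are hypothetical and only for illustration; this is not the data from this thread.

```python
import numpy as np
from itertools import combinations

def fit_summary(X, y):
    """Return (R2, adjusted R2, PRESS R2) for a least-squares fit of y on X.

    Assumes X spans the constant (true for Scheffe mixture terms), so the
    mean model is nested and R2 is measured against Y = mean.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    leverage = np.diag(X @ np.linalg.pinv(X))   # diagonal of the hat matrix
    loo_resid = resid / (1.0 - leverage)        # leave-one-out residuals
    sst = np.sum((y - y.mean()) ** 2)
    sse = np.sum(resid ** 2)
    press = np.sum(loo_resid ** 2)
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (sse / (n - p)) / (sst / (n - 1))
    press_r2 = 1.0 - press / sst
    return r2, adj_r2, press_r2

rng = np.random.default_rng(1)
n = 12
# Hypothetical blends of 3 components; the true response is purely linear.
X_lin = rng.dirichlet(np.ones(3), size=n)
y = X_lin @ np.array([10.0, 20.0, 30.0]) + rng.normal(0.0, 1.0, size=n)

# Quadratic model: add all two-way cross terms (not needed by the truth).
cross = np.column_stack(
    [X_lin[:, i] * X_lin[:, j] for i, j in combinations(range(3), 2)]
)
X_quad = np.column_stack([X_lin, cross])

small = fit_summary(X_lin, y)
big = fit_summary(X_quad, y)
print("linear:   ", small)
print("quadratic:", big)
```

The larger model can never have a lower RSquare than the smaller one nested inside it, which is why RSquare alone cannot flag overfitting. Adjusted RSquare and PRESS RSquare penalize the extra terms, and PRESS RSquare can go negative when the leave-one-out predictions are worse than simply predicting the mean.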

Dan Obermiller

Dan_Obermiller · Mar 28, 2024 09:44 PM

Oh, and one more thing that further supports my guesses is that there are several missing responses for the p4 response. That means that you will not be able to fit the same model for p4 as you did for p2.

Dan Obermiller

PepinLB · Mar 27, 2024 03:57 AM

Hi,

Thanks for your feedback.

I can confirm this is indeed a mixture design with 10 factors.

a, b, c and d are types of component, and the associated number is a subtype, not a level. As you can see, each row sums to 1.

I included those constraints because, for instance, I wanted to avoid having a1 and a2, or b1 and b2, together in my blend. Let's imagine a pastry recipe: I may want to include a fruit in my recipe. It could be either lemon zest (a1) or orange juice (a2), or neither. These are indeed two distinct components and not categorical variables.

"which I enriched with new experiments and pure components because I had biased coefficients"

I meant that I added new runs because the coefficients for a1, a2... were not accurately defined. So I added runs with 100% of a1, a2... to get the actual values of those coefficients.

For the same purpose, since I had many biased coefficients, I did new runs to get a value, but as I said, the standard deviation is still high and I don't have a good prediction.
