This is more of a conceptual question related to DSDs and their reliance on the sparsity of effects/Pareto principle. I am trying to create a DoE for characterization of a bioprocess for manufacturing of drug product. The goal is to collect data for the parameters we've identified, create a model, and use the prediction profiler in JMP to set limits for each of the parameters so there are no "failures" in, say, 10,000 runs. When selecting our parameters, we do a risk analysis of all the parameters in our process. So we began with about 40 parameters. Then we ranked them based on risk of impact to the process outputs. After ranking them, we eliminated the bottom ~80% of parameters, leaving us with about 10 factors to test. What I am curious about is whether it is still appropriate to assume the sparsity of effects principle that the DSD seems to rely on (if I'm understanding it correctly) in this situation, since we already eliminated what we considered to be the bottom ~80% of parameters. Shouldn't a large portion of the remaining factors be significant then? Granted, out of the 40 parameters we began with, a lot of them were minor and we included them only to do our due diligence.
My follow-up question: if we have budget to test ~48 conditions, would it be better to do a custom design and simply remove interaction effects I don't believe will be important until I get the N down to below 48? A DSD only requires 25 conditions for 10 continuous factors, and since I don't have much characterization experience I'm a little worried it won't provide enough data to confidently model where we should put our process limits if any of the parameter ranges we have decided to test lead to results that fall outside our output specifications. I realize we could always do a DSD and then supplement if we need more runs, but this would be less efficient for us (since it would require more blocks) than just starting with a design that uses all 48 runs.
Thanks for your help!
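To put rough numbers on the run-budget question above, here is a minimal sketch that counts runs and model terms. It is an illustration only, not JMP output. The function names are hypothetical, and the DSD run-count rule (2m + 1 runs for an even number of factors m, 2m + 3 for an odd number) is the usual conference-matrix count, stated here as an assumption; software may use a different minimum for very small factor counts. Under that rule, 10 factors need 21 runs, so the 25 quoted above would be consistent with 4 extra runs having been added.

```python
from math import comb


def dsd_min_runs(n_factors: int) -> int:
    """Assumed minimum DSD run count for continuous factors.

    Assumption: 2m + 1 runs for even m, 2m + 3 runs for odd m.
    """
    if n_factors < 1:
        raise ValueError("need at least one factor")
    return 2 * n_factors + (1 if n_factors % 2 == 0 else 3)


def quadratic_model_terms(n_factors: int) -> dict:
    """Count the terms in a full second-order (response surface) model."""
    return {
        "intercept": 1,
        "main_effects": n_factors,
        "two_factor_interactions": comb(n_factors, 2),
        "quadratics": n_factors,
    }


m = 10        # factors kept after the risk ranking
budget = 48   # runs available

terms = quadratic_model_terms(m)
full_model = sum(terms.values())
without_interactions = full_model - terms["two_factor_interactions"]
# Upper bound on interactions a 48-run design could carry; a model this
# large would be saturated, leaving no degrees of freedom for error.
max_interactions = budget - without_interactions

print("minimum DSD runs:", dsd_min_runs(m))
print("full quadratic model terms:", full_model)
print("interactions that fit in the budget (upper bound):", max_interactions)
```

The full quadratic model has more terms than the 48-run budget, which is why interactions would have to be dropped for a custom design of that size. In practice the number kept should be well below the upper bound so that some runs remain for estimating error.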
In the first paragraph of your opening remarks it sounds to me like you are concerned that, with respect to the 10 factors, a majority of them will be active. What does your process and domain expertise tell you? Do you have pre-existing process data that you can leverage using data visualization, exploratory data analysis, and simple to advanced modeling techniques to help inform your knowledge? If you truly have ZERO domain/process knowledge then I think you should go the DSD route as a FIRST experiment to build that knowledge. From there, with remaining resources, work on the optimization part of your problem using a SECOND experiment. In practice the new product/process development teams I worked on found the sequential DOE process to be the most efficient way to solve the practical problem. DSDs are first and foremost SCREENING designs.
First, thank you for your reply. Unfortunately, while I have a good feel for which of these parameters will have an impact on process outputs we constantly measure such as pH, cell growth, etc., I'm less certain of their impact on the drug product (and they could have an impact, which is why they scored higher in our risk assessment) and that's typically something we just have to investigate. We don't have that data for this specific process. So based on your reply it sounds like in this scenario your recommendation is to start with a DSD and then fill out the significant effects with more runs. More specifically, I would run the DSD, and then use the fit model tool to eliminate, one by one, effects that are not statistically significant (a cutoff of about p = 0.05, or any clear break near that value). Then I would take the remaining effects and put them in the custom designer with the remaining runs we have budgeted to get the best design possible. Does that sound similar to how you would do it? Thanks for your help!
The idea of screening is to separate the "trivial many from the vital few."
I would consider adding a few more runs above the minimum number of runs in your DSD. JMP now defaults to 4 extra runs. You might consider adding 1 or 2 more runs. This small increase will dramatically improve the power of your design.
No, do not take your knowledge from the DSD or any other initial experiment and start a subsequent experiment with Custom Design. Use the Augment Design platform. This platform uses the custom design algorithm but it also uses your existing runs. So you can incrementally improve the power and predictive performance of a design. This approach is known as sequential experimentation, which has been advocated for almost five decades.
(By the way, the DSD is a special case of an alias-optimal custom design.)
Hi @markbailey, thanks for your help. I have played with the augment design tool a little before. I would like some more clarity, though, on how to differentiate the vital few from the trivial many when I use augment design. When you first press on "Augment Design", a screen comes up that asks you to select your response and factor parameters. Let's say that only 4 out of the 10 variables come up as significant from the initial DSD experiment. Should I only select the 4 factors in this screen? Or should I select all 10, and then on the next screen, when it asks for model terms, start removing terms that aren't significant and including interaction effects and polynomial terms that I am interested in? Or do these do the same thing?
You only carry forward the factors that you decide are active. That decision might not be clear about all the original factors.
Yes, you should modify the model terms to update them based on your current knowledge or questions.
I recommend Help > Books > Design of Experiments. There are chapters about screening, DSD, and augmenting designs.
Thanks @markbailey. One more quick question on a different topic. Let's say I have 3 factors: temperature, pH, and production length. I want to see the impact on drug product produced. What would be the negatives of designing an experiment with only 2 factors (temperature and pH) and taking different time point samples to represent production length? Let's say you have 3 levels for each factor (and we'll assume full factorial). So 3 x 3 is 9 runs. For those 9 runs you take 3 time points each. When you plug this into JMP, you make it look like you actually had 27 runs, when in reality a lot of those different time points were the same condition just at different times. Are you somehow biasing the results for the other 2 factors (temp and pH) by doing it like this? Or is this acceptable and recognized as making experimentation more efficient?
It all depends on how you view production length. Is it a factor (input) to influence the outcome or is it a repeated measures design? Would you select a level when optimizing or determine a window in the design space for it? That is a factor.
Are you interested in the time course? If so, then you can use JMP Pro Functional Data Explorer, although three points is not much of a function. Or you could use another model for the Y( time ) and use model parameters for responses if you have JMP. Either way, it is part of the response.
Bias is a model thing, not a design thing. (Of course, the design must support any of the models in question.)
Efficiency is a model thing, economy is a design thing.
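One way to read the suggestion above of fitting a model for Y(time) and using its parameters as responses is sketched below. Each run's three time points are reduced to an intercept and a slope by ordinary least squares, and those two numbers then serve as the responses for that run. This is a sketch under assumptions, not the JMP workflow: the straight-line form, the function name `fit_line`, the run labels, and all the numbers are made up for illustration.

```python
def fit_line(times, values):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, values))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


# Hypothetical sampling times and made-up titers for three of the nine runs.
times = [1.0, 2.0, 3.0]
runs = {
    ("temp_low", "pH_low"): [2.0, 4.0, 6.0],
    ("temp_low", "pH_mid"): [2.5, 4.0, 6.5],
    ("temp_mid", "pH_low"): [3.0, 5.5, 7.0],
}

# One row per run: the curve parameters become the responses to model
# against temperature and pH.
summary = {run: fit_line(times, y) for run, y in runs.items()}
for run, (intercept, slope) in summary.items():
    print(run, "intercept =", round(intercept, 3), "slope =", round(slope, 3))
```

With this layout the analysis table keeps one row per physical run, so the time course stays part of the response rather than adding rows to the design.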
@markbailey It's definitely a factor. Yes, we would be trying to define process limits for it (determining a window). So you're saying in that case it would need to be its own factor in the design, and it would be improper to not include it in the design and simply take time course samples. But it's not clear to me why that is. What's the issue with taking time course samples, and then post-experimentation simply adding the data back in as if it were its own factor? In my example with the 3 factors, this would look like copying and pasting your 3 x 3 design two more times, and adding a new time column where you have t1, t2, and t3 for each set of 3 x 3 conditions. Then filling out the response column with your results, where rows with t1 have the results from your first timepoint, rows with t2 have results from your next timepoint, etc. Is it because when you fit a model there's an assumption of independence between your data points for any given factor? So when you take a time course your t3 sample is not independent of your t2?
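The copy-and-paste table described above can be sketched as follows, to make the row count versus run count explicit. The level labels and the `run_id` column are hypothetical names added for illustration.

```python
from itertools import product

# Hypothetical level labels for the two designed factors and the time points.
temps = ["T_low", "T_mid", "T_high"]
phs = ["pH_low", "pH_mid", "pH_high"]
timepoints = ["t1", "t2", "t3"]

# The 3 x 3 full factorial: nine physical runs, each given an ID.
runs = [
    {"run_id": i + 1, "temp": t, "pH": p}
    for i, (t, p) in enumerate(product(temps, phs))
]

# Stack the design once per time point, as in the copy-and-paste approach.
stacked = [{**run, "time": tp} for run in runs for tp in timepoints]

n_rows = len(stacked)
n_physical_runs = len({row["run_id"] for row in stacked})
print("rows in the analysis table:", n_rows)
print("independent physical runs:", n_physical_runs)
```

Keeping a run identifier in the stacked table preserves the information that every three rows come from the same vessel, which an analysis would need if it is to account for that grouping.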
Please do not 'put words in my mouth.' I did not say that your idea of treating time as part of the response is "improper." It is simply not what I would do. Since there is no 'what,' there is no 'why.'
You think in terms of combinations. I would say that you are design-oriented. I think in terms of estimation. I say that I am model-oriented. Both orientations work and one way does not have to explain itself or justify itself to the other way. (I work in a group that approaches experiments from both directions.) There is nothing that you have proposed that won't work as long as the design of the experiment, execution of the experimental runs, and the data analysis (model) are consistent. I start with the model. You start with the design.
The issue of independence is real. You are proposing to run a repeated measures experiment but analyze it as a completely randomized factorial design. A statistical model has some combination of fixed and random effects that should reflect how the data was generated as well as what you want to know about the response. You are creating a series of whole plots with the two fixed factors in which to observe the third factor. The 3 observations in each plot are correlated this way.
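A small worked calculation can show why the correlation matters for temperature and pH. It is a sketch under a simple assumed model that is not from the discussion above: each run contributes a shared random effect with variance `var_run`, and each time-point measurement adds independent noise with variance `var_within`. The function names and the variance values are hypothetical.

```python
def intraclass_correlation(var_run, var_within):
    """Correlation between two time points taken from the same run."""
    return var_run / (var_run + var_within)


def variance_of_level_mean(var_run, var_within,
                           runs_per_level=3, timepoints_per_run=3):
    """Variance of the mean response at one level of a whole-plot factor.

    Returns (actual, naive): 'actual' respects the run-level grouping,
    'naive' treats every row as an independent run.
    """
    n_obs = runs_per_level * timepoints_per_run
    actual = var_run / runs_per_level + var_within / n_obs
    naive = (var_run + var_within) / n_obs
    return actual, naive


# Made-up variance components, equal run-to-run and within-run variation.
var_run, var_within = 1.0, 1.0
rho = intraclass_correlation(var_run, var_within)
actual, naive = variance_of_level_mean(var_run, var_within)

print("within-run correlation:", rho)
print("actual variance of a temperature-level mean:", actual)
print("variance assumed if all 27 rows were independent:", naive)
print("understatement factor:", actual / naive)
```

Under these assumed values, treating the 27 rows as independent understates the variance of a temperature or pH level mean by a factor of two, so tests on those factors would look more significant than the nine runs justify.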