Discussions

ajoneja · Jun 8, 2023 2:08 PM

Hi community, I am having trouble modeling some data that I got from a DSD. This was a screen of 10 factors. We ran this with 3 blocks (operators) and I included the maximum number of extra runs (8). This gave me a table of 31 runs. We did some initial range-finding experiments to set reasonable bounds for all of the factors. Unfortunately we still got 4 runs where the reaction completely failed, so I have no value to put in as a response.

My first question is how these failed runs should be treated? Ideally I suppose they should be excluded, and I would still have 27 runs that did provide data. But if I exclude them from the table and try to run the Fit Definitive Screening script, nothing happens. There is no error message but the window just doesn't open. My response is "Time to Completion", with the objective of minimizing this response, and I could just put in a large number representing a time longer than the experiment duration but this seems arbitrary. The script will run though.

My second question is whether I can include replicate runs to have more confidence in the model? We ran 8 replicates of each run condition. However, if I copy and paste a run and try to run the script, again nothing happens. Right now I am just taking the average of the 8 runs and using that, but I'm wondering if it would be better to somehow include the individual values and if so, is there a way to do it with the Fit Definitive Screening script?

I can use the standard Fit Model module to solve both of these problems but it seems like I'd be missing the benefit of the DSD. Thanks for any responses! I'm on JMP 15.2. If posting the table would be helpful I'm happy to do it.

P_Bartell · Feb 15, 2022 08:45 AM

First, I have not looked at your data so I can't replicate the failure modes you describe...but I think you are out of luck. To reap the features of a DSD, you really need to have all the responses for each treatment combination. It's a very economical and information packed design for the two stage analysis as prescribed by the Fit Definitive Screening script. Here's my general suggestion, again I haven't looked at your data. But I'd give this a try at least...you can maybe back your way into the same conclusions you'd reach from a DSD analysis workflow.

You'll need to use the Fit Model platform. Include all the replicates 'as is'. Might as well...there may be useful information in the variation among those points...either statistical or practical. Since a DSD analysis workflow assumes effect sparsity to realize the benefits of a DSD approach...I'd start by fitting a main effects only model and see which effects are active. Hopefully the missing treatment combinations don't hurt too much in finding these active main effects. Once you have those active main effects, fit a model with those active main effects AND all possible two-factor interactions including those main effect terms. Fit that model...and see what it tells you.

This two step approach conceptually mimics what the Fit Definitive Screening analysis platform does...sort of. I think this approach might get you close?
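The second step of the approach above can be sketched in a few lines. This is only an illustration of how the candidate term list grows, not what Fit Definitive Screening does internally. The helper name `second_stage_terms` is hypothetical, the choice of which effects are "active" is made up, and the sketch assumes interactions are formed only among the active main effects.

```python
from itertools import combinations


def second_stage_terms(active_main_effects):
    """Return the active main effects plus all two-factor interactions among them."""
    mains = list(active_main_effects)
    interactions = [f"{a}*{b}" for a, b in combinations(mains, 2)]
    return mains + interactions


# Hypothetical stage-one result: three main effects judged active.
terms = second_stage_terms(["pH", "Mg", "Log2X1"])
print(terms)
```

With k active main effects this yields k + k(k-1)/2 terms, which is why effect sparsity matters: the second-stage model only stays estimable if few main effects survive the first step.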

I'd be interested to see what others such as @Mark_Bailey or @statman might think?

statman · Feb 15, 2022 10:28 AM

My thoughts:

1. Response variables: Can you measure something other than "time to completion"? For example, % of completion or chemistry after 'x' amount of time...in other words, other ways to measure the phenomenon.

2. This is a concern for any DOE strategy with optimality in mind. They are more susceptible to lost/missing/atypical data.

3. How to replace the missing data (some ideas on this, but the more missing data the less useful)?

- Use the mean of the data set.
- Use predicted data (hopefully these predictions were made a priori).
- Regress on the data that you do have and use the model to predict the missing data (using the Fit Model platform).
- Do all of the above and compare the results. If they somewhat agree, you have greater confidence in the resulting analysis; if not, you have to decide what you learned and what the next iteration should be.

4. I'm a bit confused by the statement "We ran 8 replicates of each run condition". Replicates or repeats? Were they independent treatments? If they truly are replicates and all 8 replicates did not react (the 4 times), what did you learn?
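The simplest of the replacement ideas in point 3, filling with the mean of the observed data, might look like the sketch below. The function name `impute` and the example times are hypothetical. Passing a different `fill` value (for instance a prediction from a fitted model) lets the same helper cover the other options so the results can be compared.

```python
from statistics import mean


def impute(responses, fill=None):
    """Replace missing responses (None) with `fill`, or with the observed mean."""
    observed = [y for y in responses if y is not None]
    if fill is None:
        fill = mean(observed)  # default strategy: mean of the runs that worked
    return [fill if y is None else y for y in responses]


# Hypothetical completion times in minutes; None marks a failed run.
times = [16, 25, None, 40, None, 25]
print(impute(times))           # failed runs filled with the observed mean
print(impute(times, fill=90))  # failed runs filled with an assumed predicted value
```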

"All models are wrong, some are useful" G.E.P. Box

Mark_Bailey · Feb 15, 2022 10:36 AM

I want to confirm that Fit Definitive Screening does not work if you have missing values, or deleted or excluded runs, in the definitive screening design. The specialized analysis depends on the fold-over structure of the design matrix when forming the model matrices for analysis. (Note that you can use Fit Definitive Screening with any fold-over design!)
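To see why dropping rows matters, here is a minimal sketch of what "fold-over structure" means in coded units: every run has a mirror-image run with all signs flipped (a center run mirrors itself). The function `is_foldover` and the small 3-factor design are illustrative assumptions, not the poster's design and not how JMP checks a table.

```python
from collections import Counter


def is_foldover(design):
    """True if each run's mirror image (all signs flipped) appears equally often."""
    counts = Counter(tuple(run) for run in design)
    return all(
        counts[tuple(-x for x in run)] == n
        for run, n in counts.items()
    )


# Illustrative fold-over design in coded units (-1, 0, +1), plus a center run.
design = [
    (0, 1, 1), (0, -1, -1),
    (1, 0, -1), (-1, 0, 1),
    (1, 1, 0), (-1, -1, 0),
    (0, 0, 0),
]
print(is_foldover(design))      # complete design
print(is_foldover(design[1:]))  # same design with one run removed
```

Removing a single run leaves its mirror without a partner, so the remaining table is no longer a fold-over design.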

I wonder if it would be possible to replace the missing values for the runs without a reaction with a very large number for Time to Completion? This recode is a numeric trick. You simply want the analysis to proceed where some runs took essentially an infinite time, but represent this with a finite number. For example, if the reaction time ranged from 5 minutes to 2 hours, then you might use a value of 1000 to 100000 instead of missing values. Does that work? I am trying to reclaim as much of your experiment as possible.
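A minimal sketch of this recode, using the range and sentinel values mentioned above, shows how strongly the choice of sentinel moves a simple summary such as the mean. The function name `recode_failures` and the individual times are hypothetical.

```python
from statistics import mean


def recode_failures(times, sentinel):
    """Replace failed runs (None) with a large finite sentinel value."""
    return [sentinel if t is None else t for t in times]


# Hypothetical times in minutes, spanning 5 minutes to 2 hours, with one failure.
times = [5, 30, None, 120]
for sentinel in (1000, 100000):
    recoded = recode_failures(times, sentinel)
    print(sentinel, recoded, mean(recoded))
```

Because the sentinel is arbitrary, it is worth rerunning the analysis with more than one value to see whether the selected effects change.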

Are the extra runs truly replicates? That is, they are not simply repeated measures of a single run. I suspect that Time to Completion requires that they are true replicates - just checking. If so, then I would definitely include the individual replicates and not the average of them. A counter argument, though, is that you could estimate the mean and the standard deviation and model these to find conditions that give you both the minimum mean time and the minimum variation in the time.
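The counter-argument above, modeling both the mean and the spread, starts from a per-run summary of the replicates. A minimal sketch, assuming the replicate values are grouped by run ID (the name `summarize_replicates` and the numbers are hypothetical):

```python
from statistics import mean, stdev


def summarize_replicates(replicates_by_run):
    """Map each run ID to (mean, sample standard deviation) of its replicates."""
    return {
        run: (mean(ys), stdev(ys))
        for run, ys in replicates_by_run.items()
    }


# Hypothetical replicate times in minutes for two runs.
data = {
    1: [16, 18, 17, 20],
    2: [40, 55, 38, 61],
}
for run, (m, s) in summarize_replicates(data).items():
    print(run, round(m, 2), round(s, 2))
```

Each summary column could then be used as its own response, one for location and one for variation.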

My final comment is that Fit Definitive Screening is a wonderful, specialized tool. It is really about model selection, and that is as far as it goes. You have many other tools for selecting the best linear model.

ajoneja · Feb 15, 2022 02:08 PM

Hi @Mark_Bailey , @statman , @P_Bartell ,

Thank you for the thoughts! I have attached my data table so you can replicate what I am seeing, if that is helpful. Right now, the cells without a value in the "Time" column are the failed runs. Without a value, the Fit Definitive Screening script doesn't run. If I exclude those rows/runs, the script still doesn't run. If I assign random values, it will run. I know it is best not to exclude any runs and that doing so will harm the model, but since I added 8 extra runs when generating the table, I was hoping that losing 4 to failures might still work. I was just surprised that the script won't even try! I have started by trying the approach that @P_Bartell suggested and I have questions there too, but that should probably be another thread and I want to spend some more time on it first. To answer some of the other questions that came up:

- Yes, I can try to use a different metric than "Time to Completion". The only one I could think of is "Time left in experiment", and try to maximize that value (it would be zero for the failed runs). This is a biochemical reaction that produces a signal that is supposed to increase with time until it reaches a maximum. In the failed runs, no signal was ever produced, so I don't have any other value to use.

- I have tried assigning values to the failed runs, such as numbers larger than the duration of the experiment, or just very large numbers (approximating a Time of infinity because it never worked). These seem to really skew the model and make all the other data points less impactful.

- To clarify, my replicates truly are repeats of the run conditions, not just multiple measurements of the same run. So in total we ran 31*8 = 248 reactions. These reactions naturally have some variability so replicates are important. The values currently in the table are the mean of the 8 runs. I thought there would be some value in including the data from each run in the model, but if I try to add extra runs to the table, the Fit Definitive Screening script doesn't run.

- To address "What did I learn from the failures": If I look at the failed runs, 3 out of 4 have a combination of Low Mg and High Log2X1. This was expected to be detrimental to the reaction, and I had considered setting up a constraint so that this combination could not be used, but the DSD didn't allow it. If I had done a custom DOE I would have made this constraint. (although, there are cases where Low Mg and High Log2X1 did work, so this is not a guaranteed failure).

For now I will try modeling this data by excluding the failed runs, including the individual replicate data, and not using the Fit Definitive Screening script.
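The alternative "Time left in experiment" response mentioned in the first bullet can be written as a small transform. This is a sketch only: the function name `time_left` and the 60-minute duration are assumptions, since the actual experiment duration is not stated in the thread.

```python
def time_left(time_to_completion, duration):
    """Return duration minus completion time; failed runs (None) score zero."""
    if time_to_completion is None:
        return 0
    return max(duration - time_to_completion, 0)


# Assumed experiment duration of 60 minutes; None marks a failed run.
duration = 60
for t in (16, 25, None):
    print(t, time_left(t, duration))
```

Unlike a large sentinel, this gives failed runs a bounded value (zero) on a scale where larger is better, so the response would be maximized rather than minimized.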

statman · Feb 15, 2022 03:57 PM

Quick question: How much of a change in Time is of scientific significance (or practical significance)?

"All models are wrong, some are useful" G.E.P. Box

ajoneja · Feb 15, 2022 04:46 PM

Anything greater than 5 min would be of interest. For our "midpoint" assay for the 3 Blocks/Operators, it was completed in 16, 25, and 25 min (which is in itself a bit concerning that one operator was somehow so much faster).

Many of these conditions are much worse than that but we were hoping to identify factors that could help us improve on that timing.

statman · Feb 15, 2022 05:32 PM

I ran a few analyses (Fit Model) by block and with block included as a fixed effect. Data sheet attached. The analyses use saturated models and have not been reduced. Yes, block effects can be quite challenging...leading to lots of questions.

"All models are wrong, some are useful" G.E.P. Box

ajoneja · Feb 16, 2022 09:59 AM

Thank you @statman ! I am fairly new to DOE modeling so trying to find my own way to the models that you scripted, but focusing on using the whole data set with block as fixed effect, and it seems like including the replicate data helps. The differences between the individual block models are certainly concerning. Thanks to everyone for the help.

ajoneja · Feb 25, 2022 09:44 AM

Hi @statman , the model you provided including all data and the block as an effect makes sense from what we know about the system. I have been trying to recreate your model but am having difficulty. In particular, I am not sure how you settled on looking at the secondary interactions and quadratic effects for a subset of the factors (pH, Mg, Log2X2, Log25X3), which are providing the nice curvature that we expected.

I have tried the strategy that @P_Bartell mentioned above - screening main effects, then crossing those with all possible combinations of factors. I have also tried using Stepwise Linear Regression in a few different ways. But nothing has led me to anything as predictive as what you provided, and I'd just like to know how you got there. Any insight into your thought process would be appreciated! Or if there is a particular help page I should read up on, please let me know.

How to treat replicates and failed runs in a definitive screening design
