JMPdiscoverer
Level III

Blocking factor when analyzing data

Hello, 

 

My question is: how do I block on a factor when analyzing data?

 

Thanks !

6 REPLIES

Re: Blocking factor when analyzing data

Your statement does not make sense to me yet. Have you completed an experiment? Did you design the experiment with blocked runs?

 

I am likely over-simplifying your situation, but you gave us very little information about a complex technique (blocking). You can add a new column to the data table and populate it with values that indicate membership in a given block. For example, if the data table contains 16 runs and the blocks are consecutive groups of 4 runs, then the new column would contain 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, and 4.

Select Analyze > Fit Model. Assign the responses to the Y role. Select the factor columns and make the appropriate terms for the factor effects. Be sure to add the block column to the Effects list. By default, the block effect is treated as a fixed effect, like the other factors. If you want to treat the block as a random effect, select it in the Effects list, click the red triangle next to Attributes, and select Random Effect.
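Outside of JMP, that block-membership column is just a grouping index. A minimal Python sketch of the 16-run example above (the run and block counts are taken directly from that example):

```python
# Build the block-membership column from the example above:
# 16 runs in consecutive groups of 4 become blocks 1, 2, 3, 4.
n_runs, block_size = 16, 4
block = [i // block_size + 1 for i in range(n_runs)]
print(block)  # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]
```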

JMPdiscoverer
Level III

Re: Blocking factor when analyzing data

Hello, 

thank you for this explanation and sorry for not giving that much information. 

So, we ran a full factorial design with 2 factors (X1 and X2) and 2 responses. The main objective is to maximize the yield (Y1) while keeping the impurities at an acceptable level (Y2 lower than 70). 

We have significant lot-to-lot variation in the starting material. 

 

According to what I have read in the JMP Community, here is the difference between random and fixed blocks (please correct me if I am wrong): 

 

  • A fixed block is treated like the other factors (the software default) and is shown in the model. That means we consider 3 factors instead of 2 in the analysis, but we exclude interaction and quadratic effects for the Lot.

 

  • A random block is used particularly when the objective is to account for the variation caused by the blocks without making any inference about the block effects (which is our case here). That means that if I include multiple lots of raw material in my design, I don't care that Lot A increases the average response by 1.5 and Lot B lowers it by 0.6, because I am not going to use those lots in the future. I only want the lots used in the design to be representative of the typical variation in raw material lots I will encounter in the future.

 

We decided to go with random blocking as you described; however, two things in JMP seem strange: 

- No ANOVA table is shown with random blocking

 

- Also, when I use the Simulator, there is no defect rate with random blocking

 

Could you please help me understand why this information is missing? 

 

Thanks ! 


 

 

statman
Super User

Re: Blocking factor when analyzing data

Sorry, a bit late to the discussion. I'm not sure I understand what you are doing. It would also be helpful if you could provide the data table, even if you code the factors and response variables for confidentiality. That way we can try to reproduce and understand the outputs you're getting from the analysis.  

 

I don't understand this statement:

 

We have variation in the lot of starting material, which we already know about; we want to account for this extra variability but not for future investigation (there is no need to add the lot as X3 in the model even if it is significant). 

 

My thoughts:

1. It seems to me you want/need to account for the incoming lot-to-lot variation, which you know may be significant?

2. If you know there is lot-to-lot variation, do you know whether it affects the yield? If it does, what are you doing to reduce the incoming lot-to-lot variation?

3. Can you measure the incoming lot?  Is there some measure of the incoming lot that you think relates to yield?  If so, why not treat that measure as a covariate?

4. I will provide a different view of the use of blocking. As you indicate, there are 2 ways to treat the block effect: as a random effect or as a fixed effect. To me, what determines which you use is the amount of understanding of what noise is confounded with the block. If you have a thorough understanding of which noise variables are confounded with the block, a more effective use of the block is as a fixed effect. When you analyze the model, you want to include the blocks AND all block-by-factor interactions. In fact, the most important effects might be the block-by-factor interactions. Why? Because you want your factor effects to be consistent over changing noise (this is the definition of robust), which, in essence, means the absence of block-by-factor interactions.  

Now, if you have not done the due diligence of identifying what noise factors are confounded with the block, then you are left with treating the block as a random effect, as a means of lowering the mean square error (which may provide for a more significant F-test).
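The "consistent over changing noise" point can be made concrete with a small numeric sketch (invented data in plain Python, not JMP output): the block shifts the overall level of the response, but the factor effect is nearly identical in both blocks, i.e. there is little block-by-factor interaction and the factor effect is robust to the block.

```python
from statistics import mean

# Hypothetical runs: a factor X at two coded levels (-1/+1) run inside
# each of two blocks (lots). Block 2 sits ~3 units lower than block 1.
runs = [
    # (block, X, y)
    (1, -1,  9.5), (1, -1,  9.7), (1, +1, 13.4), (1, +1, 13.6),
    (2, -1,  6.5), (2, -1,  6.3), (2, +1, 10.6), (2, +1, 10.4),
]

def x_effect(block_id):
    """Estimated X effect (high-level mean minus low-level mean) within one block."""
    hi = mean(y for b, x, y in runs if b == block_id and x == +1)
    lo = mean(y for b, x, y in runs if b == block_id and x == -1)
    return hi - lo

# The X effect is ~3.9 in block 1 and ~4.1 in block 2: consistent across
# blocks, so the block-by-factor interaction is small even though the
# blocks themselves differ substantially in level.
print(x_effect(1), x_effect(2))
```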

 

"Block what you can, randomize what you cannot" G.E.P. Box

 

Block what you can identify, randomize what you cannot identify.

 

"All models are wrong, some are useful" G.E.P. Box
JMPdiscoverer
Level III

Re: Blocking factor when analyzing data

Thank you for your answer! 

Sorry, it was not well explained; I will try to clarify the statement:

Our objective is to maximize protein yield. We know that every lot of starting material will have a different protein concentration, and no lot will be used again in the future. That's why we don't care that Lot A increases the average response by 1.5 and Lot B lowers it by 0.6: I am not going to use those lots in the future. 

 

For the other questions: 

1. Yes, the lot-to-lot variation is significant. It would be interesting to account for the amount of this variation, if possible, but all the experiments were already done and only 2 lots were used.

2. We know there is lot-to-lot variation, and it has an effect on the yield. We are not interested in reducing the incoming lot-to-lot variation; we need to extract the maximum amount of protein independently of the starting protein amount.

3. Yes, we do measure the amount of protein in the starting material, and it is related to the yield. However, I don't know what you mean by "treat that measure as a covariate".

4. Thanks a lot! Yes, it does make sense to add block-by-factor interactions.  

" the amount of understanding of what noise is confounded with the block." If I understood well, for our example, the lot-to-lot variation is due to different protein concentrations which influence the yield (Y1) and of course different amount of impurities which also influence the residual impurities (Y2). That means we have to treat it as a fixed ?   

Another thing to mention is that R² and adjusted R² were around 50%, which means there is still a certain amount of variability that we didn't account for in the model. 

 

And for your statement, "Now if you have not done the due diligence of identifying what noise factors are confounded with the block, then you are left with treating the block as a random effect and a means of lowering the mean square error (which may provide for a more significant F-test)": I didn't fully understand what you mean by "a means of lowering the mean square error". 

 

Thanks for all of those explanations. 

 

Attached you will find the data table, so maybe you can help me find the optimal conditions to maximize Y1 while keeping Y2 lower than 70.  

 

Best ! 

 

 

 

statman
Super User

Re: Blocking factor when analyzing data

I understand that you will not be reusing lots, but you need to account for the variation in lots in your model if you want to have a model that works on future lots.

 

If the incoming material can be measured, you can add that measurement as a continuous variable (a covariate) in the model (not as a random block). The model will then include this term for future prediction.

https://www.jmp.com/support/help/en/16.0/?os=mac&source=application&utm_source=helpmenu&utm_medium=a...
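To illustrate numerically what "treat that measure as a covariate" could look like (invented data, plain Python rather than JMP; the design is deliberately balanced so the factor and the centered covariate are orthogonal, which reduces least squares to simple projections):

```python
# Hypothetical runs: factor X1 (coded -1/+1) plus a measured, centered
# covariate c (e.g. starting-material protein with its mean subtracted).
X1 = [-1, -1, +1, +1, -1, -1, +1, +1]
c  = [ 2, -2,  2, -2,  1, -1,  1, -1]   # centered and orthogonal to X1
y  = [8.6, 5.4, 14.6, 11.4, 7.8, 6.2, 13.8, 12.2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# With a centered, orthogonal design each least-squares coefficient is a
# simple projection b = <y, x> / <x, x>:
b0 = sum(y) / len(y)           # intercept (grand mean)
b1 = dot(y, X1) / dot(X1, X1)  # factor effect per coded unit of X1
bc = dot(y, c) / dot(c, c)     # yield change per unit of the covariate

# Including c in the model ties the lot-to-lot difference to a measured
# quantity that will also exist for future lots.
print(b0, b1, bc)
```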

 

For a typical model: Y = X1 + X2 + X1*X2 + Error

where Error contains all of the degrees of freedom not assigned to the terms (including any block effect).

If, however, you are able to assign the block effect, your model is:

Y = X1 + X2 + X1*X2 + Block + Error

Since the block effect is now assigned, the Error (mean square error) is reduced, increasing the likelihood of statistically significant effects due to larger F-ratios.
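That reduction can be sketched numerically (invented, balanced data in plain Python; with a balanced design the sums of squares are orthogonal, so assigning the block simply moves its sum of squares out of the error):

```python
from statistics import mean

# Hypothetical balanced experiment: one factor X at two levels, run in
# two blocks, two replicates per cell. The blocks differ in level by ~3.
X     = [-1, -1, +1, +1, -1, -1, +1, +1]
block = [ 1,  1,  1,  1,  2,  2,  2,  2]
y     = [9.6, 9.4, 13.5, 13.7, 6.4, 6.6, 10.5, 10.3]

grand = mean(y)
ss_total = sum((v - grand) ** 2 for v in y)

def ss_effect(labels):
    """Sum of squares for one balanced (orthogonal) effect, via group means."""
    ss = 0.0
    for lvl in set(labels):
        grp = [v for v, l in zip(y, labels) if l == lvl]
        ss += len(grp) * (mean(grp) - grand) ** 2
    return ss

ss_x, ss_block = ss_effect(X), ss_effect(block)

sse_no_block   = ss_total - ss_x            # block variation stays in the error
sse_with_block = ss_total - ss_x - ss_block # block variation moved out of the error

# Assigning the block shrinks the error sum of squares dramatically here,
# which is what raises the F-ratio for X.
print(sse_no_block, sse_with_block)
```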

 

I've briefly looked at your table. I see you color-coded the columns of interest; I take it the orange ones are Lot, X1, and X2, and I see the 2 Y's. None of the other columns are explained. 

I'm sorry, but I really don't see a designed experiment, and certainly NOT a factorial on X1 and X2.

 

You could use regression to see if there are any relationships. A quick look at the data also suggests you have some unusual data points. Try Boosted Tree for Y1 and Neural Boosted for Y2.

 

"All models are wrong, some are useful" G.E.P. Box
JMPdiscoverer
Level III

Re: Blocking factor when analyzing data

Hey Statman, 

Thanks a lot for your feedback! It is clearer now. I've updated the data table (the previous one was wrong). 

We considered a full factorial design with two factors (X1 and X2) and two responses (Y1 and Y2) for each lot. For each lot, one additional experiment was performed in triplicate (the "reference" experiments). 

 

My statistics knowledge is very limited; I don't understand why we need to fit Boosted Tree and Neural Boosted models.