Solved: Re: Feedback on Custom DOE to Predict Proportion of Rejectors

Shujinko · Jun 10, 2023 4:44 PM

The Background

I recently ran a pilot study with 5 different food samples that have varied "Starch," "Sugar," and "Salt" as continuous factors. Each of the 5 food samples was tasted by 150 panelists, each of whom either accepted or rejected the sample. The percentage of rejecting panelists is what I would like to model as a function of my three continuous factors.

My pilot study yielded some interesting insights that align with what I expect, but with only 5 different samples, my modeling was limited to the "Sugar" and "Sugar*Sugar" terms. I used a GLM --> Normal --> Logit model, as I know there is curvature and I want my output to be limited to 0 - 1 since it is a proportion. I have attached my data table with the saved GLM ("Pilot Data SSS-Prej_Sugar Logit Model"). It is based on an approach Mark Bailey had recommended in a prior post for proportion data.

Next Steps

My next steps are to test more food samples that have varied "Starch," "Sugar," and "Salt" levels so that I can produce an accurate and precise multivariate model (instead of just "Sugar" and "Sugar*Sugar" terms) for estimating percentage of rejecting panelists. I would like to capture both main effects and crossed effects. To that end, I used the DOE --> Custom Design platform and set the following parameters:

Responses: Percentage Rejection, Goal to minimize, Lower limit = 0, Upper Limit = 1

Factors: "Starch" (38 - 55), "Sugar" (15 - 36), "Salt" (23 - 50) as continuous easy factors, and then I used the RSM to get crossed terms. I removed some terms such that I had 1 intercept, the 3 main effects, and 4 crossed terms.

Replicates: 5

Centerpoints: 0

Number of runs: 40 (we have the budget for that)

Random Blocks: 8

The design/table is attached as "Custom Design SSS-Prej Example"

My Questions

1. Once I run my experiments and get my data, I once again have the option of trying different models. To recap, my goal is to be able to predict how my three factors would influence the proportion of rejectors, and potentially to find minima and maxima for proportion of rejectors. I expect there to be curvature in at least two of the terms ("Sugar" and "Salt"). Are there any issues with again going with GLM --> Normal --> Logit?

2. The reason I ask is because I was following through an exercise in the "JMP Start Statistics" book using the "Reactor 20 Custom" data set for a screening experiment (to look at Percent Reacted), but the exercise used the "Standard Least Squares" + Effects Screening approach even though the output (Percent Reacted) is theoretically constrained from 0 to 1. This page describes this exercise as a homework assignment.

3. In practice I can't exactly follow the custom design levels, but I do have a library of samples that I can pick from that have varying levels of "Starch" "Sugar" "Salt" for testing. It turns out that "Sugar" and "Salt" are also moderately collinear (I don't have much control over this). See the following table ("Example Candidates for SSS-Prej Model"). Should I be concerned about the quality of my DOE and approach given the above concerns?
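One rough way to quantify the "Sugar"/"Salt" collinearity mentioned in question 3 is to compute the correlation between the two columns of the candidate library and the variance inflation factor it implies. The sketch below uses made-up candidate values (the attached table is not reproduced here), and the VIF formula shown applies only to a model with these two main effects; a model that also includes "Starch" and higher-order terms would need the full design matrix.

```python
import math

# Hypothetical (Sugar, Salt) pairs, invented for illustration only.
# They merely sit inside the factor ranges quoted in the post.
candidates = [
    (15, 24), (18, 30), (22, 28), (26, 37), (30, 41), (33, 39), (36, 49),
]

def pearson_r(pairs):
    """Pearson correlation between the two columns of `pairs`."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(candidates)
# With exactly two predictors, each one's VIF is 1 / (1 - r^2).
vif = 1.0 / (1.0 - r ** 2)
print(f"r = {r:.3f}, two-predictor VIF = {vif:.2f}")
```

A VIF close to 1 means the correlation costs little precision; larger values mean the two coefficients are harder to estimate separately.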

Thank you for any help you can provide.

Mark_Bailey · Feb 23, 2022 10:42 AM

You know the total responses. Moreover, you know "accepted" and "rejected" counts. You can therefore use binary logistic regression or a GLM with a binomial distribution and the logit link function.

It is OK if you can't follow the treatment exactly. For example, if it calls for Sugar = 15 and it is known to be 17, then record the responses and update the factor setting for that run. This way, the regression will have all the correct levels.
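For readers outside JMP, here is a minimal sketch of what a binomial GLM with a logit link does with rejected/total counts. It fits logit(p) = b0 + b1*z + b2*z^2 by Newton-Raphson in plain Python, where z is a centered and scaled "Sugar" value. The counts, the centering constants, and the function names are all hypothetical and chosen only for illustration; this is not the pilot data and not JMP's implementation.

```python
import math

def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def fit_binomial_logit(X, events, trials, iters=25):
    """Newton-Raphson for grouped binomial data with a logit link."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        p = [inv_logit(sum(b * x for b, x in zip(beta, row))) for row in X]
        # Score vector: X^T (events - trials * p)
        grad = [sum(row[j] * (y - n * pi)
                    for row, y, n, pi in zip(X, events, trials, p))
                for j in range(k)]
        # Information matrix: X^T diag(trials * p * (1 - p)) X
        info = [[sum(row[i] * row[j] * n * pi * (1 - pi)
                     for row, n, pi in zip(X, trials, p))
                 for j in range(k)] for i in range(k)]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    return beta

# Hypothetical data: 5 samples, 150 panelists each (NOT the pilot results).
sugar = [15, 20, 25, 30, 36]
rejected = [60, 33, 21, 30, 54]
trials = [150] * 5

# Use the Sugar level actually achieved in each run, centered and scaled.
z = [(s - 25) / 10 for s in sugar]
X = [[1.0, zi, zi * zi] for zi in z]

beta = fit_binomial_logit(X, rejected, trials)
z_new = (27 - 25) / 10
p_new = inv_logit(beta[0] + beta[1] * z_new + beta[2] * z_new ** 2)
print("coefficients:", [round(b, 3) for b in beta])
print("predicted proportion of rejectors at Sugar = 27:", round(p_new, 3))
```

Because the model sees the counts and panel sizes rather than a single percentage per sample, a proportion based on 150 panelists carries more weight than one based on 15. The `sugar` list is also where a realized level (17 instead of a planned 15, say) would be recorded.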

Shujinko · Feb 22, 2022 04:41 PM

Actually, regarding Factors: "Starch" (38 - 55), "Sugar" (15 - 36), "Salt" (23 - 50), I didn't remove any terms, so it's just the full response surface. I would have edited my prior post but I don't see that capability.
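For reference, a full response surface model in three continuous factors has 10 terms: the intercept, 3 main effects, 3 two-factor interactions, and 3 quadratic terms. A small sketch that enumerates them (the function name is just illustrative):

```python
from itertools import combinations

factors = ["Starch", "Sugar", "Salt"]

def response_surface_terms(names):
    """List the terms of a full quadratic (response surface) model."""
    terms = ["Intercept"]
    terms += list(names)                                       # main effects
    terms += [f"{a}*{b}" for a, b in combinations(names, 2)]   # interactions
    terms += [f"{n}*{n}" for n in names]                       # quadratics
    return terms

terms = response_surface_terms(factors)
print(len(terms), "terms:")
for t in terms:
    print(" ", t)
```

With 40 runs planned, that leaves runs to spare beyond the 10 model parameters, some of which are used by the blocks.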

Shujinko · Feb 23, 2022 02:17 PM

Thanks again Mark. So some additional context that I neglected to mention in this example:

I have a conditional question for the rejectors only, which asks them why they rejected the sample. The rejectors are prompted to pick a reason.

I am particularly interested in the proportion of total panelists who rejected on the basis of flavor. Thus if out of 200 panelists total, 30 of them rejected the product and 10 of those indicated flavor, I get a proportion of 10/200 = 2%.

As I change my product flavor as a function of my ingredients, I expect the proportion of flavor rejectors to change, e.g. 5%, 10%, etc.

Ultimately I want to be able to take an untested combination of Starch/Sugar/Salts (within our range of the training set) and predict a proportion of hypothetical panelists who would reject on the basis of flavor. This could save us a lot of time and money in not having to do actual product testing.

My question here is whether there are any differences between that strategy and doing, say, a multinomial logistic regression on the raw counts with "did not reject," "rejected not on the basis of flavor," and "rejected on the basis of flavor." Alternatively, I could look at just the group of rejectors and model whether they "rejected not on the basis of flavor" or "rejected on the basis of flavor."

I did run the multinomial logistic regression with my counts and see in the prediction profiler that the proportion of "rejectors on the basis of flavor" changes continuously as a function of the Starch/Sugar/Salt content, and it matches the actual proportions. However, what I like about turning all my raw data into group proportions and doing GLM --> Normal --> Logit is that I get confidence intervals of prediction (e.g. 9%, 7-11% rejection).

Also, regarding the "JMP Start Statistics" exercise I described above (the screening experiment on the "Reactor 20 Custom" data set, looking at Percent Reacted): the exercise used the "Standard Least Squares" + Effects Screening approach even though the output (Percent Reacted) is theoretically constrained from 0 to 1. This page describes this exercise as a homework assignment. Is there a reason why the exercise recommends standard least squares, which allows Percent Reacted values outside of 0 - 100%, rather than a different modeling approach?
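To make the bounds issue concrete, the sketch below pushes the same linear predictor through an identity link (what least squares on the raw proportion amounts to) and through an inverse logit. The coefficients are hypothetical and not fitted to any data set in this thread; the point is only that the identity-link prediction can leave 0 - 1 while the logit-link prediction cannot.

```python
import math

def inv_logit(eta):
    """Map any real-valued linear predictor into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical intercept and slope, chosen only for illustration.
b0, b1 = 0.5, 0.08
xs = [-10, 0, 5, 10]

for x in xs:
    eta = b0 + b1 * x
    print(f"x = {x:>3}: identity link -> {eta:.2f}, logit link -> {inv_logit(eta):.3f}")
```

Whether that matters in practice depends on how close the observed proportions are to 0 or 1; in the middle of the range the two approaches often give similar answers.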

Shujinko · Feb 23, 2022 02:18 PM

Correction to the math above: if out of 200 panelists total, 30 of them rejected the product and 10 of those indicated flavor, I get a proportion of 10/200 = 5%, not 2%.
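The corrected arithmetic, and how the same counts map onto the three multinomial categories and onto the rejector-only view, can be written out as a short sketch using the 200 / 30 / 10 example from the post:

```python
total = 200        # panelists
n_rejected = 30    # rejected the product
n_flavor = 10      # of the rejectors, cited flavor

p_reject = n_rejected / total                 # 0.15
p_flavor_given_reject = n_flavor / n_rejected # about 0.333, rejectors only
p_flavor = n_flavor / total                   # 0.05

# The overall flavor-rejection proportion is the product of the two stages.
assert abs(p_flavor - p_reject * p_flavor_given_reject) < 1e-12

# The same data as three mutually exclusive multinomial categories.
counts = {
    "did not reject": total - n_rejected,
    "rejected not on the basis of flavor": n_rejected - n_flavor,
    "rejected on the basis of flavor": n_flavor,
}
for label, count in counts.items():
    print(f"{label}: {count}/{total} = {count / total:.0%}")
```

The direct proportion, the multinomial category, and the two-stage (reject, then reason) view all describe the same 10 out of 200; they differ in which other outcomes are modeled alongside it.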

Mark_Bailey · Feb 24, 2022 11:08 AM

First, regardless of your choice of model, I would verify the selected model with new data. Typically, find factor settings that predict the ideal outcome and also an undesirable outcome. (A realistic model should provide accurate predictions over the entire space and that is what verification is for.) Test these settings in the field.
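A very simple way to screen a verification run is to compare the observed proportion at the new setting with the model's prediction, scaled by the binomial standard error. The sketch below is only a rough check under stated assumptions: the numbers are hypothetical, the normal approximation is used, and the uncertainty in the model's own prediction is ignored, so it understates the total uncertainty.

```python
import math

def verification_z(predicted_p, n_panelists, observed_events):
    """Standardized gap between observed and predicted proportions.

    Uses the binomial standard error under the predicted proportion and
    ignores uncertainty in the prediction itself (a simplification).
    """
    observed_p = observed_events / n_panelists
    se = math.sqrt(predicted_p * (1.0 - predicted_p) / n_panelists)
    return (observed_p - predicted_p) / se

# Hypothetical verification run, not data from the thread.
z = verification_z(predicted_p=0.09, n_panelists=150, observed_events=19)
print(f"standardized difference: {z:.2f}")
```

Large absolute values at either the "ideal" or the "undesirable" setting would suggest the model does not hold across the whole factor space.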

There are many ways to model responses and predictors. None of them are perfect. They all have assumptions that should be reasonably met in order for the model to serve as expected. So the choice might depend on an information criterion, a test statistic, historical precedent, personal comfort, or any number of other reasons.