The rating process used here is very similar to a Likert scale, and indeed 5 levels is the recommended minimum, with 7 or 9 being optimal (a good compromise between differentiation and level of detail without confusion):
Duane F. Alwin and Jon A. Krosnick, "The Reliability of Survey Attitude Measurement: The Influence of Question and Respondent Attributes", Sociological Methods & Research, vol. 20, no. 1, 1991, pp. 139-181
This optimal number of levels was more recently tested through numerical simulation in this study:
Alberto Maydeu-Olivares, Amanda J. Fairchild and Alexander G. Hall, "Goodness of Fit in Item Factor Analysis: Effect of the Number of Response Alternatives", Structural Equation Modeling: A Multidisciplinary Journal, vol. 24, no. 4, 2017, pp. 495-505
In the case study presented here, with little information or results available, we don't know whether the evaluators' ratings agree with each other.
Depending on the number of evaluators, their expertise and training, and the "real" differences between samples, the agreement between evaluators can vary widely.
- A first step could be to test the agreement between judges through the Contingency analysis - Agreement platform, as the multiple 2-by-2 comparisons could help spot a judge who rates samples very differently from the others. The ratings from such a judge could then be discarded if they are too different. The Kappa statistic could help assess the level of agreement between judges. It might be cumbersome on this use case, however, as 9 judges represent 36 pairwise comparisons (one per pair of judges); see the sketch after this list.
- Then, depending on the level of agreement, one could choose either to summarize the data/ratings (keeping only negative, neutral and positive classes) or to use all classes. I have had good results summarizing the data on use cases where agreement was not very good, as it helped to quickly identify good and bad factors.
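The pairwise Kappa analysis above would be done directly in JMP, but to illustrate the idea, here is a minimal Python sketch of the same workflow: all 36 pairwise Kappas for 9 judges, a per-judge mean Kappa to flag a discordant judge, and agreement recomputed after collapsing the 5 levels into the 3 summary classes. The ratings, judge labels and the 5-to-3-level mapping below are hypothetical placeholders, not data from this case study:

```python
import itertools
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical ratings: 9 judges x 20 samples on a 5-level scale (1..5)
judges = [f"J{i}" for i in range(1, 10)]
ratings = pd.DataFrame(rng.integers(1, 6, size=(20, 9)), columns=judges)

# All 2-by-2 comparisons: C(9, 2) = 36 pairs of judges
pairwise = {
    (a, b): cohen_kappa_score(ratings[a], ratings[b])
    for a, b in itertools.combinations(judges, 2)
}
print(f"{len(pairwise)} pairwise comparisons")

# Mean Kappa per judge: a judge whose mean agreement is much lower
# than the others is a candidate for exclusion
mean_kappa = {
    j: np.mean([k for pair, k in pairwise.items() if j in pair])
    for j in judges
}
print(sorted(mean_kappa.items(), key=lambda kv: kv[1]))

# Summarize the 5 levels into 3 classes (negative / neutral / positive)
# and recompute agreement on the coarser scale
collapse = {1: "negative", 2: "negative", 3: "neutral",
            4: "positive", 5: "positive"}
summarized = ratings.replace(collapse)
pairwise_3 = {
    (a, b): cohen_kappa_score(summarized[a], summarized[b])
    for a, b in itertools.combinations(judges, 2)
}
print(np.mean(list(pairwise.values())), np.mean(list(pairwise_3.values())))
```

For ordinal scales like this one, cohen_kappa_score(..., weights="linear") penalizes near-misses less than complete disagreements, which may be more appropriate for 5-level ratings than the unweighted statistic.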
Another option to consider is to split the global evaluation into several smaller evaluations; if judges evaluate taste, you could orient the assessment by questioning them along several directions:
Compared to the reference, are samples sweeter? Saltier? More sour? More bitter? ... with Yes/No/Neutral responses.
Instead of having one 5-level scale to evaluate taste, the judges rate 4 different aspects on a simple 3-level scale, which can help identify the tradeoffs and differences between formulations more precisely.
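As a rough sketch of what the split evaluation could look like once collected (judge names, samples and values below are hypothetical; -1/0/+1 code less than / same as / more than the reference):

```python
import pandas as pd

# Hypothetical split responses: one 3-level answer per taste direction,
# coded -1 (less than reference), 0 (neutral), +1 (more than reference)
responses = pd.DataFrame(
    {
        "judge":  ["J1", "J1", "J2", "J2"],
        "sample": ["A", "B", "A", "B"],
        "sweet":  [1, 0, 1, -1],
        "salty":  [0, 1, 0, 1],
        "sour":   [-1, 0, 0, 0],
        "bitter": [0, -1, 0, -1],
    }
)

# Averaging per sample shows in which direction each formulation
# departs from the reference, attribute by attribute
print(responses.groupby("sample")[["sweet", "salty", "sour", "bitter"]].mean())
```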
Hope this answer helps,
Victor GUILLER
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)