The rating process used here is very similar to a Likert scale, and indeed 5 levels is the recommended minimum, with 7 or 9 being optimal (a good compromise between differentiation and level of detail without confusion):
Duane F. Alwin and Jon A. Krosnick, "The Reliability of Survey Attitude Measurement: The Influence of Question and Respondent Attributes", Sociological Methods & Research, vol. 20, no. 1, 1991, pp. 139-181
This optimal number of levels was more recently tested through numerical simulation in this study:
Alberto Maydeu-Olivares, Amanda J. Fairchild and Alexander G. Hall, "Goodness of Fit in Item Factor Analysis: Effect of the Number of Response Alternatives", Structural Equation Modeling: A Multidisciplinary Journal, vol. 24, no. 4, 2017, pp. 495-505
In the case study presented here, with little information or results available, we don't know whether the evaluators' ratings agree with each other.
Depending on the number of evaluators, their expertise and training, and the "real" differences between samples, the agreement between evaluators can vary widely.
- A first step could be to test the agreement between judges through the Contingency analysis - Agreement platform, as the multiple 2-by-2 comparisons could help spot a judge who rates samples very differently from the others. The ratings from such a judge could then be discarded if they are too different. The Kappa statistic could help assess the level of agreement between judges. It might be cumbersome on this use case, however, as 9 judges represent 36 pairwise comparisons (one per pair of judges); see the sketch after this list.
- Then, depending on the level of agreement, one could choose either to summarize the data/ratings (keeping only negative, neutral and positive classes) or to use all classes. I have had good results summarizing the data on use cases where agreement was not very good, as it helped to quickly identify good and bad factors.
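The pairwise Kappa analysis above would be done directly in JMP, but to illustrate the idea, here is a minimal Python sketch of the same workflow: all 36 pairwise Kappas for 9 judges, a per-judge mean Kappa to flag a discordant judge, and agreement recomputed after collapsing the 5 levels into the 3 summary classes. The ratings, judge labels and the 5-to-3-level mapping below are hypothetical placeholders, not data from this case study:

```python
import itertools
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical ratings: 9 judges x 20 samples on a 5-level scale (1..5)
judges = [f"J{i}" for i in range(1, 10)]
ratings = pd.DataFrame(rng.integers(1, 6, size=(20, 9)), columns=judges)

# All 2-by-2 comparisons: C(9, 2) = 36 pairs of judges
pairwise = {
    (a, b): cohen_kappa_score(ratings[a], ratings[b])
    for a, b in itertools.combinations(judges, 2)
}
print(f"{len(pairwise)} pairwise comparisons")

# Mean Kappa per judge: a judge whose mean agreement is much lower
# than the others is a candidate for exclusion
mean_kappa = {
    j: np.mean([k for pair, k in pairwise.items() if j in pair])
    for j in judges
}
print(sorted(mean_kappa.items(), key=lambda kv: kv[1]))

# Summarize the 5 levels into 3 classes (negative / neutral / positive)
# and recompute agreement on the coarser scale
collapse = {1: "negative", 2: "negative", 3: "neutral",
            4: "positive", 5: "positive"}
summarized = ratings.replace(collapse)
pairwise_3 = {
    (a, b): cohen_kappa_score(summarized[a], summarized[b])
    for a, b in itertools.combinations(judges, 2)
}
print(np.mean(list(pairwise.values())), np.mean(list(pairwise_3.values())))
```

For ordinal scales like this one, cohen_kappa_score(..., weights="linear") penalizes near-misses less than complete disagreements, which may be more appropriate for 5-level ratings than the unweighted statistic.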
Another option to consider is to split the global evaluation into several smaller evaluations; if judges evaluate taste, you could orient the assessment by questioning them along several directions:
Compared to the reference, are samples sweeter? Saltier? More sour? More bitter? ... with Yes/No/Neutral responses.
Instead of having one 5-level scale to evaluate taste, the judges rate 4 different aspects on a simple 3-level scale, which can help identify the tradeoffs and differences between formulations more precisely.
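As a rough sketch of what the split evaluation could look like once collected (judge names, samples and values below are hypothetical; -1/0/+1 code less than / same as / more than the reference):

```python
import pandas as pd

# Hypothetical split responses: one 3-level answer per taste direction,
# coded -1 (less than reference), 0 (neutral), +1 (more than reference)
responses = pd.DataFrame(
    {
        "judge":  ["J1", "J1", "J2", "J2"],
        "sample": ["A", "B", "A", "B"],
        "sweet":  [1, 0, 1, -1],
        "salty":  [0, 1, 0, 1],
        "sour":   [-1, 0, 0, 0],
        "bitter": [0, -1, 0, -1],
    }
)

# Averaging per sample shows in which direction each formulation
# departs from the reference, attribute by attribute
print(responses.groupby("sample")[["sweet", "salty", "sour", "bitter"]].mean())
```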
Hope this answer helps,
Victor GUILLER
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)