DOE and Consumer Research: Tackling Consumer Preference Variability
One of the great things about humans is that we are all unique. Diversity has benefits all around us, but variation in preferences with regard to consumer goods makes it challenging to predict what will delight the greatest number of people. The use of design of experiments (DOE), facilitated by JMP, is critical to designing formulations with the most appeal.
The objective of this presentation is to interactively present methods to optimize formulations for the greatest consumer liking. Two DOE approaches from a real food formulation case study will be presented: an 18-run definitive screening design (DSD) and an 18-run space-filling design modeled using SVEM and neural networks.
To prepare data for modeling, multidimensional scaling is demonstrated to remove anomalous participants’ data. Participant clusters built using hierarchical clustering are used when fitting models using fit definitive screening and SVEM neural networks. The power of the JMP profiler will highlight how consumer preferences differ by cluster. Text Explorer is used to show how to verify insights gained through modeling by exploring verbatim comments by study participants. Lastly, insights gained from each experimental design and modeling approach are compared along with limitations of each. Attendees are presented with a more information-rich alternative to traditional DOE and consumer testing strategies.
Thank you. I'm going to be talking to you today about DoE and consumer research: tackling the challenge of consumer preference variability. I'm wondering if many in the audience have ever played the game Oregon Trail. To be familiar with Oregon Trail, you probably have to be around my age and have grown up in the United States. Most kids played Oregon Trail on the computers when they went to computer lab at school.
It's basically a very old computer game where you're a group of people trying to cross the plains in the United States and get to Oregon, and along the way you have a lot of challenges. One of the things you have to do is cross a river. This is a screenshot from the game where we come across the Kansas River in 1848. We get there, and we have a decision to make. It says that it's cold outside. It's March, so that makes sense. The river width is 628 feet, the depth is 3.4 feet. We have five options. We can ford the river, float it across, take a ferry, wait to see if things get better, or just get information. What is your choice?
I think most of us know better than just to think, "Okay, it says the river depth is 3.4 feet." Does that mean it's always 3.4 feet? Or are there spots where it's 6 feet? Are there spots where it's 2 feet? Are there spots where it's 10 feet? In which case, if you're trying to ford the river and you come across a ten-foot spot, you might be in a bad, bad situation.
I think intuitively, we know that using the average value is not always the best approach. Launching a new product is like crossing a river in Oregon Trail. Except a lot of times we'll do consumer testing, and a lot of companies use the average consumer score because, to be honest, it's easier. But unfortunately, not all consumers like the same things, or like the same things for the same reasons.
No matter how well you try to screen for only the people you're looking for when you're doing your consumer test, you still have people who prefer different things. What we would prefer as formulators, people in R&D like myself, is for the consumers to be homogenous in what they prefer, but that's not usually the case. If they were homogenous, then we could just develop one product and we'd be good. Instead, people have different preferences, and that's why the grocery store is the way it is today.
What am I going to show you? I want to show you how you can use JMP to get more information out of consumer testing data, especially when you're using DoE, to learn more about consumers and to try to optimize formulations.
As a secondary objective, we're going to take a look at how well different models and experimental designs can predict future consumer liking scores. We have two examples: a definitive screening design, and a space-filling DoE that we modeled using SVEM and neural networks.
I'll give you a bit of background on why in the world we were doing this. We were asked last year to improve a product. It's an existing product, and management comes to us and says, "You know what? We think it can be better. Go make it better."
The timeline that we were given didn't make us feel super comfortable with being able to do a lot of sequential DoE, so we decided to run a six-factor 18 run definitive screening design, and then made those samples and submitted them for consumer testing to see if we could model liking and then come up with the right formula.
After we modeled the data, we had three solutions that we thought could work and could beat the control in a consumer test. A subsequent test was done with those three formulas along with the control formula, and basically the goal was to find at least one of those three that's better than the control.
We had high hopes because the model was telling us these should score really, really well. It turned out that they were only marginally better than control and didn't give us warm and fuzzy feelings about whether we should move forward with those formulas.
It was a bit of a letdown, but we still learned a lot. The project's already over, but a year later, we were thinking about how we do our work in R&D and how we gain confidence in new approaches. Because when you're under the gun, when management is saying, "Make it better, do it faster," you're going to go to what you know, what you've done, what's more established.
What I wanted to do in this project was turn back the clock and say, "What if we had done something different instead of a definitive screening design?" Could we get more out of this consumer testing data that would help us? Because the model said we should be able to get super high scores, and the truth of the matter was there seemed to be a ceiling, and that ceiling was limiting us.
What did we do? We had two tests. There's one in April 2023 that was a definitive screening design. That was the original 18 formulas. Then a year later, again in April, we had a space-filling design, also 18 formulas. To make it as fair as we could, both DoEs had the exact same factors and the same factor ranges. We were trying not to take learnings from the first DoE, apply them to the second one, and bias the results.
Then we built models from those DoEs, and we wanted those models to predict this validation test right here. None of these four formulas were submitted in either of those other two tests.
In more detail, what am I going to show you? One thing we won't show too much of… I'm not going to spend too much time on the definitive screening design, but what we will show is we're going to take a look at the consumer liking data, and we're going to explore what can we learn from it by clustering the consumers, the panelists that came in.
For anybody who isn't familiar with traditional consumer testing, you will have a central location where the test is being performed. You'll prescreen who you want to take the test, so users of your product or the right demographics that you're looking for, they'll come in, and they'll taste the products, they'll rate them, and you'll take those scores and do what we're going to do here in a minute.
In these tests, I believe we were using somewhere around 120 consumers in both tests. We're going to use hierarchical clustering and multidimensional scaling to explore anything interesting we can get out of, who is taking this test and what are they looking for.
Then after modeling the data from the space-filling design with those neural networks, we will look at the profiler and see whether we can better understand what drives consumer preference. I'll talk about using the average value for all consumers versus clustering the consumers, using the cluster averages, and seeing what we could see.
Then the last thing we'll touch on, like I said, is the secondary objective: we're going to look at the prediction performance of the definitive screening design compared to the space-filling design on that validation test set. How well did they do? We'll have to take that with a small grain of salt because we only had four samples to test against. But even still, there are some learnings there.
With that, for the rest of this presentation I'm just going to be showing you things in JMP, because I personally like it when others show me how they did things in JMP, and I find that more powerful.
Let's minimize this. Just for reference, here's the definitive screening design. These are the factors right here. We had ingredient A, B, C, D, E and there was an attribute that we said was either going to be low or it was going to be high. When we do the fit definitive screening, we come up with four out of the five ingredients are significant and the attribute. Ingredient A, didn't seem to matter much in anything that was tested.
At the end, here's the profiler. You can see that, for the most part, the more of this stuff you put in there, the more people like it. There are a few interactions. For example, if I increase C and you look at ingredient D, it kind of flattens. That was where we got that model from.
Let's talk about the… We're going to talk first about the consumer testing data from the space filling design. That's what this is. There are a lot of other questions that we ask people, but for the purposes of this talk, I'm going to focus on overall opinion because that's the one that we care the most about.
The first question that we ask after they taste the product is, how much do you like it? It's on a nine-point scale. It's called the hedonic scale. One is dislike extremely, nine is like extremely. We're looking for nines; the higher the score, the better.
For reference, typically we'll see good product score in the low sevens, marginal products will be in like the mid-6s, and then below six is something you should really, really be concerned about.
The first question we wanted to ask with this consumer testing data was, is our population of consumers homogenous, or do they have different preferences? What can we learn? If you go to Analyze, Clustering, Hierarchical Cluster, we can figure that out. I'm just going to go to the script here first.
What we have here is a heat map of all the panelists. Down here on this horizontal axis, we have the overall liking for each sample, so there were 18 samples. Then on the vertical axis, that's each individual consumer, what they rated, like this person right here, they rated sample one a seven. They gave sample two a two, so they didn't like that one, but they loved sample three. They gave it a nine.
There are a few things we can recognize just by using this heat map, because the colors show them well. I like green, so that's why I did green here, but you can pick different colors. You can see a dark shade right here. These are people who scored pretty much everything a nine. They either loved everything we put in front of them, or they were just clicking through, one of the two.
We do ask people to tell us, in verbatim comments, why they liked it or why they disliked it, and that does help us to narrow down whether this person was really taking the test or just clicking through so they could get their compensation. I can say I went through these, and these are people who truly loved everything we put in front of them. Then this group down here, they were just hesitant to give everything a nine, but they basically gave everything an eight.
When we do the clustering, you can say how many clusters you're looking for. I'm definitely not an expert in clustering, but when we looked at this data, it seemed like there were about three clusters.
Okay. When we look at this heat map, we can see people who rated everything high, the dark greens, we see the light greens, people that rated everything also pretty high, but just, they didn't want to use nines, they used eights instead.
When we did the clustering, we found that there are three clusters, and they're colored here with red, green, and blue. You can see that in the red, there's a lot of dark. The green is in the middle; there are a lot more light colors in there. Then in the blue, there's even more light coloring.
If we look at this cluster summary here, the cluster means, you can see it visually. This is the mean score per cluster. Cluster 1, you can see, is always scoring pretty much everything higher. They're a lot more forgiving, I guess we could say. They like pretty much everything we put in front of them, which is great. Cluster 2, you see, is mid-range. Then Cluster 3, they're harder to please, or they're just more discerning or something.
What's interesting is when they cross, that's when we can tell that there's something different in their preferences, one cluster versus another, so that clued us in.
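For readers who want to try the same idea outside JMP, here is a minimal sketch of clustering panelists on their rating profiles and computing per-cluster sample means. The ratings are made up for illustration, and Ward linkage and the three-cluster cut are assumptions; the talk does not say which linkage method was used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up ratings: rows are panelists, columns are samples (9-point hedonic scale).
ratings = np.array([
    [9, 9, 8, 9],   # likes everything
    [9, 8, 9, 9],
    [7, 4, 7, 3],   # prefers samples 1 and 3
    [8, 3, 7, 4],
    [3, 7, 4, 8],   # prefers samples 2 and 4
    [4, 8, 3, 7],
])

# Ward linkage on Euclidean distances between rating profiles (an assumption).
tree = linkage(ratings, method="ward")

# Cut the tree into three clusters; labels are integers from 1 to 3.
labels = fcluster(tree, t=3, criterion="maxclust")

# Mean score per sample within each cluster, analogous to a cluster means table.
cluster_means = {
    int(c): ratings[labels == c].mean(axis=0) for c in np.unique(labels)
}
for c, means in cluster_means.items():
    print("cluster", c, np.round(means, 2))
```

Plotting each cluster's row of means against sample number gives the kind of line chart described above; where two clusters' lines cross, their preferences differ.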
The other thing we wanted to look at with these consumers is whether there are any other anomalies that we should be concerned about in the data we're given, because data cleanup is probably 80% of what we have to do when we're talking about data science. What we wanted to look at was multidimensional scaling, because that gives us a map of which consumers are close to each other and which are farther away from each other.
To do that, if you go to the red triangle and choose Save Distance Matrix, what we get is the matrix of how far away one panelist or consumer is from another. Then to do the multidimensional scaling, go to Multivariate Methods, Multidimensional Scaling. Our Y columns are all these consumers, and it'll take just a couple of seconds to do this. Then we'll take a look and see whether there are any panelists that are a lot different from other people.
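As a rough illustration of turning a panelist distance matrix into a two-dimensional map, here is a minimal sketch using classical (metric) multidimensional scaling in NumPy. This is a simple stand-in and may not match the algorithm JMP uses; the ratings are made up, and `classical_mds` is a hypothetical helper name.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Embed points in n_components dimensions from a square distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared distances to get an inner-product matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * scale

# Made-up ratings: five similar panelists and one who uses the whole scale.
ratings = np.array([
    [7, 7, 8, 7],
    [7, 8, 7, 7],
    [8, 7, 7, 7],
    [7, 7, 7, 8],
    [7, 7, 7, 7],
    [1, 9, 2, 9],
], dtype=float)

# Pairwise Euclidean distances between panelists.
diff = ratings[:, None, :] - ratings[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

coords = classical_mds(dist)

# Distance of each panelist from the center of the map; large values are
# candidates for a closer look, not automatic exclusions.
radius = np.linalg.norm(coords - coords.mean(axis=0), axis=1)
print(np.round(radius, 2))
```

Panelists far from the bulk of the map are worth inspecting by hand, which is what the walkthrough that follows does with the raw scores and verbatims.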
Here's our map that we just made. A couple of things jump out to me, at least. First of all, most people fall right here. That makes sense. When you're looking at the average consumer and how they use the scale, how they rate the products, or how much they like things, it makes sense that most will be here.
There's a weird pattern right here. If we want to explore these patterns, we can highlight them, then highlight the Name column and say Row Selection, Select All Matching Cells. It'll go to all the other data tables I have that have a Name column and select those panelist numbers, 002, 1027.
This is the original data set that has those panelist names, and now I can see what were they scoring. This person gave it 9, 9, 9, 7, 9, 9 and all 9s pretty much. That's interesting. We go to the next group that's highlighted here. 9, 9, 8, 9, 8, 9, 8, 9. They like everything, and they like it a lot. Go to the next group, 8, 7, 9, 9, 9, 8, 9.
You'll notice that what I have on the right here is the dislike verbatim. In this column, they're saying what they dislike about it. Nothing, nothing, nothing, nothing, nothing, nothing. We see a pattern with these consumers. That's what this is telling us. These are the people who scored everything the same that we saw previously, the dark green bands. We caught that here in the multidimensional scaling.
Then we also have these people who are way out in, I guess, right field. What's up with them? If we do the same thing here, we're going to select all matching cells. Then to find them, I just do Next Selected. I see something completely different: 2, 7, 3, 1, 9, 7, 2, 6, 9, 2. I was having a tough time figuring out what makes this person different. If I go to the next one that's highlighted here, it's 4, 7, 3, 2, 5, 3, 1, 6, 1, 1. Once again, it seems like just a random pattern, but when I go into the verbatims, it makes sense. When they gave it a nine, they said there was nothing they disliked. When they gave it a one, they said it tasted of spoiled ingredient B. I can guarantee that these were not spoiled, but we can see that their ratings are genuine.
Then somebody told me… I was trying to figure out what makes these people different. Somebody suggested that I look at how they're using the scale, and you notice that this person is using the entire scale. I personally thought that all these products were pretty good. Some were better than others, but not to the point where I'd say I dislike it extremely. It's the worst thing I've ever tasted and a nine saying, this is the best thing I've ever tasted.
To figure this out, we make a summary table of the overall opinion. I want the standard deviation and the range, with Name as the group. We do that, and now I have a data table that has each panelist and the standard deviation of how they rated all the samples. When you sort this descending, we can see, let's say, our top 10 most variable panelists here.
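The per-panelist summary described here can be sketched in plain Python as follows. The panelist IDs and scores are made up for illustration.

```python
from statistics import stdev

# Made-up overall-opinion scores per panelist (9-point hedonic scale).
scores = {
    "P001": [9, 9, 9, 7, 9, 9],   # rates nearly everything the same
    "P002": [2, 7, 3, 1, 9, 7],   # uses the entire scale
    "P003": [7, 6, 7, 8, 7, 6],   # narrow mid-range use
}

# One row per panelist: sample standard deviation and range of their ratings.
summary = [
    {"name": name, "std_dev": stdev(vals), "range": max(vals) - min(vals)}
    for name, vals in scores.items()
]

# Sort descending so the most variable scale users come first.
summary.sort(key=lambda row: row["std_dev"], reverse=True)

most_variable = [row["name"] for row in summary[:2]]
for row in summary:
    print(row["name"], round(row["std_dev"], 2), row["range"])
```

A high standard deviation here reflects how a panelist uses the scale, which is not by itself a reason to exclude their data.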
Again, if we do… I highlight name, and if I say, select all matching cells and look what's highlighted here. All of these people, what's differentiating them is not necessarily what they like or what they don't like, it's just how they use the scale. This was just a long way to… Just a way for us to have confidence that the data we have is good and that the people were doing what we asked them to do.
If, let's say, I had found out some reason that was obvious to where their data wasn't good, then we'd exclude it. But there's no reason why we should take it out, so we left it in there. That's just a little demo of how we use our clustering and multidimensional scaling.
Then the other thing I wanted to share here on clustering is what I want at the end of the day. I'll show you the DoE data table. This is the space-filling design that we did. These are the six factors. What I needed was a column with the overall liking for Cluster 1, one for Cluster 2, one for Cluster 3, and then one for overall.
To get these three, one easy way was to take this cluster means table. That's from the summary. If you right-click and say Make into Data Table, it comes up with... not exactly what I want. It comes up with every sample mean as a column. But I want the clusters to be columns and the samples to be rows. I want to switch rows and columns.
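Outside JMP, the same row and column swap can be sketched with pandas. The means below are made up, and the column names are hypothetical labels chosen for this example.

```python
import pandas as pd

# Made-up cluster means table: one row per cluster, one column per sample.
cluster_means = pd.DataFrame(
    {
        "Sample 1": [8.1, 6.9, 6.0],
        "Sample 2": [8.3, 7.2, 5.4],
        "Sample 3": [7.9, 6.1, 6.6],
    },
    index=pd.Index([1, 2, 3], name="Cluster"),
)

# Transpose so each sample is a row and each cluster is a column.
by_sample = cluster_means.T

# Label the new columns by cluster (hypothetical names).
by_sample.columns = [f"Overall Liking Cluster {c}" for c in by_sample.columns]
print(by_sample)
```

The resulting table has one row per DoE run, so its columns can be joined to the design table as separate responses.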
If you go to Transpose under Tables, we pick the columns we want to transpose, label by the cluster, and say OK. Now I can basically just copy and paste this into my DoE data table. That's what I did. I don't need to do it here because I've already done it. We're just going to close all this.
Now we have the three clusters. We used neural networks with SVEM to model this data. I'm going to show you the profiler of what we get out of it. Here are ingredients A, B, C, D, and E, and the attribute, low and high. When we break this out by cluster, we see something very interesting.
First of all, when we did the definitive screening design, it said ingredient A wasn't influential at all. It wasn't significant, so we took it out. Here, all of a sudden, it looks to me like ingredient A is doing something. Even more interesting is the fact that we seem to have two groups going against each other. Because, you see, as I decrease ingredient A, Cluster 3's averages go up, but Cluster 2's averages go down.
That means Cluster 3 doesn't want too much ingredient A, whereas Cluster 2 likes it. Then we see the same thing for ingredient B. The more ingredient B you put in, the more Cluster 2 likes it. The more ingredient B you put in for Cluster 3, the more they dislike it. That, for us, solved a bit of a mystery as to why there's a ceiling when we do our consumer testing, why there's a ceiling on how much people will like something, and why our overall liking scores did not go as high as we thought they should. Our consumer group is not homogenous, and we have people who like different things.
What do we do about this? That's the harder question to answer. That's a strategic decision that a company would need to make. Do you go after everybody? The group on the top likes everything; all their scores are always really high, so they'll be fine, but then the people in Clusters 2 and 3 are only marginally satisfied. That's one approach. The other approach is to say that Clusters 1 and 2 had higher scores to begin with. They must be more passionate about this product, so we should focus all our efforts on maximizing their satisfaction.
There's also the thought of, okay, Cluster 3, maybe we just need a different product offering for them that they would like better. The more you know, the more questions you typically have, but the more you can do and the more successful you'll be. That's what we got out of the profiler here, which was extremely useful.
The other thing is, there's always the question of, okay, the profiler says this, but is it true? To try to figure out whether or not it's actually true that Cluster 3 doesn't like ingredient A or B, and vice versa for Cluster 2, we decided to look at the verbatims and see what we could see in there.
If we go to the original consumer data table, we have the overall opinion over here, and what cluster they are in is on the right. We have a like verbatim as well, but we're going to look at the dislikes. We're going to use Text Explorer to see if there's any credence to what we're seeing in the profiler.
We did the Text Explorer by cluster. For Cluster 1, what is the most used term? Nothing. This is in dislikes, so that's good. That's what we want. For dislikes, we want them to say nothing. You can right-click and say Show Text, and we can get an idea of what people are saying. There's nothing I dislike. There's nothing I dislike. Nothing I can think of. Great.
Cluster 2, nothing is the most used term but as you can tell, it's not like it was in Cluster 1, where Cluster 1, it's 292 versus the next one is about half as much. Here, it's almost the same. People saying nothing and people saying flavor, and usually it's them saying, "Lack of flavor, not enough flavor, weakened flavor, low ingredient B flavor."
Then we have ingredient B for Cluster 2. We want to see what they were saying about ingredient B. What Cluster 2 is saying is, not strong enough ingredient B, not enough. Need more, not enough. Not enough ingredient B. Ingredient B taste was very strong, so we have one that was very strong.
Keep in mind, again, this is all DoE samples. Just across the board, they're saying it could use more ingredient B. The more ingredient B you put in there, the more I'm going to like this. That checks out with what the profiler was saying that Cluster 2 likes more ingredient B.
What about ingredient A? What does Cluster 2 think of that one? In this case, there's a lot of, they're saying way too much. That one's a little bit harder to read. There are some people that say, I didn't think it had enough ingredient A tasted bland. Not enough ingredient A, not enough ingredient A.
Keep in mind, when they give us dislikes, they don't have to say anything about ingredient A. You have to take this with a grain of salt. They're only saying it when they think it's off. But now if we go to Cluster 3, all of a sudden, flavor is the most used phrase. Ingredient B is right here, number three, so what are they saying about ingredient B now? Instead of mostly saying not enough ingredient B, what is this cluster saying? Not enough, not enough, there's a couple of those, but then too strong, too much, too much, too much. Slightly too much, strong. There are some not enoughs in here.
But you get the idea. In Cluster 2 we didn't see any people saying that there's too much ingredient B. Now we have a lot of people saying that there's too much, which checks out with what the profiler was showing us and gave us more confidence that these two clusters do differ in their liking for ingredient A and ingredient B.
Then, if we go to see what they said about ingredient A: too much, too much. A little ingredient A, meaning too much, a little too much. You can see that also checks out. They don't want too much ingredient A, and they also don't want too much ingredient B. It's a little bit of just exploration. I'm sure there's a lot more we could dig into in Text Explorer to put some numbers behind this, but we were just using it as an exploratory tool. Just as we use visualizations like graphs to explore data, we're using Text Explorer to explore what people are saying, and it checks out with the profiler.
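One way to start putting numbers behind the verbatims is to count a few direction phrases per cluster. The sketch below uses only the Python standard library; the comments are made-up paraphrases for illustration, and the phrase list is an assumption rather than anything produced by Text Explorer.

```python
from collections import Counter
import re

# Made-up (cluster, dislike comment) pairs for illustration only.
comments = [
    (1, "Nothing I dislike"),
    (1, "nothing I can think of"),
    (2, "Not enough ingredient B"),
    (2, "Needs more flavor, not enough flavor"),
    (3, "Too much ingredient B"),
    (3, "Flavor too strong, too much"),
]

# Phrases that indicate the direction of a complaint (an assumed list).
PHRASES = ("not enough", "too much", "nothing")

counts = {}
for cluster, text in comments:
    lowered = text.lower()
    tally = counts.setdefault(cluster, Counter())
    for phrase in PHRASES:
        # Whole-phrase matches only, case-insensitive via lowercasing.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        tally[phrase] += len(re.findall(pattern, lowered))

for cluster in sorted(counts):
    print(cluster, dict(counts[cluster]))
```

Comparing the "not enough" and "too much" tallies across clusters is a crude check on the direction the profiler suggests; reading the matched comments is still needed for context.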
In the end, we found an answer to why there seems to be a ceiling on how high our scores can be. We also discovered different preferences for different groups of people, which is really powerful information that we did not have before, and now that we have that information, we can do something about it.
Okay, so the last thing we're going to do is take a look at how well the definitive screening design predicted compared with the space-filling design. We only have four validation samples here, so we have to take it with a grain of salt, but we're going to see if there is any indication that one does a better job than the other.
We have four different models. One uses the cluster means, weighted based on how big the clusters were. One just uses the average of all consumers. Then we have two different models that were developed from the definitive screening design. One is just using Fit Definitive Screening as is. Another was something that a colleague was taking a look at.
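The size-weighted combination of cluster predictions mentioned here can be sketched as a weighted average. The cluster sizes and predicted scores below are made up, and `weighted_overall` is a hypothetical helper name.

```python
def weighted_overall(cluster_predictions, cluster_sizes):
    """Combine per-cluster predicted liking into one score, weighted by cluster size."""
    total = sum(cluster_sizes.values())
    weighted_sum = sum(
        cluster_predictions[c] * cluster_sizes[c] for c in cluster_sizes
    )
    return weighted_sum / total

# Made-up numbers: panelists per cluster and predicted liking for one formula.
cluster_sizes = {1: 30, 2: 50, 3: 40}
cluster_predictions = {1: 8.2, 2: 7.0, 3: 6.1}

overall = weighted_overall(cluster_predictions, cluster_sizes)
print(round(overall, 2))
```

Weighting by cluster size keeps the combined prediction on the same footing as an all-consumer average, while the underlying models are still fit per cluster.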
What do we see? The first thing is there was a bias between the two years. For whatever reason, the scores were higher in 2023 than when we did the test in 2024. Just overall, everything was shifted down when we did the space-filling design. I think that's just due to variability in testing, or maybe the people, for whatever reason, were in a bad mood. I don't know.
As a scientist, as an R&D scientist, what gives me a little bit of heart in this model is that the line is not quite parallel, but it's almost parallel with the line that I want it to be. If it was a 100% accurate prediction, this is y equals X, that's what this black line is. It seems to do a decent job with the limited information that we have, considering the biases there.
Here's the model where we just took the overall sample means, and there's essentially no difference. But here's where we see some differing results. Here's the Fit Definitive Screening model, and the lines are not close to being parallel. We see that the lines cross. It doesn't give us as many warm fuzzies that this model is really able to predict future panelists' scores.
Then this other one was having an even tougher time. For example, this one over here predicted a value of 7.9, but we got a value of seven. That's where we were disappointed when we went and did the validation test in 2023.
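The comparison against the y = x line can be summarized with two numbers: the slope of actual versus predicted scores, and the average offset between them. Here is a minimal sketch with made-up validation values; these are not the study's data.

```python
import numpy as np

# Made-up predicted and observed mean liking for four validation formulas.
predicted = np.array([7.9, 7.6, 7.4, 7.1])
actual = np.array([7.0, 6.9, 6.6, 6.4])

# Least-squares line of actual on predicted. A slope near 1 means the fitted
# line runs roughly parallel to y = x, even if it is shifted.
slope, intercept = np.polyfit(predicted, actual, deg=1)

# Mean offset between observed and predicted scores (negative means the model
# predicted higher than what consumers gave).
bias = float(np.mean(actual - predicted))

print(round(float(slope), 2), round(bias, 3))
```

With only four points, both numbers are rough indications rather than firm estimates, which matches the grain-of-salt caveat in the talk.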
I hope that this was useful and that you gained a greater appreciation for the diversity of people, and especially the diversity in our preferences for food. I hope that you are able to use this to get more information out of consumers. Thank you for listening.