In analyzing consumer insights for laundry products, the distinction between laundry diary data and panelist-level data is crucial for understanding user experience and detergent efficacy. Laundry diary data offers real-time insight into actual consumer behaviors and preferences, reflecting genuine usage patterns. In contrast, panelist-level data, derived from consumers’ experiences over a few weeks, can be influenced by post-rationalization, where perceptions may shift due to expectations or marketing, potentially distorting product performance evaluations. This highlights the need to rely on laundry diary data for a more accurate assessment of product effectiveness.
To present these insights, we utilize JMP software for advanced statistical analysis and data visualization. Key functionalities include combining data tables for a comprehensive overview, using data cleaning tools to maintain integrity, and applying Partial Least Squares (PLS) modeling to explore variable relationships. Additionally, modeling scenarios and Monte Carlo analysis are employed to simulate consumer behaviors and predict outcomes under uncertainty. By prioritizing laundry diary data and utilizing these analytical tools, we aim to enhance product development and marketing strategies, ultimately improving consumer satisfaction and brand loyalty.
Hi, everyone. Welcome to our brief poster session on connecting consumer and technical data, where we predict consumer product satisfaction by leveraging laundry diary data. Now what do we mean by this? When we do consumer studies, we typically have two different levels of data. One is diary data, where we ask consumers to record, while they're doing their laundry at home, all of the information that we can capture.
For example, what are the washing machine temperature settings? How much laundry goes into the washing machine? How many products do they use? All of this information gets captured. At the same time, we also ask consumers to give us one overall assessment of the product, which tells us more about what the consumer's intent is. Now, we know that there can be a discrepancy between one and the other, because there's some post-rationalization happening with the consumer.
However, if we want to make a useful impact on the product, we want to understand this more from a technical point of view, and we want to interact with the consumer at that technical level so that they give us better ratings in the after-use part of the data. This session is about how we connect the two sets of data to get to the right insights. We will show you a demo of how we control the quality of our data; we have a custom JMP tool where we can weed out some of the bad data. We will show you how we aggregate the diary analysis and the after-use analysis.
We will show you how we then model with these data, and later on I will hand it over to my colleague, Zhiwu, who will show partial least squares modeling and Monte Carlo analysis and really go deep into the technical details of the statistics. Further on, this will be the outcome we are able to show: we know that we need to maximize top-box ratings versus bottom-three-box ratings, which we want to keep as low as possible; Zhiwu will tell you more about why. We can also show you how we get to the final model at diary level and how we understand which metrics we need to focus on when designing our products to deliver the best possible consumer metrics. But let me start with how we clean the data.
Let's look at an example of how we clean the data. Here we have combined after-use and consumer data, with one column containing a red herring question. This question is intended to weed out those consumers who are not paying attention while filling out the questionnaire. At the same time, we have other metrics we can use to get to a better-quality database. We have a tool for this, which we call the Grinch, where we can add all of our direct questions and our red herring question in one go.
Here in the subscreen, we need to indicate what we want to screen for. We can screen for out-of-range data, flat-liners, missing data, and red herring misses. We also want JMP to add that new information into an additional column, which we call the Grinch, plus additional columns that explain the issues it finds, because we're asking for several metrics at the same time. For example, for the out-of-range data, we need to specify the range of responses we expect on each and every question.
Same thing for the red herring question: these are the possible responses a consumer could give. The flat-liner check looks at how big the standard deviation is across the different questions, which gives us an indication of whether a consumer is just checking the middle box on everything or really contemplating the responses in the survey. Here, for example, we can check whether a high percentage of the questions is answered in the same way; let's say we take this as 90%.
It will also check for missing data. Then finally, we need to put in the response we expect a consumer to give for the red herring question; in this case, it was "fair." Then we click Submit, JMP does its analysis, and it adds columns at the back end of our database.
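A rough equivalent of that screening logic outside JMP might look like the pandas sketch below. This is not the actual Grinch tool; the column names, valid range, flat-liner threshold, and expected red herring answer are all assumptions made for illustration.

```python
# Illustrative sketch of consumer-level data screening (not the P&G "Grinch" tool).
# Column names and thresholds are hypothetical.
import pandas as pd

def grinch_flags(df, rating_cols, valid_range=(0, 100),
                 red_herring_col="red_herring", red_herring_answer="Fair",
                 flatline_threshold=0.9):
    """Return one row of data-quality flags per respondent."""
    ratings = df[rating_cols]

    out_of_range = ((ratings < valid_range[0]) | (ratings > valid_range[1])).any(axis=1)
    missing = ratings.isna().any(axis=1)
    # Flat-liner: a large share of a respondent's answers are the same value (e.g. >= 90%).
    flatliner = ratings.apply(
        lambda row: row.value_counts(normalize=True).max() >= flatline_threshold,
        axis=1,
    )
    red_herring_miss = df[red_herring_col] != red_herring_answer

    flags = pd.DataFrame({
        "out_of_range": out_of_range,
        "missing": missing,
        "flatliner": flatliner,
        "red_herring_miss": red_herring_miss,
    })
    flags["exclude"] = flags.any(axis=1)  # combined flag, analogous to the Grinch column
    return flags
```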
Now what we see is that a few consumers have missed the red herring question. For us, this is an indication that those consumers were not really paying attention while filling out the questionnaire, so in this case we will exclude them from the database. I will also give a quick indication of how we combine the diary data and the after-use data into one big file. When we start with the diary information, we use the Tabulate function, and we look specifically at the consumer ID.
Importantly, we need to set this to ordinal, because otherwise we won't get one line per consumer, and in this case we really want one line per consumer as output. We're interested in the overall satisfaction for each and every wash load of the consumers, so we drag this over here, in the right format, and we want the row percentage rather than the sum.
We now have the percentage of responses at load level for each consumer. Consumer number 10, for example, gave a top-box response on 7.69% of their loads. From here we can create an additional data table, and this information, in turn, we can add to our after-use data table.
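As a rough stand-in for that Tabulate step outside JMP, a crosstab normalized by row gives the same one-line-per-consumer percentages; the file and column names here are assumptions.

```python
# One row per consumer, with the percentage of wash loads receiving each rating.
# "diary_data.csv", "consumer_id", and "load_satisfaction" are hypothetical names.
import pandas as pd

diary = pd.read_csv("diary_data.csv")

load_pct = (
    pd.crosstab(diary["consumer_id"], diary["load_satisfaction"], normalize="index")
    * 100
)
# e.g. load_pct.loc[10, 100] -> 7.69 would mean consumer 10 gave a top-box (100)
# rating on 7.69% of their loads.
```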
We can join this with the product information data that we started with in the beginning, making sure both tables are open in the same JMP file. We do this by going to Tables and then Join, where we add the new untitled data table, look at the consumer number, put both columns into the same format, and match those numbers exactly. This way, we can add all of this information into one data file. Once we have our set of cleaned data, with the combined product data and diary data, we can start with the modeling part. Now, I will hand it over to my colleague, Zhiwu.
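Outside JMP, the join itself would be a simple merge on the consumer number, again with hypothetical table and column names.

```python
# Hedged sketch of the Tables > Join step: attach the per-consumer load
# percentages to the after-use / product-level table on consumer number.
import pandas as pd

after_use = pd.read_csv("after_use_data.csv")  # hypothetical file name

combined = after_use.merge(
    load_pct.reset_index(),  # one row per consumer, from the previous sketch
    on="consumer_id",        # must have the same type/format in both tables
    how="inner",
)
```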
Thank you, Nathan, for a great presentation and demo showing how to clean the data and how to join the product-level data and the summarized diary data together for modeling. Hello, everyone. My name is Zhiwu Liang, statistician at P&G. Nathan is my colleague, the product research expert.
This is the JMP project I worked on with Nathan to show how JMP can help us build a model linking the diary data and the consumer response data. In this JMP data table, we have the overall consumer satisfaction for each of the products; each consumer evaluates one product and gives it a rating. This rating is based on two weeks of research in which the consumer gives their perspective on each wash load. For each wash load, the consumer gives 100 if they're very satisfied, or 0 or 25 if they are disappointed by the wash result.
To use the load-level data to predict the product evaluation, which is the overall satisfaction with the product, we need to build a model. The first question we need to answer is what model to build. We first check the correlation between the load-level data and overall satisfaction, which we can run with JMP's Multivariate analysis (correlations).
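A quick stand-in for that correlation check outside JMP, assuming the combined table built in the earlier sketches and assuming the rating-percentage columns are named by their rating values, might look like this.

```python
# Correlations among load-level rating percentages and overall satisfaction,
# analogous to JMP's Multivariate (correlations) platform. Names are assumptions.
rating_cols = [0, 25, 50, 75, 100]
corr = combined[rating_cols + ["overall_satisfaction"]].corr()
print(corr.round(2))
```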
As you can see here, all of these load-level ratings are highly correlated with each other. That's obvious, because if the product is good the consumer will give a high rating, either 75 or 100, and if the product is bad they will give a low rating. There's a negative relationship between the lower ratings, like 0 and 25, which we set as the bottom box, and 100, which we set as the top box because it's the top rating. On top of that, because the load satisfaction scores are summarized as the percentage of each rating across a consumer's washes, the percentages sum to 100. With this constraint, ordinary least squares regression cannot work.
That's the reason we use a technique called partial least squares. In JMP you can easily do that: in Fit Model, put the overall satisfaction in Y and all of the load satisfaction variables in the construct model effects, change the Personality to Partial Least Squares, and run it. We tend to use the leave-one-out cross-validation method to identify the best number of factors for the PLS. At P&G we normally use the one-factor solution, because we want to keep all of the correct correlations between the predictors and the response.
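As a hedged sketch of the same workflow outside JMP, scikit-learn's PLSRegression with leave-one-out cross-validation does the equivalent factor selection. The column names and candidate factor range below are assumptions, and this is an illustration rather than the exact P&G model.

```python
# PLS with leave-one-out cross-validation, mirroring the Fit Model / PLS step.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = combined[[0, 25, 50, 75, 100]].to_numpy()      # % of loads at each rating
y = combined["overall_satisfaction"].to_numpy()    # after-use product rating

loo = LeaveOneOut()
for n_factors in range(1, 5):
    pls = PLSRegression(n_components=n_factors)
    # Negative MSE: values closer to zero indicate better cross-validated fit.
    score = cross_val_score(pls, X, y, cv=loo,
                            scoring="neg_mean_squared_error").mean()
    print(n_factors, score)

# A one-factor solution keeps the sign of each predictor/response relationship
# interpretable, as described above.
final_model = PLSRegression(n_components=1).fit(X, y)
print(final_model.coef_)  # coefficients for each rating percentage
```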
As you can see here, if the percentage of 100 ratings is high, then the consumer definitely gives a very high rating for the product. From the profiler you can also see how these things change: for example, as the 100 box, which we call the top box, increases, the overall product satisfaction rating goes up, because the consumer is satisfied with the product performance. After identifying how each rating-percentage box impacts the product response, we also want to see how clearly each level influences the overall prediction.
To answer that question, we first need to set the constraint, because we have to simulate data from this model's prediction based on the constraint in the data. Remember, we said that all of these rating percentages for the same consumer sum to 100%, so that has to be built in. Second, we need to run the simulator in JMP, which is very easy: you go to the simulator, set it up, and generate the Monte Carlo simulation analysis.
Here we just say: given the distributions currently in our data, use a randomizer to generate 10,000 rows for the Monte Carlo simulation, which gives us the Monte Carlo simulation result. The first five columns are our predictors, the distribution of the consumer's answers for each rating; the sixth variable is the predicted value for the overall product satisfaction. From this, if you go to the Graph Builder function and choose the prediction value in X and all five rating variables in Y, you can clearly see the trend.
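A minimal Monte Carlo sketch under the sum-to-100% constraint is shown below. A Dirichlet draw is just one convenient way to respect that constraint; JMP's simulator works from the empirical distributions in the data, and the concentration parameters here are purely illustrative.

```python
# Monte Carlo simulation of rating-percentage profiles that sum to 100,
# then scoring them with the fitted PLS model from the previous sketch.
import numpy as np

rng = np.random.default_rng(2024)
n_sim = 10_000

# Five columns: simulated % of loads rated 0, 25, 50, 75, 100 (each row sums to 100).
rating_pct = rng.dirichlet(alpha=[1.0, 1.0, 2.0, 4.0, 4.0], size=n_sim) * 100

# Sixth column: predicted overall satisfaction for each simulated profile.
predicted = final_model.predict(rating_pct).ravel()

# Plotting predicted vs. each rating percentage (as in Graph Builder) shows the
# trends: the top-box column rises steeply, the bottom boxes slope downward.
```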
The 100 rating, which we call the top box at P&G, gives the steepest increasing line, meaning that if the consumer gives the top box, we have high confidence that the consumer will be very satisfied with the product. But if a consumer gives a 0 rating, a 25 rating, or sometimes even a 50 rating, which is not a bad rating in itself, and the consumer is not satisfied, then they will give a lower overall score. If we put the equations here, as you can see, a 10% increase in the top box gives almost a 4-point increase in the overall rating, while for the second box, a 10% increase may give you almost a 6-point decrease in the predicted overall rating.
That result shows us that we have to focus on the top two boxes, meaning 75 and 100, especially 100, because they contribute the most to increasing the final product evaluation, and we have to minimize the values for 0, 25, and 50 to prevent the product's overall rating from going down. It's not only that: we can also use the diary data to evaluate how we can keep our load ratings out of the bottom box and increase the top box. For example, here is our load data, where we also have technical measurements for our products. For confidentiality reasons, we label them Lx1 to Lx7. These are the technical measurements of the product for each load.
As you can see here, this panelist, number 10, has 26 washes with the same product, and for each wash they give a load satisfaction rating, like here, with different distributions. How can we use Lx1 to Lx7 to predict, or explain, the load-level satisfaction? We use PLS, the same approach I showed before, to build a model. After building the model, you can see the ranking of each coefficient.
The coefficients clearly show that the technical measurements Lx1, Lx2, and Lx4 are the most important, because the higher coefficients have the biggest impact on our load-level rating. Note that the load rating here is not the average; we use the top box, meaning the 100 score. What we do is simply recode the original 0-to-100 score: every time the consumer is extremely satisfied, we code it as 100, and every time they are not extremely satisfied, we code it as 0.
The PLS model then links this recoded score to the top-box percentage. As you can see here, if Lx1, Lx2, and Lx4 are at a middle level or a little bit higher, we get quite a high top-box percentage. But if those measurements are very low, for example the average value is very low, then the predicted top box goes toward 0 or even negative; of course it cannot really go negative, but it means the consumer is not very satisfied. That indicates that Lx1, Lx2, and Lx4 are the focus areas to help us bring the consumer top box higher.
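As a hedged illustration of that load-level model, the recode plus a one-factor PLS on the technical measurements might look like the following outside JMP; the file and column names are assumptions.

```python
# Recode the 0-100 load score to a top-box indicator and fit PLS on Lx1-Lx7.
import pandas as pd
from sklearn.cross_decomposition import PLSRegression

loads = pd.read_csv("load_technical_data.csv")   # hypothetical file name
tech_cols = [f"Lx{i}" for i in range(1, 8)]

# Top-box recode: 100 when the consumer is extremely satisfied, 0 otherwise.
loads["top_box"] = (loads["load_satisfaction"] == 100).astype(int) * 100

pls = PLSRegression(n_components=1).fit(loads[tech_cols], loads["top_box"])

# Rank the technical measurements by coefficient magnitude to find the focus areas.
coeffs = pd.Series(pls.coef_.ravel(), index=tech_cols).sort_values(key=abs, ascending=False)
print(coeffs)
```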
In summary, what Zhiwu showed is that once we have all of the data combined, we can create great models, analyses, and predictions. We learned that we need to focus on delivering the top box and minimizing the bottom-three-box results at load level, so that we maximize the top box in the after-use data. Second, we need to focus on the specific technical metrics, Lx1, Lx2, Lx4, and then Lx3, so we maximize the output from both a technical and a consumer point of view. Very briefly, in summary, that's it.