Converting Binary Responses to Continuous in Product Development Using DOE - (20...

This talk is titled Expanded Uses of Converting Binary Responses

to Continuous Responses in Consumer Product Development.

It's a bit of a mouthful,

but I promise it won't be that complicated.

My name is Curtis Park. I'm a principal scientist at HP Hood.

HP Hood is a food and beverage company.

We make a lot of different milks, nondairy milks.

We also make yogurt, cottage cheese, ice cream.

So a lot of fun things to taste at work.

I'm a food scientist by education.

A few years ago I was asked to take a look at a problem

that we had for one of the beverages that we were producing.

I'm going to show you a video just so you can see.

But we were getting a lot of consumer complaints

and these complaints were happening

when the product was close to the end of shelf life.

As you see in this video,

it's pretty obvious why people were complaining.

I think I would complain if I saw something like that too.

It's supposed to be a nice pourable beverage.

It's thick and chunky when it's being poured out.

Not what I would expect.

Believe it or not, this product was not spoiled.

I promise you, it was not spoiled.

So I was asked to take a look at this and figure out how can we fix it?

What's the problem? How do we fix it?

HP Hood at the time, this was a few years ago.

We were early on in our journey with using JMP,

and so I was really excited to have an application to use in real life

rather than just reading about it or learning about it.

Naturally I felt like this, like, yahoo! Let's run a DoE, let's do it.

I was really excited

and for those of you who might not have as much experience doing DoE,

the first step is usually taking a look at what factors should I be looking at.

So we did a few experiments.

If you can forgive me, they were probably one factor at a time experiments.

But we narrowed in on what we believed were the key ingredients

that could have been causing the problem.

We ended up making a design.

This is probably the fourth or fifth iteration of the design

that we came up with,

and this was in custom design.

So if you go to custom design,

that's the platform that we used to generate this DoE.

As you can see, this is what we had.

So we had ingredients A, B and C, and it was actually a response surface.

So we had all of the two way interactions

and the quadratic terms built into the model.

It ended up being 17 runs, as you can see here.

It's 17 different treatment combinations.

This much A, this much B, this much C for each run.
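To make the size of that model concrete, here is a minimal Python sketch that lists the terms of a full response surface model in three factors: an intercept, the main effects, the two-way interactions, and the quadratic terms. The function name and the factor labels are illustrative assumptions; the actual design was generated in JMP's Custom Design platform, not with this code.

```python
from itertools import combinations

def response_surface_terms(factors):
    """List the terms of a full quadratic (response surface) model."""
    terms = ["Intercept"]
    terms += list(factors)                                        # main effects
    terms += [f"{a}*{b}" for a, b in combinations(factors, 2)]    # two-way interactions
    terms += [f"{f}^2" for f in factors]                          # quadratic terms
    return terms

terms = response_surface_terms(["A", "B", "C"])
runs = 17  # number of runs in the design described in the talk
print(len(terms), "model terms;", runs - len(terms), "runs left over beyond the model terms")
```

With three factors this gives 10 terms, so a 17-run design leaves some runs beyond the minimum needed to estimate the model.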

Once we'd settled on this design,

we were really excited. Let's go solve this problem.

Piece of cake, right?

You go into the lab, into our pilot plant,

you throw some things together, the beverage comes out.

I'm making it a lot more simple than it actually is.

We made 17 different beverages

and then we put them on the shelf for a little while

because as I mentioned earlier,

it takes a little bit of time for this problem to appear.

We put them on the shelf for a while, and they sat until they were ready to be analyzed.

This is just a screenshot of a data table.

This has the actual design that we used.

As you can see, there's a column here to the right that I highlighted.

It's our friend, the Y, our response column.

So once we got to the point where we were ready to measure that chunky pour,

we started thinking, oh, how are we going to measure that?

Because at the time, we did not have a chunky-pour meter.

I've never heard of one. I've never found one.

If anyone has ever found one, we'd love to see it and maybe buy one.

But to our knowledge, it doesn't exist.

So what options did we have to measure this?

Because if you can't measure it, a DoE is really not that useful.

So we had a few options.

First thing is we can measure everything as a binary response.

So it's either pass or fail, good or bad, etc.

There's some pros with this and some cons.

The pros would be it's pretty simple to do, right?

Anybody can say pass or fail, and it takes you no time to measure it.

However, it has some serious cons to it.

Such as, it's really subjective to the observer.

What I think is good,

a colleague of mine might think is bad.

Or even worse, what I think is good, my boss might think is bad.

So it's really subjective.

While binary responses can give you some information,

they don't give us as much information as we want.

Because when you do logistic regression, what you get out of it really are just

probabilities of something passing or probabilities of failing.

In my experience, that's been difficult to communicate

and to really understand what to do with that data,

especially when we're trying to communicate with non-technical people.

So if there's any way to get a continuous response,

that's what we strive for, because it gives us a lot more information.

We can know how good it is or how bad it is,

because not all "good" is created equal.

There's another option we could have done

and I would say this is probably the best option

if you can do it, is we could run consumer testing and get consumer input.

What this would look like is I have all our beverages, 17 beverages,

and we recruit maybe 100, 120 consumers of our product

and we have them sit down and rate every single one

for different attributes,

one of them probably being how well do you like how this pours?

The reason why this is a gold standard

is because those are the people whose opinions matter to us.

After we get 100 or 120 responses,

we take a look at the data we get,

take averages, and put those averages into our model.

However,

it can cost a lot of money and it can take a lot of time.

So if your budget doesn't allow it

or your timeline for whatever reason doesn't allow it,

you can't do this for everything.

Sometimes the thing you're trying to measure

isn't such a huge problem that you're trying to solve

that it's worth spending all that money.

But it would still be important to be able to measure it.

Do you have any other options?

I mentioned this earlier.

You can find an instrument that can measure what you're looking for.

Sometimes they exist.

Like I said, I don't know of a chunky-pour meter.

I looked in our warehouse in our R&D center, couldn't find one.

Even if you can find one,

if it's something really specialized

that you're not going to use very often,

it doesn't make sense to buy the piece of equipment. Or it could be something

that would be really great, but it requires a lot of expertise

that maybe your R&D or technical department doesn't have, or

just doesn't have the time or resources to deal with.

I'm going to show you the last option we have here.

What I'm going to talk about is training a group of people how to rate that attribute

of interest and then letting them give you all the ratings.

This is not quite as good as having actual consumers.

But here we're trying to take subjectivity out of it

and make it objective.

When well trained, humans can be great measuring instruments.

I'm going to walk you through what we've done at Hood

when we have some hard-to-measure attribute.

We're going to use the case study of this chunky pour.

This is our roadmap.

I'll walk you through this and then we'll actually do it live.

The first thing I wanted to get across

is that the samples that you produce from DoE can be used for many purposes.

I like to tell people that your samples are like gold

and you should treat them like gold.

They're very valuable.

You may do a DoE thinking that you're trying to answer one question,

but something else might pop up later

that you would be able to use those samples to answer that question as well.

I've had that happen to me many times, so sometimes it's good to

just ask yourself the question:

I've done all this work to make 17 different beverages.

What else can I do with them?

What else can I learn?

In our case, we use these samples as a "calibration set"

so that we can teach our humans, my colleagues,

how to measure this chunky pour.

So here's our method.

The first thing we do is review

all the samples with a small group, maybe one, two, or three people who are

really knowledgeable on the subject or are responsible for the project.

What you do is you look at all the samples

and decide which samples should be used to train the raters.

We're trying to build a scale, essentially, and then we'll take that scale

and we'll get our friends, let's say 10, 15, or 20 friends, to actually rate

these samples for us after we've trained them.

Training is step two;

having them rate each video is step three.

It could be a video, or it could be something else: a picture,

or it could be actually them pouring out the product

if you have enough, etc.

You can get the idea.

Next, we'll take the average of all those ratings.

We'll look at the data, make sure there's nothing funky in there

and then we will use those average values to build a model.

Let's start with, oops.

Let's start with steps one and two.

So we're going to assume that we've looked at all the videos,

and the way we typically do it, because it's a little easier,

is you start off answering the question, which one is the lowest in chunky pour?

That would be this one right here, number one.

I'm going to play each one of these.

Just to make it clear, this is our scale.

It's a continuous scale from 1 to 10, and the 1 to 10 is kind of arbitrary.

If you have something that works better for you, then great.

The video right above it corresponds to that.

So this first video corresponds to a one.

So as you can see while we're watching this video,

it pours nicely, with no rippling and no chunkiness.

Pours as expected. Beautiful.

That's the easy sample to identify,

and then in the sample set, we ask ourselves, okay, which one is the worst?

In this case, it was pretty obvious.

I will tell you again, this product is not spoiled.

This is just from changing a few ingredients.

You can see it's so thick, we can't even get it out of the bottle.

So that's obviously a 10.

Then we did a little bit of work

to try to figure out, okay, which one should we consider to be a five?

So halfway in between.

This one, you can see it still flows, but there is chunkiness to it.

Then maybe a two and a half would be this one.

See it has a little less chunkiness to it.

Flows well, probably with normal shaking.

It'd probably be fine.

So there's a little bit of subjectivity,

but you add more people to make it more objective.

Then the last one.

This is seven and a half.

So you can see it's very, very chunky.

The only thing that really is differentiating it from number ten is

that we can get it out of the bottle; it still flows.

But as you can see, it's pretty thick.

Basically, in this amount of time,

I could train the people that are going to help us

to measure this chunky pour.

Then we'll have them rate once we've trained them.

I'll basically do what I just did.

Maybe we'd take a little bit more time

to be more specific with certain things we want them to be looking for.

If what you're having someone rate is a lot more complicated,

then you'll probably need to take more time training people.

This one wasn't very complicated,

and we're really just looking for people's first impression.

After that, you have them rate all the videos.

I like to use Microsoft Forms just because it's easy and I can get the data

really quickly and easily, but you can use whatever you want,

including paper, although that takes more time and I try to avoid that.

Just to show you what our forms look like,

here's a preview of one.

This is as if you're doing it on your phone.

I like to make everything as simple as possible,

and everybody always has their phone, so if people can do it on a phone,

that's my goal.

I'm just calling it Chunky Pour DoE,

and then they just go through and rate each one.

So chunky pour for treatment one:

I'll say, I don't know, that one was a six,

and we're just asking people for the first impression.

There's no right or wrong answers.

Usually people's first impression is right.

So that's why I'm asking people not to think too hard on it.

Maybe number two is a ten, and number three was a three.

I don't know.

They would go through all of those.

Then we would get our data and then using JMP

we would average all those ratings

and then we put the data into the data table to build the model.

So we're going to get out of PowerPoint for a second and we'll go

to Excel.

This is what I get when I export the data from Microsoft Forms.

Like I said, you don't have to use this, use whatever works for you.

As you can see, ID is the rater number,

just a number,

not random, but just an identifier for each person.

I left it anonymous so we don't

criticize people who maybe didn't do as well as everybody else.

In this case, this data is actually real.

I took this to a college food science class and had them do this.

So this is actual college students

rating the videos.

And as you can see, we have all these columns, a column for each one.

So person one

rated treatment one an eight, treatment two a four,

treatment three a nine, etcetera.

So we want to put this into JMP.

I like to use the JMP add-in

in Excel, right here.

As long as you're only highlighting one cell

when you click Data Table, it'll import everything.

I've noticed that sometimes

I'll accidentally have like just a portion of the data

highlighted, and if you click Data Table then,

it's only going to import what you highlighted.

So either highlight everything or only highlight one cell.

Once you hit that data table button,

you will get something like this.

So this is our data.

Just to show you where we're trying to get to in the end

with this data table, because we have to manipulate it a little bit:

This is our data table for the DoE.

For each run, it has

how much of ingredient A, B and C was in there.

I also put in, and we'll talk about this in a minute,

my own call on whether I thought something passed

or whether I thought something failed.

In the end, we need one more column that says Chunky pour.

We'll call it continuous.

And we'll have an average rating for run one,

an average rating for run two, three, four, five, etc.

If we look at this data table as it is today,

it's not in that format, because we need all these

columns to be rows and we need the ratings to be in one column.

There's probably a thousand different ways we could do this in JMP

and they're all good and they're all correct.

I'm going to show you one way to do it.

It's just the one that works for me.

First, what we're going to do is we're going to stack

all of the columns on top of each other.

Then we're going to do a summary table

that has the average and maybe we'll also add in the standard deviation for fun.

But the very first thing that I've always been taught to do when you get data

is to graph the data and look at the plot.

So we're going to actually look at the distribution really quickly.

So if we go to Analyze...

There we go. Analyze, Distribution.

We want to look at the distribution for all of the treatments.

I'm just going to highlight them,

put them in the columns, and say OK.

I'm just looking to see is there anything

weird about this data that we should be concerned about?

So when I look at treatments 1, 2, 3, etcetera,

I'm looking for outliers.

For example, for treatment three, almost everybody rated this sample between 1 and 6.

There was someone up here who rated it really high,

and there's also someone up here that rated this one high.

So what I like to do is click on this,

and it'll highlight where...

So this row represents one rater, one person.

So I'm going to see how they rated everything

and you can see they tend to be an outlier.

The nice thing in JMP is that once you highlight one row,

it highlights that row in all the other responses.

So I can see that, yeah, they rated 3 higher and 4 higher.

We go down, look.

Treatment eight. They're the opposite of everybody.

It seems like for some reason,

in the training, they got a little confused

and they thought a higher number meant lower chunkiness and vice versa.
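The talk spots this rater by eye in the Distribution platform. As a complementary check, and not the method used in the talk, one could flag raters whose scores correlate negatively with the average of everyone else. The sketch below assumes made-up ratings and hypothetical function names.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Made-up ratings: rater ID -> ratings for the same five samples.
ratings = {
    1: [9, 2, 8, 3, 1],   # appears to use the scale backwards
    2: [1, 8, 3, 7, 10],
    3: [2, 9, 2, 6, 9],
    4: [1, 7, 4, 8, 10],
}

def flag_reversed(ratings):
    """Return rater IDs whose ratings run opposite to the other raters' average."""
    flagged = []
    for rater, own in ratings.items():
        others = [r for k, r in ratings.items() if k != rater]
        consensus = [mean(col) for col in zip(*others)]  # per-sample mean of the others
        if pearson(own, consensus) < 0:
            flagged.append(rater)
    return flagged

suspects = flag_reversed(ratings)
print(suspects)
```

A flagged rater is only a prompt to look closer; whether to drop their data is still a judgment call.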

So what I'm going to do is since I have this row highlighted,

I'm going to close this, it'll stay highlighted.

So this is row one.

I'm just going to delete this data

and then we'll move on.

Now we feel pretty comfortable that the data is pretty much solid.

Like I said, we're going to stack the columns.

If we go to Tables, Stack,

it's going to pop up and we just want to stack

all 17 of the treatments.

The nice thing is in JMP 17, now you get this preview.

I love the preview so then I know if I'm doing things right.

What we see here is,

as you can see,

it'll have the ID, so the rater, and then their chunky pour rating for treatment 1.

They gave it a five, and they gave number two a seven.

This is how we want the data structured, and we can change the column names.

So instead of Data, we're just going to say Chunky Pour

Continuous.

Then for label, I'm just going to call it run because that's really

what we're going to use this for in a minute.

I just stack it.

So I say, okay, that's how I want it.

Now we have the data table in this format, which lets us use a summary table.

Summary tables are a nice way to

make a table of different statistics.

So what we're going to do is highlight

the Chunky Pour Continuous column and, under Statistics,

choose Mean.

For fun, in case we want to use it, we'll also add standard deviation.

This just gives us the overall mean and standard deviation.

But if we want to do it per run,

I'll highlight Run and put it here in Group.

Now when we look at this preview, we have one through 17

and conveniently, they're in order.

One, two, three, four, five, six, seven, eight.

All the way to 17.

We have the mean and the standard deviation.

So we're going to say, okay.

Okay, so we have one more table.

Now we're at the point where we need to be,

because I have each run as a row

and have a column for the average and a column for the standard deviation.
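For anyone working outside JMP, the same stack-then-summarize reshaping can be sketched in plain Python. This is only an illustration of the idea, with made-up ratings and assumed column names; it is not what the Tables > Stack and Tables > Summary dialogs run internally.

```python
from statistics import mean, stdev

# Made-up wide export: one row per rater, one column per treatment.
wide = [
    {"ID": 1, "Treatment 1": 2, "Treatment 2": 7, "Treatment 3": 4},
    {"ID": 2, "Treatment 1": 1, "Treatment 2": 8, "Treatment 3": 5},
    {"ID": 3, "Treatment 1": 3, "Treatment 2": 9, "Treatment 3": 3},
]

# Stack: one row per (rater, run) pair, with the rating in a single column.
stacked = [
    {"ID": row["ID"], "Run": int(col.split()[-1]), "Chunky Pour Continuous": value}
    for row in wide
    for col, value in row.items()
    if col != "ID"
]

# Summary: mean and sample standard deviation of the ratings, grouped by run.
summary = {}
for run in sorted({r["Run"] for r in stacked}):
    run_ratings = [r["Chunky Pour Continuous"] for r in stacked if r["Run"] == run]
    summary[run] = {"mean": mean(run_ratings), "std": stdev(run_ratings)}

for run, stats in summary.items():
    print(run, stats["mean"], round(stats["std"], 2))
```

The per-run means are what get pasted into the DoE table as the continuous response.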

So what I'm going to do is I will highlight this column.

If you go to Edit, Copy With Column Names,

and then I'm going to go to our original data table.

We're going to make a new column here

and say Edit, Paste With Column Names.

There it is.

I should have done both of those at the same time, but I didn't.

So we're going to

do this one as well.

Okay, so now we are ready to do our modeling.

So the first thing I want to show you

is what we would get if we just used pass/fail, our binary response.

What we'll do is go to Analyze, Fit Model.

Because I made this design in JMP in the Custom Design platform,

it automatically knows what kind of design this is

so that's why my model is already built.

There is a really convenient way

if you know this is a response surface design.

Let's say this wasn't here.

The macros are convenient.

If I highlight ingredients A, B and C

and say Macros, Response Surface,

it pulls it all up. It already knows what I'm looking for.

So that's helpful.

I put the response variable, chunky pour pass/fail, in Y.

What it gives us is nominal logistic.

I'm not a statistician,

so I'm not going to go into any of the statistics behind what it's doing.

I'm just going to show you what you get out of it

and what a scientist might be looking at.

Before I say Run, our target level is pass,

so when it gives probabilities, it's the probability of passing.
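As background on where those probabilities come from: logistic regression models the log-odds of passing as a linear function of the model terms, and the logistic function converts that value into a probability. A minimal Python sketch follows; the function name and the log-odds values are illustrative assumptions, not output from the JMP model in the talk.

```python
import math

def pass_probability(log_odds):
    """Logistic function: map a log-odds value to a probability of passing."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Made-up log-odds values, purely for illustration.
for log_odds in (-2.0, 0.0, 2.0):
    print(log_odds, round(pass_probability(log_odds), 3))
```

This is why the profiler output is a probability between 0 and 1 rather than a measure of how good or bad a given sample is.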

So we say run.

This is what we get.

So, I mean, the first thing that a scientist like

myself would probably look at is this Effect Summary.

I'm probably looking at p-values, and I say,

well, nothing is significant except ingredient A.

There are other things that we would look at, but I'm going to...

I'm going to skip over that.

We're not going to cover that today.

Instead, I want to just look at the profiler,

because, at least in our experience,

the profiler is the most useful and easiest to interpret

for the scientists and when they're communicating with others.

So what this is, and I'm going to make it a little bigger,

is on the left here

we get a probability of failing and a probability of passing.

So if we have 0.13 of ingredient A, 0.12 of ingredient B,

and 0.45 of ingredient C,

and it's actually 0.13%, 0.12%, 0.45%,

I just didn't change it.

It's a very, very small proportion of the formula that we're changing.

Anyway, at those levels,

this says 100% of the time we're going to pass.

If I move it up, let's say, to...

have, like,

say 0.2 of this ingredient,

now it looks like we're going to pass only 64% of the time.

You can see these curves:

if I change ingredient B a little bit, and ingredient C,

maybe we can get back up to a point where we pass 98% of the time.

You can play around with this.

But the problem with this is, like I said earlier,

passing:

maybe this pass right here is not the same as passing over here.

However, we don't really know that with this information,

and it's kind of a hard thing for some people to wrap their heads around

when it's just a probability of passing.

What do I do if all I can get is an 85% pass rate?

Like, let's say hypothetically, this was the best we could do.

What do I do with that?

So that's why we're looking at continuous responses.

I'm just going to close this and we're going to do that,

build that model again, except let's do it for the mean

of our continuous scale.

So we're going to have to remove chunky pour

and we're going to add the average here.

We're just going to say run. Keep it simple.

Do the Effect Screening report.

Now you can see there's a lot more information going on

that we didn't get before.

So before, if you remember,

all we saw was that ingredient A had a really low p-value.

Everything else was like 0.99.

The conclusion was ingredient A does everything.

Well, that's not actually the whole truth, as we can see here.

Yes, ingredient A is the most important:

the main effect of ingredient A, right here.

But B and C also have a role to play.

While not as big, it's still an important role.

So we look at our actual by predicted plot.

It looks pretty healthy.

Our lack of fit looks good.

I'm not going to go into all the details of everything that we look at,

mainly because I'm not a statistician.

That's just what I look at. I'll look at the lack of fit.

I'll look at the residuals to see if

there's anything weird, the studentized residuals.

Then really, I come to the profiler

and now you can see this gives us a much different picture,

much more complete picture, where as I increase ingredient A,

the chunky pour increases, but increasing these ones does too.

So they also have a role to play.

If we were to

say that we want to minimize it, I think it's pretty obvious what the...

desirability is going to come out to be.

But just to show you,

you go to the red triangle by the Prediction Profiler,

Optimization and Desirability, and we're going to turn on the desirability functions.

Then here, this is the desirability.

I find it useful.

You can change it in the red triangle,

but I find it easier if you just hit control and then click on it.

Now we can change what our goal is.

So in this case, we want to minimize this because we don't want it, right?

We don't like chunky pour.

Consumers don't like it either.

So we're just going to say minimize and OK.

Now we can go back to that Optimization and Desirability

and say Maximize Desirability.

It did what I thought it was going to do.

It says, take these two ingredients out.

Put this one as low as you can.

You'll get the lowest chunky pour that you can.
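The idea behind that optimization can be sketched in Python under some loud assumptions: the model coefficients and candidate levels below are made up, and the desirability function is a simple linear "smaller is better" form rather than the smoother function JMP's profiler uses. The sketch scores each candidate formula by desirability and keeps the best one.

```python
from itertools import product

def predicted_chunky_pour(a, b, c):
    """Hypothetical fitted model; the coefficients are made up for illustration."""
    return 1.0 + 30.0 * a + 8.0 * b + 4.0 * c + 20.0 * a * b

def desirability_minimize(y, low=1.0, high=10.0):
    """Linear 'smaller is better' desirability: 1 at or below low, 0 at or above high."""
    if y <= low:
        return 1.0
    if y >= high:
        return 0.0
    return (high - y) / (high - low)

# Made-up candidate levels for each ingredient (percent of formula).
levels_a = [0.05, 0.10, 0.15, 0.20]
levels_b = [0.00, 0.06, 0.12]
levels_c = [0.00, 0.25, 0.45]

# Pick the combination of levels with the highest desirability.
best = max(
    product(levels_a, levels_b, levels_c),
    key=lambda s: desirability_minimize(predicted_chunky_pour(*s)),
)
print(best, round(desirability_minimize(predicted_chunky_pour(*best)), 3))
```

Because every made-up coefficient is positive, the search lands on the lowest level of each ingredient, which mirrors the result described in the talk. Real constraints on the formula would be handled by restricting the candidate levels.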

In reality, we had some other constraints, so we couldn't do that.

There were other factors at play, but this definitely gave us

a really good idea of where we needed to go,

what was important and how do we control this chunky pour

to the point where when we implemented the changes, the complaints went away.

It's been good ever since.

That, in a nutshell, is

how you could take something that is hard to measure.

It's really subjective.

It's binary, so pass/fail or good/bad,

and you can convert it into something that's continuous.

It's a relatively simple method.

You can use it for a number of things.

As long as you have people available to help you out,

you can measure a lot of things that could be considered hard to measure.

Where do we go from here?

Just to give you an example of some other things that we've encountered at Hood:

this one, the chunky pour, is actually one that's a little easier to do.

But this is another product we were working on a long time ago.

Let's say you have coffee and you're going to add some foam to it,

and you want to understand how well that foam dissipates into the coffee.

That's a tough thing to measure.

We definitely don't have any instrumentation

that can really measure it.

Videos really helped us to understand how we could measure it

and get some useful information out of...

As you can see, we're trying to measure how that looks.

How well does it move? That one versus, let's say, this treatment over here.

You can see they're quite different.

One moves really fast; the other moves really slow.

This one looks kind of chunky; the other one didn't so much.

That's how we use it. We use it quite often.

I appreciate you taking the time to listen to my talk.

I hope that this has been useful

and that you'll be able to find a way to implement it

in your day-to-day work.

Thank you.