
How to Design and Analyse Experiments with Pass/Fail Responses

It is not unusual for individuals unfamiliar with how to properly create and analyse pass/fail experiments to treat the data as if they were continuous. Incorrectly assuming binary responses can be handled this way can lead to disastrous results. Without an underlying model consistent with these types of data, it is easy to create a design with too few or too many runs, because there is no way to properly estimate the sample size needed to achieve a certain power. Analysing the results as if they were continuous ignores the fundamental nature of the data and its effect on the model assumptions, and can produce unrealistic results, such as probabilities below zero or above one.

This session focuses on designing, evaluating, and analysing experiments for binary responses such as pass/fail. Using two of the most common functions for modelling binary data, the logit and probit, various functionality in JMP and JMP Pro is illustrated, including using simulation to estimate power and building prediction formulas. Analysis options for fitting these types of models will also be explored.

 

Okay, so let's talk about designing and analyzing experiments when we have a pass/fail response. I'm going to talk about pass/fail responses throughout this session; however, this really applies to any case where we're dealing with a binary response, one where each observation falls into one of two categories. Pass/fail, go/no-go, yes/no: as long as that response has two mutually exclusive and exhaustive categories, it qualifies. We'll break the session up into four parts. We'll start with an example that will hopefully look familiar.

Hopefully, it'll be something you might have seen in the past in terms of dealing with an experiment with pass/fail responses, and we'll get into a few things that can go wrong. After that, we're going to talk about the model, because everything depends on the underlying assumptions I'm making about the model, and that drives the direction I take both in the analysis I pick and in sizing the experiment. Once we're through with the PowerPoint slides, we're going to see how I would actually analyze the experimental data, the example data that I'm going to present.

We'll talk about some of the options that are available in both JMP and JMP Pro. Then finally, we'll wrap things up with how to size an experiment. As it turns out, with pass/fail experiments, it's not just the number of experimental runs, it's the number of trials per experimental condition that matters. We'll talk about how we set up a simulation in JMP Pro that will allow us to determine whether a given number of trials is good enough. Let's talk about our example: we're making widgets, and we have a process where we're not particularly happy with the failure rate.

We've identified four different factors that we think are going to help us improve our failure rate for these widgets. We've got an additive, and we've got a compression step where there are three factors we're going to vary: temperature, pressure, and hold time. The question is, does the widget have a defect? We process the widget; yes, it has a defect, or no, it doesn't have a defect. If it has a defect, we're calling it a failure. The current defect rate is 15%, and we're going to use that in the end to size our experiment.

Our goal is to bring the failure rate down to five percent. However, we would be happy if we were able to identify a difference of about three to five percent. If we were able to detect that our defect rate goes down to, let's say, ten percent, then we would be happy; we would want to be able to detect that. The other thing to know is that each run takes about 10 minutes, so I'm not unlimited in how many runs I can do. I can do about six runs per hour.

All right, so let's say we put these factors into JMP. We open up the custom designer and decide we want a response surface design. We're going to add three center points, and this is the design that we come up with. So far so good. We take the design, we send it to the operator who's going to run the experiment, and they return this. We've got 24 runs, about 4 hours to run this experiment. We're told that those ones and zeros represent an observation with a defect or an observation without a defect. Now I go to analyze the data; I put it into JMP.

Now, I know that those ones and zeros aren't continuous data. It's really categorical data; it's binary data. I'm going to treat it as such. Maybe I know enough that what I'm dealing with is called logistic regression. Because I designed this experiment in JMP, the response surface model comes over in my model construction dialog box. I go to fit the analysis, and right off the bat, I should be worried: next to those parameter estimates, I get a message saying that the parameter estimates are unstable.

Let's say I talk this over with somebody who's a little bit more knowledgeable about these types of experiments, and they tell me, well, the problem is you only did one trial per set of treatment conditions. You really need to do more than one to be able to get a good estimate. Let's say I go back and talk this over with the operator, who doesn't like the idea of bumping up the size of my experiment from one trial to five trials per condition; now I've got an experiment that's going to take about 20 hours. But I convince them that we really need to do this, and we decide to rerun the experiment with five trials for each set of experimental conditions.

I get my data back, and I've got two new columns that came back with my experiment. I've got the Y5 column that corresponds to the number of times a failure was seen, and I've got the column with the N in it that corresponds to the number of trials for a set of experimental conditions. Now, this time, rather than working with the count data, I think maybe what I'll do is divide the number of failed trials by the total number of trials and treat that as continuous data. Here I've got the proportion of failures for a given set of experimental conditions.

I'm going to treat that as continuous data. It makes me more comfortable because I'm more familiar with modeling data this way. It's the same model that we saw last time; however, rather than using logistic regression, I'm just going to use standard least squares. All right, so far so good. I use my favorite modeling technique, maybe my favorite model reduction technique, and I come up with this final model. Again, so far so good. The next thing I want to do is optimize.

I want to know what conditions are going to drive that error rate down as far as possible. I use the optimizer within the profiler, and it tells me to set the hold time at 45 seconds and the temperature at 45 degrees C, and that should give me a predicted error rate below zero. I might be willing to dismiss this because it's close enough to zero to say, well, maybe it's just really close to zero, not below zero. But I know that technically this is not possible; I cannot have error rates below zero. Even though I might run with this, I might ask myself, well, how accurate is that estimate?

How close am I really to zero? Have I built a bad model? Everything gets back to the model, the assumptions I've made when I treat the data as continuous, versus the assumptions I should probably be making when I've got pass/fail data. This is our typical model. On the left-hand side, we've got our response, our observations; in our case, it's the probability of a failure.

On the right-hand side, we've got our linear model, the relationship we're trying to build between our inputs and our outputs. This is usually what we consider. The typical assumptions in this case are that the relationship is linear, that my response range is unbounded, and that my residuals are normal. My residuals are whatever is left over after I fit a model to the data. I assume they're normally distributed, which means they're symmetrical around zero and have all these nice properties. That's part of the problem.
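
In symbols, the usual linear-model setup being described here is (my notation, not the slide's):

$$y_i = \beta_0 + \sum_j \beta_j x_{ij} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2),$$

with the response $y_i$ free to take any value and the errors $\varepsilon_i$ symmetric around zero.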

If we're dealing with probabilities, our response really looks something like this, where I've got more of an S-shape than a linear relationship. That linear relationship really doesn't hold. The unbounded response range doesn't hold either, because since I'm dealing with probabilities, I can't have anything below zero or above one. Even the final point, the assumption of normal errors, doesn't work. If you think about it, as my predicted values get closer and closer to zero, my residuals have less and less space to operate; the gap between my fitted value and zero starts to shrink.
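
One way to write down that squeeze (my notation): with an observed proportion $y \in [0, 1]$ and a fitted probability $\hat{p}$,

$$y - \hat{p} \in [-\hat{p},\; 1 - \hat{p}],$$

so as $\hat{p}$ approaches zero, the residual is pinned just below zero on one side while remaining free on the other, and a symmetric (normal) error distribution becomes impossible.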

I cannot have symmetrical errors close to zero, and likewise, I can't have symmetrical errors as I get closer and closer to one. Something has got to give in this case. Now, I would hope there would be a simple fix to this problem: a transformation, perhaps, and maybe the use of a different distribution. Can we do this in JMP? The answer is we've got three different options. The first option, which we've already seen in the example, is logistic regression. That is using the Logit transformation on the data.

In this case, we're not looking at the probability itself as the response; we're looking at the ratio of our two different response levels, the probability of a failure divided by the probability of a pass, and we're taking the log of that. We take that log to make the relationship linear. Another way to look at this, since we've only got two levels, is as the ratio of the probability of failure to 1 minus the probability of failure. Now, if you've ever used these kinds of models before and heard the terms odds and log odds, this is exactly the definition.

That very first part of the equation is the definition of what odds are: the ratio of one level to the other level. That's my odds, and then I'm just taking the log of that. That's one option. Another option is to use what's called probit analysis. With probit analysis, I'm just using a different transformation; in this case, I am using the inverse of the normal distribution. If you've ever built this in the formula editor or using JSL, that's just the Normal Quantile function.
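
In symbols, the two transformations just described, with $p$ the probability of a failure, are:

$$\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right), \qquad \operatorname{probit}(p) = \Phi^{-1}(p),$$

where $p/(1-p)$ is the odds and $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function (the Normal Quantile function in JMP's formula editor).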

Now, there is a third option that's possible in JMP. It takes a little bit of work, and we're not going to go into a whole lot of detail on its use, but I do want to present it because you see it a lot in the literature, and that's the complementary log-log function. It's not available directly in JMP; you need to use something like the nonlinear platform to build it yourself. It can also be approximated with a distribution called the smallest extreme value distribution, but we'll have to leave that for another day. Really, we're going to focus on the stuff that's available out of the box in JMP.
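
For completeness, the complementary log-log transformation is

$$\operatorname{cloglog}(p) = \ln\!\bigl(-\ln(1-p)\bigr),$$

which, unlike the logit and probit, is not symmetric around $p = 0.5$.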

Now, there's a second big assumption we need to make here, and that is, whichever transformation we pick, Logit or Probit, we're going to assume that the error has a binomial distribution. It's a two-part assumption: not only am I doing the transformation, I'm also assuming a different distribution. As I mentioned, the first two are available directly in platforms in JMP, so they're easy, and we'll focus on those. The last one can be modeled using the nonlinear platform. The nice thing is that everything is available from the Fit Model platform, where I've got three different options within JMP.
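
Putting the two parts together, the assumed model is a binomial generalized linear model (again, my notation):

$$Y_i \sim \operatorname{Binomial}(n_i, p_i), \qquad g(p_i) = \beta_0 + \sum_j \beta_j x_{ij},$$

where $Y_i$ is the number of failures in $n_i$ trials at the $i$-th set of conditions and $g$ is the logit or probit link.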

I can use logistic regression, which, again, we've seen in the example. There's an option called generalized linear models, and there's a strong relationship between nominal logistic regression and generalized linear models. Then, for JMP Pro users, we have generalized regression, which gives me a whole bunch of other options as well. Now, there is another platform where you can do simple logistic regression, and that's Fit Y by X. But because we're dealing with an experiment, and we usually have more than one factor, I'm not going to talk about that. For simple logistic regression, though, Fit Y by X works as well.

Before we get into how to implement this in JMP, we need to talk about data organization, because once we know how to organize the data, it's easy to set it up in the Fit Model dialog box. Let me give you a quick, simple example. Let's say we have a two-factor experiment, time and temperature. We've done a full factorial, so we have all possible combinations of time and temperature, and let's say we plan to do five trials per experimental treatment.

By treatment, I'm talking about a row, our unique set of combinations, a row in this data table. I've got three different ways I can set up the data. The first way is just as raw data: every row in my data table corresponds to one trial. You'll notice that the very first row in the raw data is treatment combination one, with a given time and temperature. In this case, my binary response is whether my response is green or red, and that first response was green. The second row is treatment combination one again, and that response was red, and so on.

That's the raw format, and all three of those approaches can use it. Now, sometimes it makes more sense to summarize the data. When I summarize the data, I am summarizing over both the treatment condition and the response level. You'll notice that in the first row of the summarized stacked format, I've got treatment condition one, the green level, and two cases where I saw green. The second row is my second treatment combination, three observations of green, and so on. I've just aggregated over my treatment combinations and my response levels.

All right, all three of the approaches can use that summarized stacked format. The third organization, and probably the most natural for a lot of people because it looks a lot like the organization I would use for a linear model, is the summarized split format. In this case, I've aggregated, again, over treatment combination, but only for one of my response levels. You'll notice that I've counted the number of times I saw a green observation for each treatment, and I've also counted the number of trials. If I were to build a DOE using the custom designer, this is the format I would expect to use if I were inputting data; a small sketch of all three layouts follows below.
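
For reference, here's a minimal sketch, outside JMP, of the same three layouts using Python and pandas; the factor settings, response levels, and column names are made up for illustration.

```python
# Sketch of the three data organizations for binary-response experiments,
# using a hypothetical two-factor experiment with five trials per treatment.
import pandas as pd

# Raw format: one row per trial.
raw = pd.DataFrame({
    "treatment": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "time":      [10, 10, 10, 10, 10, 20, 20, 20, 20, 20],
    "temp":      [100, 100, 100, 100, 100, 100, 100, 100, 100, 100],
    "response":  ["green", "red", "green", "red", "red",
                  "green", "green", "green", "red", "red"],
})

# Summarized stacked format: one row per treatment x response level,
# with a count column.
stacked = (raw.groupby(["treatment", "time", "temp", "response"])
              .size().reset_index(name="count"))

# Summarized split format: one row per treatment, with the count of one
# response level and the total number of trials.
split = (raw.assign(green=(raw["response"] == "green").astype(int))
            .groupby(["treatment", "time", "temp"])
            .agg(n_green=("green", "sum"), n_trials=("green", "size"))
            .reset_index())

print(stacked)
print(split)
```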

If I wanted to use the summarized stacked format instead, I would have to duplicate these runs, one row for green and one for red, and just reorganize things a bit. Now, logistic regression cannot use the summarized split format; I've got to use one of the other two formats. Generalized linear models and generalized regression can use this format. That's something to note if you plan on using logistic regression. Okay, so enough with the PowerPoint. Let me go into JMP, and let's talk about how I would analyze each one of those examples with the experimental data that we saw.

Okay, let's do logistic regression. We'll start with the first example. Notice that I've got the summarized stacked format here: I've got a column that has my passes and my fails in it, and because this is aggregated data, I've got the count in another column. All right, so pretty straightforward. Everything we're going to talk about is going to be under Fit Model. Because I designed this in JMP, the model comes along with the table.

This is my response surface model from when I designed the experiment; it's already pre-populated in my dialog box here. All I want to point out is that my response in this case is nominal: that's pass/fail. I also need those counts in there because I've aggregated my data. I'm going to pick my target level, and this really depends on the orientation in which I want to see my output, whether from the pass orientation or the fail orientation. Let me set this back to pass. Now, I've got a couple of different choices here.

Both of them are logistic regression. I've got nominal logistic regression, and if I know a little bit about stepwise regression, I can also perform stepwise regression on my logistic regression model. That's an option as well. All right, so let's just stick with the original nominal logistic. I'm going to run my model. Again, I'm going to use some model reduction technique; I might just go in there and manually remove terms from the model, or I could use stepwise regression for this.

It's really not that important here because I got the right model to begin with. Let me remove one more term from my model. There we go. Let's say this is our final model, and we're happy with it. Let's go into our Profiler. Let me scroll down to the bottom to the Profiler and make this a little bit bigger so we can see better. Now, one of the benefits of using this model is that my predicted probabilities of pass and fail are bounded; we have to have probabilities between zero and one. Let me turn on my Desirability Functions. Now, this is going to look a little bit different because it really is split into two groups.

I've got the proportion of passes I want to target and the proportion of fails. I'm just going to drop the desirability on failures to zero because I don't want to optimize for any failures whatsoever. I'll increase the desirability on my passes, and let's go ahead and maximize desirability. It tells me, again, the same hold time and temperature; however, I get a better estimate of what that predicted failure rate is, or the predicted pass rate in this case. This is telling me that with those settings I should have about a 96% pass rate, so approximately a four percent failure rate.

Certainly not below zero, and a little bit further away from zero than I might have expected. Okay, so let's move on to generalized linear models, and we're going to use the summarized split format. Again, if you've used linear models before, this format should look familiar. These are the observations that came from my designed experiment: I've counted the number of observations in one of my levels, in this case the number of failing items, and then I've got the total number of trials.

All right, again, I'm going to go into my Fit Model dialog box. This is set up a little bit differently. For this particular organization, I have two continuous variables: the first has to be the count of my individual level, and the second has to be my total count. I've already defaulted here to Generalized Linear Model and picked my distribution; in this case, it's defaulted to Binomial. Here's where I have the option of choosing Logit or Probit. We'd get slight differences in the probabilities; let's just go ahead with Logit. Now, by default, the Logit and logistic regression are going to give me the same results.

What makes generalized linear models a bit of a benefit is that I have the ability to relax some of the assumptions we saw earlier. One of those assumptions is that the error term follows a binomial distribution. With binomial errors, the variance is completely determined by the probability; there's no separate variance parameter. Turning on overdispersion says, well, the variability might be a little bit bigger than the binomial distribution expects it to be. It relaxes that assumption of binomial errors a bit.
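
As a rough analogue outside JMP, here's a minimal sketch in Python with statsmodels of fitting a binomial GLM with a logit link, with and without a Pearson chi-square overdispersion adjustment; the factor settings and counts are made up.

```python
# Binomial GLM with and without an overdispersion adjustment.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "temp":     [40, 40, 50, 50, 45, 45],
    "hold":     [30, 60, 30, 60, 45, 45],
    "failures": [2, 1, 3, 0, 1, 2],
    "trials":   [5, 5, 5, 5, 5, 5],
})

X = sm.add_constant(df[["temp", "hold"]])
# Binomial endog as (event count, non-event count) pairs.
y = np.column_stack([df["failures"], df["trials"] - df["failures"]])

# Plain binomial fit: dispersion fixed at 1.
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()

# Overdispersion-adjusted fit: estimate the scale from the Pearson
# chi-square, which inflates standard errors (and p-values) when the
# data are more variable than the binomial allows.
fit_od = sm.GLM(y, X, family=sm.families.Binomial()).fit(scale="X2")

print(fit.bse)     # standard errors under strict binomial errors
print(fit_od.bse)  # larger standard errors after the adjustment
```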

I can also adjust my estimates for potential bias. I'm just going to turn on overdispersion and, again, run my model. Now, one of the disadvantages of this approach is that you don't have the stepwise ability here. You need to either do the reduction manually or have some model in mind before you decide which final model you want to work with. Again, I'm just going to do this manually and come up with my three important factors. You'll notice, if you compare this to the logistic regression, that because I had overdispersion turned on, my estimated variability is a little bit larger.

My P-values are a little bit larger. Here I need to ask myself whether terms that are no longer significant are worth keeping in the model, and so on. But the nice thing is that I've got the option of relaxing that particular assumption in the model. Again, if I want to turn on my Profiler and do my optimization, that's available as well, with the same nice abilities. In this case, the organization is a little bit different because I'm not modeling pass/fail levels, I'm modeling a probability; it's the profiler we're more accustomed to seeing. Obviously, in this case, I don't want to maximize, because I would be maximizing my failures. I'm going to minimize. Let's go ahead and optimize.

Again, it's telling me hold time, 45 seconds; temperature, 45 degrees. It's going to give me a slightly different estimate, although they're very close. Again, the nice thing is that I am bounded here: it doesn't matter where that slider goes, I am bounded between zero and one with those probabilities. All right, so for the final example, let's talk about generalized regression. Now, this is a JMP Pro only feature, and it gives me a couple of additional options. It's the same place to start, Fit Model, and the same setup, because I've got my summarized split data.

However, I am going to pick generalized regression. Now, I am going to pick a distribution. In this case, I know that my response, because it's set up with two variables here, is binomial, or I can pick beta-binomial. Beta-binomial is an analogous technique to using that overdispersion option in my generalized linear model; it allows me to account for a variance that changes as a function of where the probability is. We'll stick with the default in this case. Let's go ahead and run the model.
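
For reference (my notation, which may not match JMP's report), the beta-binomial keeps the binomial mean but inflates the variance with an overdispersion parameter $\rho$:

$$\operatorname{Var}(Y_i) = n_i p_i (1 - p_i)\,\bigl[1 + (n_i - 1)\rho\bigr],$$

which collapses back to the plain binomial when $\rho = 0$.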

If you've used generalized regression before, you know that all of your model reduction techniques are available from this platform, and what I would probably do in this case is use one of them. Let me make sure I'm going to enforce heredity, and let's use forward stepwise regression. If you've never seen generalized regression before, there are a whole bunch of nice options here in terms of model reduction techniques. I'll use forward stepwise, which is the default. Click Go, and here's my final model.

Again, very consistent: all three of these approaches are showing me very consistent results. The parameter estimates are going to be slightly different, and the probabilities are going to be slightly different. Okay, so now the question is, have I run enough trials to really make this experiment worthwhile? That gets us into how to right-size the experiment. Let me switch back to my PowerPoint slide deck for a couple of short slides on sizing the experiment.

The way we're going to size the experiment, by the way, relies on a feature that's only available out of the box in JMP Pro. For non-JMP Pro users, you might be able to write a JSL script for this, but it would be very cumbersome; JMP Pro makes this extremely easy to set up. We're going to start with the custom designer and build our intended design. Before we generate that final table, we're going to turn on Simulate Responses.

That's in the hotspot in the upper left-hand corner. Once we create the data table, we get a dialog box where we can change the coefficients, and on the next slide, I'll talk about how we set what those coefficients are. We're going to change those coefficients to our desired magnitudes; again, I'll go into more detail on the next slide. Once we change the coefficients, we're going to set the error to be binomial, select the sample size, the number of trials that we want per condition, and then just click Apply.

What the Simulate Responses option gives me, because I picked binomial errors, is two columns: one with the number of simulated successes and the other with the total number of trials. That's what we're going to use to do our overall simulation. Let's talk about how we set these coefficients, because this is a crucial step in sizing the experiment. How we set these coefficients is going to depend on the underlying model, that is, whether I pick a Logit or a Probit model.

My example is going to use the data we just saw, with the Logit model. That's going to determine how we set those values, starting with the baseline probability. If you remember from our example, we said that at baseline we've got about a 15% failure rate; that's going to be our baseline. Then we have to ask, well, at what probability do I want to be able to see a difference? Again, recall we said that if we were able to see a change to 10%, we would be happy. That's the sizing we want on our experiment: the ability to see a change from 15% down to 10%.

What we're going to do is use the transformation we saw earlier to calculate values at our baseline and at the probability we want to detect, 10%. We're going to take the difference of those two values and use that as our coefficient. If we're using the Logit, the baseline is just the log of 15% divided by 1 minus 15%, which is about -1.73. If we're interested in seeing a change to a 10% failure rate, I plug that into the equation: the log of 0.1 divided by 1 minus 0.1, which gives me about -2.2. That makes the coefficient just the difference between those two values, so I'm going to use approximately 0.5 as my coefficients.
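
As a quick check, here's a minimal sketch of that arithmetic in Python (the function name is mine):

```python
# Translate baseline and target failure rates into a logit-scale
# coefficient for the simulation, using the rates from the example.
import math

def logit(p):
    return math.log(p / (1 - p))

baseline = logit(0.15)        # about -1.73, used as the simulated intercept
target   = logit(0.10)        # about -2.20
coef     = target - baseline  # about -0.47; magnitude ~0.5 used for each effect
print(baseline, target, coef)
```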

Enough PowerPoint. Let's see how that's done in practice. I have my experimental factors up, and I'm just going to use the custom designer. Let me load my factors and build the design I want. If you recall, I said I have a response surface design; let's add three center points, and let's do this in 24 runs. I'll click Make Design and let the algorithm crank through and determine the best design. Before I generate that very last table, I'm going to go to the hotspot and select Simulate Responses. Now I'm going to create my table. That table gives me my design, and I'm going to change my coefficients to the values that I calculated.

Let me reset those. I said my intercept was -1.73, and I'm going to set each of my coefficients to 0.5; let me copy this so I can paste it. Finally, I'm going to set my error distribution to Binomial. Now, I'm going to click the Apply button; keep an eye on the table, because that's going to generate the two columns I'll use for my analysis. Let me back up one second: I need to set my sample size, and I can't forget to do that. If you recall, in the original experiment I had five trials per condition that second time I ran it. I'll reapply, and now I've got the correct formulas in there.

These are the columns I'm going to use for the initial analysis and for doing the simulation. Now, if you wanted to, you could actually sign these effects. By signing, I mean making them negative or positive, depending on the relationship you expect to see. For the sake of time, I haven't talked about that; I have a slide that goes over how I might go through the different effects and decide whether each should be positive or negative. That can be done as well. Certainly, it will affect the results to a lesser extent than the actual magnitudes of the effect sizes will. But inasmuch as most designs, unless they're completely orthogonal, have some correlation among the effects, whether an effect is positive or negative might have some influence on my power values.

Now that I've generated my initial design, I'm going to go into Fit Model and analyze the design. I've got to make sure to put in my simulated successes and trials columns. In this particular situation, I will typically use generalized regression, and again Binomial, because of all of those model reduction options. The nice thing is that with the simulation, those all carry through; I don't have to worry about writing any scripts. Let's go ahead and run that. Again, I'm going to use forward selection and click Go. Let me scroll down to my forward selection model; it's right here.

Now that I've got my single observation, what I can do is right-click. I will typically hover over one of the probability values, right-click, and at the very bottom there's an option, Simulate. I have to make sure that the formula column with the simulated number of successes is selected in both spots, and by default, it should be. I'll specify the number of simulation samples I want. Once I click Go, every iteration of the simulation is going to recalculate that formula column, giving me a new set of simulated successes under my underlying model.
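
Outside JMP, the same idea looks roughly like this; a minimal sketch in Python with statsmodels, where the design matrix, effect sizes, and alpha level are stand-ins for the real custom design:

```python
# Power simulation sketch: draw binomial responses under an assumed
# logit model, refit, and count how often each term comes out significant.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

n_runs, n_trials, n_sims, alpha = 24, 5, 2500, 0.05
X = sm.add_constant(rng.uniform(-1, 1, size=(n_runs, 4)))  # stand-in design
beta = np.array([-1.73, 0.5, 0.5, 0.5, 0.5])  # logit-scale intercept and effects

p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # true failure probability per run

rejections = np.zeros(len(beta))
for _ in range(n_sims):
    fails = rng.binomial(n_trials, p)               # simulated failure counts
    y = np.column_stack([fails, n_trials - fails])  # (events, non-events)
    fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
    rejections += fit.pvalues < alpha

print("estimated power per term:", rejections / n_sims)
```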

It's going to do that for 2,500 runs and then aggregate the data into a nice report. Now, rather than spending time watching that crank away (it's quick, usually only about half a minute to a minute), I've run it already, and this is what the results look like. Here I've got my 2,500 runs, less that very first initial run. I've added some additional formulas here, which I'll be happy to share, but these are the two scripts that get added when this report is generated. The Power Analysis is just a subset of the distribution; let's take a look at it.

What this does is show me, for all of those simulation runs, the distribution of the P-values. If you recall, in the underlying model everything was significant. This tells me the distribution of the P-values I saw through my simulation; my predicted value is about 1 out of 10,000, so that's good. I also get a calculation of an interval around my predicted P-value. Actually, I don't think it's a confidence interval, since we're doing a Bayesian approach.

It's more of a credible interval; maybe it's a confidence interval. Either way, it's going to bound where that probability value is. But an interesting and important piece of information is the number of times I have rejected the null hypothesis. This tells me that if I were to pick an alpha level of 0.01, I would have rejected the null hypothesis about half the time, even though I know that, in fact, I modeled it as a real effect. If I go to 0.05, I've rejected it about three quarters of the time, and so on.

This is the information I use to ask the question, have I sized my experiment correctly? Will I be able to detect my main effects, my two-factor interactions, and any other effects in my model a sufficient proportion of the time to make me happy with the number of runs I've picked? What I often find myself doing in these cases is not only relying on some of the built-in reports, but also counting, for example, the number of correct significant effects.

I'll put in a cutoff value and calculate the number of correct responses. Again, this will all be packaged together with the material, with a little bit of explanation and the formulas for how I would build all of these models. I really need to use my own definition of risk to determine whether I've captured a sufficient number of those significant effects to make the experiment worth running with 5 trials or 10 trials or 20 trials per condition, and so on. All right, so that wraps up what I wanted to show you in JMP. Let's go back to our summary.

To summarize, when I am modeling this data, I'm going to use the Fit Model platform, regardless of which technique I use. I've got logistic regression, in which case I could use either nominal logistic regression or stepwise regression with logistic regression in the background. I've got generalized linear models, which by default give me the same output as logistic regression, but with the additional ability to relax the assumption on my errors and to correct for bias. Then, for JMP Pro users, I've got generalized regression, which allows me to do what I would see in logistic regression or generalized linear models, plus the additional benefit of model reduction as well.

When I size the experiment, I'm going to build it using the custom designer and then use JMP Pro's built-in simulation function to run my simulation-based power analysis. That wraps it up for what I wanted to talk about. There will be additional information in the supplementary slides, which I unfortunately don't have time to go through, but I hope you've enjoyed what you've seen, and hopefully it's something you can use going forward and will find beneficial. Thanks a lot.