dctrindade
Level I

Defect Subpopulation Models (2021-US-30MP-885)

Level: Intermediate

 

David Trindade, Founder and Chief Consultant, STAT-TECH

 

Suppose a small proportion of a population of components is highly susceptible to a certain failure mechanism. Consider these units to be manufacturing defects in a reliability sense: working originally and not detectable as damaged by standard end-of-the-line tests and inspections, yet after a short period of use, they fail. The rest of the components in the population are either not susceptible to this failure mechanism, or else they fail much later in time for possibly some other cause. 

Occasionally, field or life test data indicates this bimodal behavior by showing an early incidence of failures that seem to slow down to almost nothing long before a significant proportion of the population has failed. A good model for this kind of behavior is the defective subpopulation model (DS).

The population of components is really a mixture of two populations: a reliability defective small subpopulation that fails early and the remaining components that may possibly fail later due to wear-out mechanisms that only are seen much further out in time (designated as “immortals”). This model, where a (hopefully small!) proportion of the population follows an early life failure model, has wide applicability. 

This talk first describes how such data can be analyzed using the Life Distribution platform in JMP through the DS options for fitting the data and then demonstrates how to apply a simple test of the hypothesis that there may be a defective subpopulation based on the output of the Life Distribution platform. 

 

 

 

Auto-generated transcript...

 


Speaker

Transcript

David Trindade Hi, I'm Dave Trindade.
  I do private consulting and training, especially in use of JMP, and my company is STAT-TECH. Today we're going to talk about reliability defect models.
  When we're analyzing reliability data, there is a common assumption that all units on stress will eventually fail from a specific failure mechanism. However, how do we treat reliability data that doesn't seem to follow that assumption?
  Let me give you an example over here. Let's say we have a reliability stress test, and we take 100 units and we run it for 1,000 hours. We see 30 failures by 500 hours, but no additional failures occur in the next 500 hours, up to the end of the test.
  So there's a question over here. Suppose we had taken the surviving 70 units and continued them on the stress test beyond the 1,000 hours.
  Would we have seen additional similar failures, or would there have been no failures? So the question we're asking is: are we dealing with two different mixed populations, or is the data just behaving randomly?
  Here's a second example, and this is actual data from a major computer manufacturer's incoming inspection test.
  It's readout type data and the company was investigating gate oxide type failures. Now readout data is data in which units are tested periodically, you know, time 0, 24 hours, 48 hours, 168 hours, 500 hours and 1000 hours.
  The sample sizes on test are shown over here; the test started off with 58,133 units.
  And that continued up through 48 hours, and at 48 hours a whole bunch of units, almost 48,000, were removed from test as censored units.
  Now, censored units are surviving devices that are removed from test following a readout, so then 10,000 units continued on to 168 hours and there was one additional failure.
  And then almost another 8,000 were removed at that point, censored, to allow 2,000 to continue on up to 1,000 hours. In the interval from 168 to
  500 hours there was one additional failure, and then between 500 and 1,000 hours there was another failure. So we ended up censoring 1,998 units at 1,000 hours. The company assumed that the failure distribution was lognormal.
  Okay, so let's say we want to analyze this in JMP, assuming a lognormal type distribution. What we have is multi-censored, interval-type data, because there are many censoring points.
  The JMP data table is created as shown on the right, and it takes a little bit of thought to put it together, but let me start over here. We have two columns, one a start column, and one an end column to reflect the end of the intervals.
  So the interval from zero to 24, we had 201 units, rejects, fail. Now at 24 hours, though,
  we had no censoring, okay, and then we had the second interval from 24 to 48 hours in which 23 additional units failed. But at 48 hours we removed a large portion of the units under the stress, almost 48,000.
  And then 10,000 were allowed to continue on. And that 10,000 between 48-168 hours had one reject and then at 168 hours we had 7,999 units removed, almost 8000 again.
  A censoring point in this data table is reflected by a blank cell, which is what we have over here, so those represent the censoring points.
  And then between 168 and 500 hours, we had one additional failure among the 2,000 units that were left. And then finally, up until 1,000 hours, we had one more additional failure after 500,
  leaving us with 1,998 units that were censored at the end of the test.
  So the important point about entering the data in this kind of a table for multi censored interval type data is that you have to set up a separate row that treats the censored observations, and the endpoint will be represented by a blank cell.
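  (To make that layout concrete, here is a small sketch of the table as a pandas DataFrame. This is a reconstruction from the counts quoted in the talk, not the original JMP file; the 47,909 censored at 48 hours is inferred from the 58,133 starting units minus the 224 failures minus the 10,000 units that continued.)

```python
import pandas as pd

# Interval-censored layout described above: a blank (NaN) "end" marks a censoring row.
# Counts are as quoted in the talk; 47,909 censored at 48 hours is inferred, not quoted.
data = pd.DataFrame({
    "start": [0,   24,  48,    48,  168,  168, 500,  1000],
    "end":   [24,  48,  None,  168, None, 500, 1000, None],
    "freq":  [201, 23,  47909, 1,   7999, 1,   1,    1998],
})
print(data)
```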
  So now let's launch this into JMP. We're going to select analyze, reliability and survival, life distribution and provide the input that's shown. And so we're going to put in the start and end columns, both of them together and then we're going to put in the frequency column.
  The censor code that we're using was zero for censored observations, and then we're going to change it from the Wald method to the likelihood method. I prefer using the likelihood method.
  And what I'm going to do now is I'm going to go into JMP.
  Okay.
  And do the analysis and then come back to the slide. So let me reduce the slide over here.
  Okay, so this is the major manufacturer's reliability data. This is how the data was entered; I talked about that. We're going to come up here to analyze, reliability and survival, life distribution.
  Okay, we're going to put in the start and end columns over here and we're going to put in the frequency. The censor code is zero, and again I'm changing this to the likelihood method. We click OK.
  OK, now the assumption was that the distribution was lognormal. JMP gives us a nonparametric estimate over here.
  And we click on lognormal and we've got the nonparametric scale, the linear scale, over here. And we look at the data and we see, even though we have few data points, it doesn't seem to fit the model very well; very few points are on the line.
  If we go to a probability plot scale over here, again we have the same situation, where this would be the model, but the data points are not falling on the line. The other thing I want you to notice over here are the parameter estimates: the T50
  is 4.8 × 10^30 hours.
  And the scale is about 25, which, for a lognormal distribution, is a very, very large number. JMP also provides you with quantile profilers over here, and let's say
  we're trying to estimate the time to get to a half a percent failures. Based on this model, JMP tells us it's going to take about 535 hours. Well, what about 1% failures? We put in .01
  and now we're up to almost 273,000 hours just to double the failure count. That is a strange situation, so let me go back into the slide presentation now.
  Okay.
  Back in here.
  Okay.
  So this is now reviewing what we just went over. We have the parameter estimates over here, this is our linear scale.
  Okay, this is our probability scale and again these estimates, I want you to take a look at very carefully.
  The T50, the median, and the scale. Well, let's interpret these a little bit. The T50 is 4.8 × 10^30 hours.
  And the sigma is 25. Okay, 4.82 × 10^30 hours is about 5.5 × 10^26 years.
  By comparison, the age of the universe is estimated to be about 1.4 × 10^10 years.
  So to reach the T50 would take far beyond the current age of the universe, which, to me, is a pathological result over here.
  And the lognormal profiler estimates 535 hours to reach a half a percent failures, but almost 300,000 hours to reach 1% failures.
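  (As a quick check on where those profiler numbers come from, here is a sketch of the lognormal quantile arithmetic using the rounded estimates from the slide; it roughly reproduces the 535-hour and ~273,000-hour figures.)

```python
import numpy as np
from scipy.stats import norm

t50, sigma = 4.8e30, 25.0  # rounded lognormal estimates from the slide
for p in (0.005, 0.01):
    # Lognormal quantile: t_p = T50 * exp(sigma * z_p), z_p the standard normal quantile
    t_p = t50 * np.exp(sigma * norm.ppf(p))
    print(f"time to reach {p:.1%} failures ~ {t_p:,.0f} hours")
# Roughly 5e2 hours for 0.5% but a few hundred thousand hours for 1%:
# the huge sigma stretches the projected quantiles absurdly far apart.
```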
  This is not a very satisfactory distribution to work with. Okay, so what is our alternative? Well, our alternative is to think in terms of a defect model.
  And the defect model basically pits mortals versus immortals. In contrast to the usual assumption that all units on stress can eventually fail,
  if a defective subpopulation exists, only the fraction of units that contain the defect may be susceptible to failure.
  And these are called mortals, because they can fail, they have the defect. Now if the units don't have the fatal flaw,
  they cannot fail for that failure mechanism and they're not susceptible to failure for the observed cause. And these are called immortals, immortals can eventually fail, but very, very far out in time and most likely for other reasons.
  So let's consider a defect model now for the data over here, and we're going to make some, I think, reasonable assumptions.
  Looking at this incoming inspection data, we see that, you know, the rejects occur at 201, 23 and then, after that, even though it was on a reduced sample size, there were very few failures.
  So what we're going to do is make an assumption that nearly 99% of the mortal failures occurred by 48 hours.
  Okay, and that implies that the number of
  defects in the original sample of 58,000 is about 227: just divide the reject count that we saw by .99.
  So practically 100% of the mortal failures have occurred by 168 hours, and we're going to make the assumption that any failures thereafter are not likely related to the defective subpopulation and could, for example, be handling induced.
  So let's go now into setting up a separate column in the JMP table
  that incorporates this idea of having a defective subpopulation. So a possible model would put the mortal (defective) fraction, as I said, at 227
  out of about 58,000, or about .39%; so less than about .4% of the units in this original sample had defects in them.
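  (The arithmetic behind that figure, as a back-of-the-envelope sketch rather than JMP output:)

```python
observed_by_48h = 201 + 23            # failures in the first two readouts = 224
est_mortals = observed_by_48h / 0.99  # assume ~99% of mortals fail by 48 h -> ~226;
                                      # allowing ~2 mortals among the ~48,000 units
                                      # censored at 48 h gives the ~227 quoted in the talk
frac_mortal = 227 / 58133             # ~0.0039, i.e. about 0.39% of the starting sample
print(round(est_mortals), f"{frac_mortal:.2%}")
```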
  And now I'm going to add a Mortals column to the data table as shown, so I just put in the rejects that we observed.
  And I make a slight assumption over here that among the units that were censored, this very large group of about
  48,000, about two of them were mortals that were removed. And then we put in the other failure that occurred at 168 hours and treat it as a reject, but from then on there are zero failures occurring. So let's analyze this now in JMP.
  Okay, and I've set up a separate column over here with the mortals frequency, the one that I just showed you. So now we're going to come up here to analyze,
  reliability and survival, life distribution. Okay, we're going to pull in the start end reject columns but now we're going to use the mortals frequency for the frequency in there. Censor code is still zero, and again I like likelihood.
  We click OK, and voila, we get this.
  It's only two data points but OK, and I'm going to click the lognormal scale for this.
  But now, when we take a look at our T50
  and our scale parameter, we see the T50 now is about 10.6 hours.
  Okay, and the scale is about .6. Looking at our quantile profiler, it tells us that seeing about a half a percent failures takes about two hours, and seeing about 1% failures,
  okay, takes a little bit over two hours. These are far more reasonable values, and this is one of the benefits of having a defect model: by making the assumption, you can get results that make more sense. So let's go back now into our
  PowerPoint presentation.
  Okay, so this just captures what we've done to life distribution, so you can, if you want to spend time to go through this yourself, you can go through this. I will provide the JMP data files
  on the JMP Community board. Okay, so we ended up now with a T50 of about 10.6 hours, compared to something that was, you know, exorbitant before,
  and a sigma of .68, which are very reasonable values for our estimates. So what this example is showing you, basically, is that if you don't consider mortals versus immortals in the analysis, you can get,
  you know, strongly biased results, an incorrect assumption can strongly affect the results, and if you're projecting field reliability
  the results again can be highly biased unless the existence, the possible existence of a limited number of the defective units is recognized and taken into consideration.
  How do you spot a defective subpopulation? Well, one thing that's very easy to do is a graphical analysis.
  Let's assume that a specified failure mode follows a lognormal distribution. We plot the data using a lognormal probability plot. If, instead of following a straight line,
  the points seem to curve away and flatten out along the cumulative percent axis, it's a signal that a defective subpopulation may be present.
  And if the test is run long enough, we expect the plot to bend over asymptotically to the cumulative percent line that represents
  the proportion of defects in the sample. So what I mean by this is here's a probability plot, with all the...assuming that everything can fail.
  And at some point you hit a...basically a stop point that, no matter how much time you increase on the units that are on stress, you get no more additional failures, and that point represents the cumulative percent defects in the model.
  Okay, now if we just look at the mortal subpopulation and plot that data, now we get a straight line fit that we would expect that would support a fitted distribution model.
  Okay, so what is the model, actually, that we're talking about? What we're observing is basically the CDF of the mortals multiplied by p, the fraction of mortals within the population.
  For example, if you had 25% mortals in the population and the mortal CDF at that time is 40% failures,
  then what you would expect to observe in the total sample at time t would be about 10% failures (.25 × .4).
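  (In symbols, the observed population CDF is F_pop(t) = p × F_mortal(t); the example above is just:)

```python
p = 0.25            # fraction of the population that is mortal (defective)
F_mortal_t = 0.40   # mortal-subpopulation CDF at time t
F_pop_t = p * F_mortal_t
print(F_pop_t)      # 0.10 -> about 10% of the whole sample has failed by time t
```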
  Okay, so let me give you now another example that goes into a little bit more detail on this, and it actually shows the power of JMP to analyze defective subpopulations. For a type of semiconductor module, it's known that a small fraction
  of hermetically sealed modules can have moisture trapped within the case, which increases the likelihood of a failure mechanism that causes these units to fail early.
  And what we'd like to do is fit a suitable life distribution model to these failures.
  Now test parts were especially made to greatly increase the chance of enclosing moisture that would be typical of the defects in the normal manufacturing process, and then we selected 100
  parts at random from those units, and we stress tested them for 2,000 hours. What we observed was 15 failures, and here are the times at which the failures occurred. By the end of the test, 85 units were still surviving. Okay, and these were censored observations.
  Now, since we have exact failure times and a single censoring time, the times are entered into a JMP data table in the following way. Okay, so we entered the times in a column
  and then we entered the censoring indicator, 1 for not censored (a failure), 0 for censored, and then the frequency, the number of failures that occurred at that given point in time. But at 2,000 hours
  we have zero censoring code and then 85 units that are still surviving. OK, so now we've got this JMP data table.
  We're going to use kind of a legacy platform now because it presents a graph that I like to show.
  We're going to select analyze, reliability and survival, and we're going to use survival this time, rather than life distribution.
  So we're going to cast the columns as shown, similar to what we did for the life distribution. We're going to put in the failure times, these are exact failure times.
  We're going to put in the censored observation column, we're going to put in the frequency column. We're going to plot failure instead of survival and, very important, we change the censor code to zero. So let's do that now. Okay, up here. I'm going to close this guy out.
  And we don't need this guy anymore, so I'm going to come up here to
  close this guy out too.
  Okay we're going to go to this defect model over here. Here's our failure times and our censoring and our frequency. So I'm going to come up here, analyze, and I'm going down here to reliability and survival, but instead of going to life distribution, I'm going to go to survival.
  Okay, I put in the failure times.
  I put in the censor column and the frequency. Over here I changed my censor code to zero.
  And I'm going to plot failure instead of survival. Click OK.
  Now I have this plot over here and let's...going to you know, improve this a little bit.
  Let's take it up here to...
  right about there. That's good, I'm going to add some grid lines just for kicks over here, make it easier to interpret and I'll do the same thing over here at grid lines.
  Okay. So what do we notice about this data that is a sign an immortal population may exist? We notice the units were following along a fairly linear type of plot for the CDF.
  And the last failure occurred at about 1,700 hours; the actual data point, let me just go back down, was about 1,664 hours.
  We notice that in the roughly 300 hours prior to that last failure there were five failures, but from 1,664 hours on,
  the remaining 336 hours of the test, there were no failures. So the point that I'm making over here is, we see some kind of a
  saturation point over here in the failures:
  one roughly 300-hour period saw five failures, and then the next 336 hours saw no failures, so we suspect now
  that we have possibly a defect subpopulation because we're not seeing additional failures that we would have expected had this curve continued on. Okay, so what we're going to do now is we're going to go in here
  and we're going to take a look at it in JMP using the Life Distribution platform. Okay, so go over here to reliability and survival, life distribution, and now what we're going to do is put in our failure times,
  okay, the exact times. We're going to put in the censor column, put in the frequency,
  and then the censor code is zero. Again, the likelihood method.
  We click OK.
  And now, what we have over here is a nonparametric plot of the data. Now JMP gives us a whole bunch of distributions that we can fit to this data. Let's choose, going up here, to fit all nonnegative distributions, because reliability (lifetime) distributions are nonnegative.
  And JMP comes out and ranks a generalized gamma first. I'm not going to use the generalized gamma, though, because I'm going to use the lognormal; that was the one that was originally planned
  for the analysis of the data. And it's not that different, by the way, from the generalized gamma if you look at the
  lognormal in terms of the AICc criterion. You can see that the data is not a good fit; we have a lot of data points that are, you know, off on one side of the line. If we use a lognormal scale, okay, over here,
  we see again that, you know, the data points are not falling on the line. Okay, we can also take a look at the parameter estimates of the data as shown over here: a scale parameter of about one and a T50 of about 6,000 hours.
  Okay, so we have an analysis right now that is not giving us a very satisfactory fit to the data, so I'm going to go back now to my PowerPoint presentation
  over here.
  Okay.
  This is just repeating again using the survival platform, we saw no new failures in the last 336 hours, compared to five failures in the previous 336 hours leading up to that point.
  So we suspect a possible defective subpopulation. How do we analyze that in JMP? Well, it turns out that
  we can actually solve for the parameters in the model that I showed you for that using the method of maximum likelihood. It's just a simple extension of the maximum likelihood theory for censored data.
  The basic building blocks of the maximum likelihood equations are the PDF f(t) and the CDF F(t). If, however, only a fraction p
  of the population is susceptible to the failure mechanism that's modeled by F(t), then pF(t) is the probability that a randomly chosen component fails by time t.
  I showed you that with the example of 25% mortals and a 40% mortal CDF giving a 10% observed failure fraction. Similarly, the likelihood
  of a randomly chosen component failing at the exact instant t is given by p times the PDF, pf(t).
  So the rule for writing likelihood equations for the defect model is to just substitute pf (small f) and pF wherever f and F appear in the likelihood equation.
  Now consider the standard likelihood equation for Type I censored data. That means we have n units on stress,
  we have r failures at exact times t1, t2, all the way to t_r, and the remaining n − r units are censored (surviving) at the censoring time T. The likelihood equation for that situation is L = [ ∏ f(t_i) ] × [1 − F(T)]^(n−r). Now we make it
  an equation that can handle a defect model, in which only a fraction p of the population is susceptible to failure, by just substituting p·f(t) for f(t) and p·F(t) for F(t):
  L = [ ∏ p·f(t_i) ] × [1 − p·F(T)]^(n−r), and since p multiplies each of the r failure terms, you can take p^r outside the product.
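  (As an aside, here is a minimal sketch, not the speaker's code, of fitting that defect-model likelihood directly by minimizing the negative log-likelihood with scipy. The 15 failure times below are hypothetical stand-ins; only the last one, about 1,664 hours, is quoted in the talk.)

```python
import numpy as np
from scipy.stats import lognorm
from scipy.optimize import minimize

def neg_log_lik(params, fail_times, n_censored, censor_time):
    # p = mortal fraction, (mu, sigma) = lognormal parameters of the mortals
    p, mu, sigma = params
    if not (0.0 < p <= 1.0) or sigma <= 0.0:
        return np.inf
    dist = lognorm(s=sigma, scale=np.exp(mu))
    # failures contribute p*f(t_i); survivors contribute 1 - p*F(T)
    ll = np.sum(np.log(p * dist.pdf(fail_times)))
    ll += n_censored * np.log(1.0 - p * dist.cdf(censor_time))
    return -ll

# hypothetical stand-ins for the 15 observed failure times (hours)
fail_times = np.array([97, 176, 250, 312, 410, 512, 603, 720,
                       845, 970, 1110, 1280, 1430, 1560, 1664])
res = minimize(neg_log_lik, x0=[0.2, np.log(800.0), 0.5],
               args=(fail_times, 85, 2000.0), method="Nelder-Mead")
p_hat, mu_hat, sigma_hat = res.x
print(p_hat, np.exp(mu_hat), sigma_hat)  # mortal fraction, T50 of mortals, sigma
```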
  So the ML estimates are the values of p and the population parameters that maximize
  the likelihood equation or, equivalently, minimize the negative log-likelihood, an easier equation to work with. And it turns out that's what JMP does. JMP has the maximum likelihood estimates in the
  life distribution platform. So again we use this original Example 9.5 data table as I showed you; we enter the data as shown, putting in the failure times, the censor, and the frequency, and make the changes over here.
  And now we got this output, as I mentioned, was a nonparametric type output.
  And then, when we told JMP to fit all non negative type distributions, that's assuming that all the units can fail, we decided to work with just a lognormal type distribution and we click that.
  Our output appeared in this way, which showed us a lognormal fit that was poor.
  And when we looked at the lognormal probability plot, the lognormal fit is poor again, so it's a highly questionable fit to the data.
  Okay, so now we're going to use a powerful program in JMP called the defective subpopulations fit. Instead of fitting all non negative we're going to fit all DS distributions.
  DS distributions are the defective subpopulation distributions, which use the maximum likelihood equations to estimate not only the CDF of
  the model but also the fraction p in that model. So we're going to do that. Let's go back over here, okay, into JMP.
  Go back into our example over here.
  Okay, so we're going to come at the top over here, and now, what we're going to do is we're
  going to say to JMP fit all defective subpopulation distributions.
  And now, wow, look at that.
  We now have JMP fitting us with a defective subpopulation lognormal type model.
  And the fit is amazing, I mean, it's an excellent fit. And when we go to the nonparametric...the nonparametric scale, we see that we have a model that now works
  and gives us confidence in doing field projections...our projections of the reliability. And if we go down here.
  We showed you before the lognormal parameter estimates, but if we go down here to the DS lognormal,
  the model is now saying that the estimated fraction defective in that population is about 15.8%. Remember, we saw the saturation at about 15%.
  The scale parameter is .35, and now we have the ability to estimate, using the distribution profiler, the
  probability of failure at a given point in time, or, using the quantile profiler, the point in time at which a certain failure probability is reached.
  But the difference between this and the other model, where we didn't assume that we had a defective subpopulation, is really dramatic.
  So it's an important consideration: when you're looking at reliability data, make sure you try to figure out whether you have a defective subpopulation. Let me go back to my JMP...
  my PowerPoint presentation over here.
  Okay, so when you selected fit all DS distributions, JMP automatically selected the DS lognormal. We took a look at the fit, this fit was very, very good.
  The parameter estimates are shown below, and notice that we have an estimate for the percent defective in that model, the percent mortals; the T50 of the mortals is shown over here, and the sigma for the mortals is also shown over here.
  Okay now.
  An addition that I'd like to see JMP incorporate into the life distribution is the ability for us to test and see whether the defect model is better
  than the model that assumes that all units can fail. If the ML estimates have been calculated for a suspected defect model, we can test the hypothesis that p is equal to one.
  That is, there is no defect subpopulation versus the alternative defect model.
  So here's the test: let L1 be the minimized negative log-likelihood for the standard (non-defect) model, and let L2 be the minimized negative log-likelihood for the defect model.
  The likelihood ratio test statistic is lambda = 2(L1 − L2). If the hypothesis is true, in other words, if all units can fail,
  lambda will have approximately a chi-square distribution with one degree of freedom. If lambda is larger than, say, the 95th percentile of a chi-square
  with one degree of freedom, which is 3.84, then we would reject the standard model and accept the defect model at a 95% confidence level.
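  (A small sketch of that comparison; the helper function here is mine, not a JMP feature. Note that if a platform reports −2 × log-likelihood, the plain difference of those values is already lambda.)

```python
from scipy.stats import chi2

def ds_lr_test(negloglik_standard, negloglik_ds, alpha=0.05):
    """Likelihood-ratio test of 'no defective subpopulation' (p = 1) vs. the DS model.
    Inputs are the minimized negative log-likelihoods L1 (standard) and L2 (DS)."""
    lam = 2.0 * (negloglik_standard - negloglik_ds)  # lambda = 2*(L1 - L2)
    crit = chi2.ppf(1.0 - alpha, df=1)               # ~3.84 for alpha = 0.05, 1 df
    return lam, crit, lam > crit                     # True -> prefer the defect model
```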
  So let's go back to the output in JMP and JMP actually provides the estimates for the loglikelihoods over here.
  Okay, so I want to show you that in the JMP example over here, you can see that, for the lognormal, here's the loglikelihood, 307, and for the defective subpopulation is 301, so JMP automatically has these values available for you.
  Okay come back to here.
  Okay, and when we put those values into the test statistic, 307 minus 301 gives us about 6.1. Now 6.1 is larger
  than the chi-square value of 3.84, so in this case, for that Example 9.5, we would reject the standard model that says everything can fail, and we would accept the defect model at 95% confidence.
  Now if you'd like further information on defect models, you can Google defective subpopulations and reliability data for much further information.
  And I've actually got a case study from when I was working at a company, a problem we ended up solving that involved defective subpopulations in product out in the field, and that's referenced too. And in the reliability literature,
  such models are also called limited failure population (LFP) models.
  So it's important in the analysis of reliability data to recognize and factor in the presence of defective subpopulations (DS models) for unbiased results, and JMP's life distribution platform has the capability to analyze the DS data using both visual and MLE methods.
  The references are shown over here. We talk about defective subpopulations in my book, Applied Reliability (third edition), by Paul Tobias and myself.
  This is the example that I mentioned to you, in which we used defective subpopulations and accelerated testing to solve a problem out in the field. And then Meeker and Escobar's Statistical Methods for Reliability Data also discusses limited failure population models.
  So I thank you and I'm open for any questions too. Take care.
Comments
JerryFish

This was a great presentation!  While I have some background in Reliability, I hadn't studied Defective Subpopulations.  As @dctrindade went through his presentation, I wondered why DS was different from Competing Cause analysis.  It turns out the basic assumptions are very different! 

 

If you are concerned with reliability on products where some small portion of parts is susceptible to a defect that causes early failures (but most parts are not), I highly recommend watching this talk!