Hi, my name is Clay Barker, and I'm a statistical developer in the JMP group.
I'm going to be talking about some new features in the distribution platform
geared towards analyzing limit of detection data.
It's something I worked on with my colleague, Laura Lancaster,
who's also a stat developer at JMP.
To kick things off, what is a limited detection problem?
What are we trying to solve?
At its most basic level,
a limit of detection problem is when we have some measurement device
and it's not able to provide good measurements outside of some threshold.
That's what we call a limit of detection problem.
For example, let's say we're taking weight measurements
and we're using a scale that's not able to make measurements below 1 g.
In this instance, we'd say that we have a lower detection limit of 1
because we can't measure below that.
But in our data set, we're still recording those values as 1.
Because of those limitations,
we might have a data set that looks like this.
In particular, we have some values of 1 and some non-1 values that are much bigger.
We don't really believe that those values are 1s.
We just know that those are at most 1.
This kind of data happens all the time in practice.
Sometimes we're not able to measure below a lower threshold.
Sometimes we're not able to measure above an upper threshold.
Those are both limited detection problems.
What should we do about that?
Let's look at a really simple example.
I simulated some data that are normally distributed
with mean 10 and variance 1,
and we're imposing a lower detection limit of 9.
If we look at our data set here,
we have some values above 9, and we have some 9s.
When we look at the histogram,
this definitely doesn't look normally distributed
because we have a whole bunch of extra 9s
and we don't have anything below 9.
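As a rough illustration outside of JMP, here's a minimal Python sketch of that setup; the sample size and random seed are my own choices, not from the talk.

```python
# Minimal sketch (not JMP code): simulate the talk's example --
# N(mean 10, scale 1) with a lower detection limit of 9, where anything
# below 9 gets recorded as 9.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(loc=10.0, scale=1.0, size=500)   # true responses
DL = 9.0                                              # lower detection limit
y_obs = np.maximum(y_true, DL)                        # what the scale records
censored = y_true < DL                                # these are really "at most 9"

print(f"{censored.mean():.1%} of observations hit the detection limit")
```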
What happens if we just model that data as is?
Well, it turns out the results aren't that great.
We get a decent estimate of our location parameter, our Mu,
it's really close to 10, which is the truth.
But we've really underestimated that scale or dispersion parameter.
We've estimated it to be 0.8,
when the truth is that we generated it with scale equal to 1.
You'll notice that our confidence interval for that scale parameter
doesn't even cover 1.
It doesn't contain the truth and that's generally not a great situation to be in.
What's more, we fit a handful of distributions:
the log normal, the gamma, and the normal.
Well, the normal distribution, which is what we used to generate our data,
it isn't even competitive based on the AIC.
Based on those AIC values,
we would definitely choose a log normal distribution to model our response.
We haven't done a good job estimating our parameters.
We're not even choosing to use the distribution
that we generated the data with.
We just threw all those 9s into our data set.
We ignored the fact that that was incomplete information
and that didn't work out well.
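Here's a short Python sketch of that "fit as-is" mistake, using the same kind of simulated data as above (again, my own sketch, not JMP output): the location comes out near 10, but the scale is biased low.

```python
# Treat every recorded 9 as an exact measurement and take the normal MLEs
# (norm.fit returns the maximum likelihood mean and SD).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y_obs = np.maximum(rng.normal(10.0, 1.0, size=500), 9.0)  # 9 = detection limit

mu_naive, sigma_naive = norm.fit(y_obs)
print(mu_naive, sigma_naive)   # mu near 10, sigma well below the true value of 1
```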
What if, instead of ignoring that limit of detection,
what if we just throw out all those 9s?
Well, now we've got a smaller data set and it's biased.
We've thrown out a large chunk of our data.
We have a biased sample now.
Now if we fit our normal distribution,
now we're overestimating the location parameter,
and we're still underestimating the scale parameter.
We're actually in quite a bad position still,
because we haven't done a good job with either of those parameters.
We're still unlikely to pick the normal distribution.
Based on the AIC, the log normal and the gamma distribution
both fit quite a bit better than the normal distribution.
We're still in a bad place.
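A quick Python sketch of that "throw out the 9s" mistake (again just an illustration, not JMP output): fitting only the values above the detection limit gives a truncated sample, so the location is biased high and the scale is still biased low.

```python
# Keep only the values above the detection limit and fit a normal to them.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y_true = rng.normal(10.0, 1.0, size=500)
kept = y_true[y_true >= 9.0]          # discard everything at or below the limit

mu_drop, sigma_drop = norm.fit(kept)
print(mu_drop, sigma_drop)            # mean above 10, SD still below 1
```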
We tried throwing out the 9s, and that didn't turn out well.
We tried just including them as 9s.
That didn't turn out well either.
What should we do instead?
The answer is that we should treat
those observations at the limit of detection
as censored observations.
Censoring is a situation where
we only have partial information about our response variable.
That's exactly the situation we're in here.
If we have an observation at the lower detection limit, and here I've denoted it
D sub L,
we say that observation is left censored.
We don't say that Y is equal to that limit of detection.
We say that Y is less than or equal to that DL value.
On the flip side,
if we have an upper limit of detection, denoted DU here,
those observations are right censored.
Because we're not saying that Y is equal to that value.
We're just saying it's at least that value.
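To make that concrete, here's a minimal Python sketch of a censored normal fit; it captures the general idea behind fitting with a detection limit (exact observations contribute the log density, left-censored ones contribute log P(Y ≤ DL)), though it's my own sketch, not JMP's actual code.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Simulated data as before: N(10, 1) with a lower detection limit of 9.
rng = np.random.default_rng(0)
y_true = rng.normal(10.0, 1.0, size=500)
DL = 9.0
y_obs = np.maximum(y_true, DL)
censored = y_true < DL

def neg_loglik(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                        # keep sigma positive
    ll = norm.logpdf(y_obs[~censored], mu, sigma).sum()   # exact observations
    ll += censored.sum() * norm.logcdf(DL, mu, sigma)     # left-censored ones
    return -ll

res = minimize(neg_loglik, x0=[y_obs.mean(), 0.0], method="Nelder-Mead")
print(res.x[0], np.exp(res.x[1]))   # both should land close to 10 and 1
```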
If you're looking for more information about how to handle censored data,
one of the references that we suggest all the time is
Meeker and Escobar's book Statistical Methods for Reliability Data.
That's a really good overview for how you should treat censored data.
If you've used some of the features in the survival and reliability menu in JMP,
then you're familiar with things like life distribution and fit life by X.
These are all platforms that accommodate censoring in JMP.
What we're excited about in JMP 17 is now
we have added some features to distribution so that we can handle
this limit of detection problem in distribution as well.
All you have to do is add
a detection limit column property to your response variable,
and specify the upper and/or lower detection limit,
and you're good to go; there's nothing else you have to do.
In my simulated example, I had a lower detection limit of 9.
I would put 9 in the lower detection limit field here.
That's really all you have to do.
By specifying that detection limit,
now distribution is going to say, okay, I know that values of 9
are actually left censored,
and I'm going to do estimation accounting for that.
Now with that same simulated example, and this lower detection limit specified,
now you'll notice we get a much more reasonable fit for the normal distribution.
Now our confidence interval for both the location and scale parameter
covers the truth,
because we know, again, the location was 10 and the scale was 1.
Now our confidence intervals cover the truth
and that's a much better situation.
If you look at the CDF plot here,
this is a really good way to compare our fitted distribution to our data.
What it's doing is that red line is the empirical CDF,
and the green line is the fitted normal CDF.
As you can tell, they're really close up until 9.
And that makes sense, because that's where we have censoring.
We're doing a much better job fitting these data
because we're properly handling that detection limit.
I just wanted to point out that when you've specified the detection limit,
the report makes it really clear that we've used it.
As you can see here,
it says the fitted normal distribution with detection limits,
and it lets you know exactly which detection limits it used.
Now, because we're doing a better job estimating our parameters,
inference about those parameters is more trustworthy.
If we look at something like the distribution profiler,
we feel much better about trusting it,
because it's based on our properly fitted distribution.
With the simulated example, if we use our fitted normal distribution,
because we properly handled censoring,
we know that about 16% of the observations
are falling below that lower detection limit.
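That 16% figure is easy to check by hand for the simulated example: it's just the normal CDF evaluated at the detection limit.

```python
# Under a normal with the true parameters (mean 10, scale 1), the chance of
# falling at or below the detection limit of 9 is the CDF at 9.
from scipy.stats import norm
print(norm.cdf(9, loc=10, scale=1))   # about 0.159, i.e. roughly 16%
```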
I also wanted to point out that
when you have detection limits in distribution,
now we're only able to fit a subset of the distributions
that you would normally see in the distribution platform.
We can fit the normal, exponential, gamma, log normal, Weibull, and beta.
All of those distributions support censoring,
or limit of detection, in distribution.
But if you were using something like the mixture of normals,
well, that doesn't extend well to censored data.
You're not going to be able to fit that when you have a limit of detection.
I also wanted to point out that if you have JMP Pro
and you're used to using the generalized regression platform,
generalized regression recognizes that exact same column property.
The detection limit column property
is recognized by both distribution and generalized regression.
One of the really nice things about
this new feature is that it gets carried on to the capability platform.
If you do your fit in distribution and you launch capability,
now we're going to get more trustworthy capability results.
Let's say that we're manufacturing a new drug,
and we want to measure the amount of some impurity in the drug.
Our data might look like what we have here.
We have a bunch of small values, and we have a lower detection limit of 1 mg.
These values of 1 that are highlighted here,
we don't really think those are 1.
We know that it's something less than 1.
We have an upper specification limit of 2.5 milligrams.
This is a situation where we have both spec limits and detection limits.
It's really easy to specify those in the column properties.
Here we've defined our upper spec limit as 2.5
and our lower detection limit as 1.
Now all you have to do is just
give distribution the column that you want to analyze.
It knows exactly how to handle that response variable.
Let's look at the capability results.
Now, because we've properly handled that limit of detection,
we trust that our log normal fit is good.
We see that our Ppk value here is 0.546.
That's not very good.
Usually you would want a Ppk above 1.
This is telling us that our system is not very capable.
We've got some problems that we might need to sort out.
Once again, what would have happened if we had ignored that limit of detection
and we had just treated all those 1s as if they truly were 1s?
Well, let's take a look.
We do our fit, ignoring the limit of detection, and we get a Ppk of above 1.
Based on this fit,
we would say that we actually have a decently capable system,
because a Ppk of 1 is not too bad.
It might be an acceptable value.
By ignoring that limit of detection,
we've tricked ourselves into thinking our system is more capable than it really is.
I think this is a cool example,
because we have a lower detection limit, which may lead you to believe
that ignoring the limit of detection would be conservative,
because I'm overestimating the location parameter.
That's true, when we ignore the limit of detection,
we're overestimating that location parameter.
But the problem is we're grossly underestimating the scale parameter.
That's what leads us to make bad decisions out in the tail
of that distribution.
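Here's a small hypothetical illustration of that tail effect, going back to the simulated normal example; the spec limit of 12 is my own number, chosen only to show how an underestimated scale hides out-of-spec risk.

```python
# Compare the out-of-spec fraction above an assumed upper spec limit of 12
# under the true N(10, 1) versus the naively underestimated scale of about 0.8.
from scipy.stats import norm
print(norm.sf(12, loc=10, scale=1.0))   # ~0.023 truly out of spec
print(norm.sf(12, loc=10, scale=0.8))   # ~0.006 -- the risk looks ~4x smaller
```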
By ignoring that limit of detection,
we've really gotten ourselves into a bad situation.
Just to summarize,
it's really important to recognize when our data feature a limit of detection.
I think it's easy to think of data sets
where maybe we've analyzed the response as is in the past,
when really we should have adjusted for a limit of detection.
Because like we just saw, when we ignore those limits, we get misleading fits.
Then we may make bad decisions based on those misleading fits.
Like we saw in our example,
we got bad estimates of the location and scale parameters,
and our Ppk estimate was almost double what it should have been.
But what we're excited about in JMP 17
is that the distribution platform makes it really easy to avoid these pitfalls
and to analyze this kind of data properly.
All you have to do is specify that detection limit column property,
and distribution knows exactly what to do with that.
Today we only looked at lower detection limits,
but you can have upper detection limits as well.
In fact, you can have both.
Like I said, there's only six distributions that currently support
censoring inside of the distribution platform.
But those are also the six most important distributions for these kinds of data.
It really is a good selection of distributions.
That's it.
I just wanted to thank you for your time
and encourage you to check out these enhancements to the distribution platform.
Thanks.