
## Introducing Limits of Detection in the Distribution Platform (2022-US-EPO-1164)

Clay Barker, Principal Research Statistician Developer, SAS
Laura Lancaster, Principal Research Statistician Developer, JMP

When taking measurements, sometimes we are unable to reliably measure above or below a particular threshold. For example, we may be weighing items on a scale that is known to only be able to measure as low as 10 grams. This kind of threshold is known as a "Limit of Detection" and is important to incorporate into our data analysis. This poster will highlight new features in the Distribution platform for JMP 17 that make it easier to analyze data that feature detection limits. We will highlight the importance of recognizing detection limits when analyzing process capability and show how ignoring detection limits can cause a quality analyst to draw incorrect conclusions about the quality of their processes.

Hi,  my  name  is  Clay  Barker,  and  I'm  a statistical  developer  in  the  JMP  group.

I'm  going  to  be  talking  about  some  new features  in  the  distribution  platform

geared  towards  analyzing limit  of  detection  data.

It's  something  I  worked  on  with  my  colleague,  Laura  Lancaster,

who's  also  a  stat  developer  at  JMP.

To kick things off, what is a limit of detection problem?

What  are  we  trying  to  solve?

At its most basic level,

a  limit  of  detection  problem  is  when   we  have  some  measurement  device

and  it's  not  able  to  provide  good measurements  outside  of  some  threshold.

That's  what  we  call a  limit  of  detection  problem.

For example, let's say we're taking weight measurements

and we're using a scale that's not able to make measurements below 1 g.

In  this  instance,  we'd  say  that  we  have a lower  detection  limit  of  1

because we  can't  measure  below  that.

But  in  our  data  set,  we're  still recording  those  values  as  1.

Because  of  those  limitations,

we might have a data set that looks like this.

In  part,  we  have  some  values  of  1  and some  non-1  values  that  are  much  bigger.

We  don't  really  believe that  those  values  are  1s.

We  just  know  that  those  are  at  most  1.

This  data  happens  all the  time  in  practice.

If you think about it, sometimes we're not able to measure below a lower threshold.

Sometimes  we're  not  able  to  measure above  an  upper  threshold.

Those are both limit of detection problems.

What  should  we  do  about  that?

Let's  look  at  a  really  simple  example.

I  simulated  some  data that  are  normally  distributed

with mean 10 and variance 1,

and  we're  imposing  a  lower detection  limit  of  9.

If  we  look  at  our  data  set  here,

we  have  some  values  above  9, and  we  have  some  9s.

When  we  look  at  the  histogram,

this  definitely  doesn't look  normally  distributed

because  we  have a  whole  bunch  of  extra  9s

and  we  don't have  anything  below  9.

What  happens  if  we  just model  that  data  as  is?

Well,  it  turns  out  the  results aren't  that  great.

We  get  a  decent  estimate of  our  location  parameter,  our  Mu,

it's  really  close to  10,  which  is  the  truth.

But  we've  really  underestimated that  scale  or  dispersion  parameter.

We've  estimated  it  to  be  0.8,

when  the  truth  is  that  we  generated it  with  scale  equal  to  1.

You'll  notice  that  our  confidence interval  for  that  scale  parameter

doesn't  even  cover  1.

It  doesn't  contain  the  truth  and  that's generally  not  a  great  situation  to  be  in.
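A minimal sketch of this naive analysis, using only the Python standard library rather than JMP. The seed, the sample size of 10,000, and the variable names are assumptions made for illustration; only the mean of 10, the standard deviation of 1, and the lower detection limit of 9 come from the talk.

```python
import random
import statistics

rng = random.Random(2022)      # arbitrary seed so the sketch is repeatable
LOWER_LIMIT = 9.0              # lower detection limit from the example

# True process: normal with mean 10 and standard deviation 1.
true_values = [rng.gauss(10.0, 1.0) for _ in range(10_000)]

# The instrument reports the limit itself whenever the true value is below it.
recorded = [max(y, LOWER_LIMIT) for y in true_values]

# Naive fit: treat every recorded 9 as if it were an exact measurement.
naive_mean = statistics.fmean(recorded)
naive_sd = statistics.pstdev(recorded)   # divides by n, like a normal MLE

print(f"naive mean = {naive_mean:.3f}, naive sd = {naive_sd:.3f}")
```

In this setup the mean stays close to 10 while the standard deviation lands clearly below the true value of 1, which is the same pattern described above.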

What's more, we fit a handful of distributions:

the lognormal, the gamma, and the normal.

The normal distribution, which is what we used to generate our data,

isn't even competitive based on the AIC.

Based  on  those  AIC  values,

we would definitely choose a lognormal distribution to model our response.

We  haven't  done  a  good  job estimating  our  parameters.

We're  not  even  choosing to  use  the  distribution

that  we  generated  the  data  with.
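The AIC comparison can be sketched outside JMP as well. This is an illustration under assumptions, not the report from the talk: it compares only the normal and lognormal fits (both have closed-form maximum likelihood estimates), it uses AIC = 2k - 2 log L without any small-sample correction, and the seed, sample size, and function names are hypothetical.

```python
import math
import random

def normal_aic(values):
    """AIC of a normal distribution fitted by maximum likelihood."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    log_lik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return 2 * 2 - 2 * log_lik            # two estimated parameters

def lognormal_aic(values):
    """AIC of a lognormal fit: a normal fit to log(values) plus a Jacobian term."""
    logs = [math.log(v) for v in values]
    n = len(logs)
    mean = sum(logs) / n
    var = sum((x - mean) ** 2 for x in logs) / n
    log_lik = -0.5 * n * (math.log(2 * math.pi * var) + 1) - sum(logs)
    return 2 * 2 - 2 * log_lik            # two estimated parameters

# Simulated normal(10, 1) data with values below 9 recorded as 9.
rng = random.Random(2022)
recorded = [max(rng.gauss(10.0, 1.0), 9.0) for _ in range(10_000)]

aic_normal = normal_aic(recorded)
aic_lognormal = lognormal_aic(recorded)
print(f"normal AIC = {aic_normal:.1f}, lognormal AIC = {aic_lognormal:.1f}")
```

The pile of 9s makes the recorded data right skewed, so the lognormal gets the smaller (better) AIC here even though the underlying process is normal.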

We just threw all those 9s into our data set.

We  ignored  the  fact  that that was  incomplete  information

and that  didn't  work  out  well.

What if, instead of ignoring that limit of detection,

we just throw out all those 9s?

Well,  now  we've  got  a  smaller data  set  and  it's  biased.

We've thrown out a large chunk of our data.

We  have  a  biased  sample  now.

Now  if  we  fit  our  normal  distribution,

now  we're  overestimating the  location  parameter,

and we're still underestimating the scale parameter.

We're  actually  in  quite a  bad  position  still,

because  we  haven't  done  a  good job  with  either  of  those  parameters.
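The bias from discarding the censored values can be worked out directly, because what remains is a sample from a normal distribution truncated below at the detection limit. A small sketch using the standard truncated-normal moment formulas; the variable names are hypothetical, and the true parameters (10 and 1) and the limit (9) are the ones from the simulated example.

```python
from statistics import NormalDist

MU, SIGMA, LOWER_LIMIT = 10.0, 1.0, 9.0
std = NormalDist()   # standard normal

alpha = (LOWER_LIMIT - MU) / SIGMA                 # standardized limit, here -1
hazard = std.pdf(alpha) / (1.0 - std.cdf(alpha))   # inverse Mills ratio

# Moments of Y given Y > LOWER_LIMIT (a normal truncated from below).
kept_mean = MU + SIGMA * hazard
kept_sd = SIGMA * (1.0 + alpha * hazard - hazard ** 2) ** 0.5

print(f"mean of kept values = {kept_mean:.3f}")   # about 10.29, above the true 10
print(f"sd of kept values   = {kept_sd:.3f}")     # about 0.79, below the true 1
```

So even with unlimited data, fitting a normal to only the values above 9 would push the location estimate up and the scale estimate down.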

We're  still  unlikely  to  pick the normal  distribution.

Based on the AIC, the lognormal and the gamma distribution

both  fit  quite  a  bit  better than  the  normal  distribution.

We're  still  in  a  bad  place.

We tried throwing out the 9s, and that didn't turn out well.

We tried just including them as 9s.

That  didn't  turn  out  well  either.

The  answer  is  that  we  should  treat

those  observations at  the  limit  of  detection.

We  should  treat  those  as censored  observations.

Censoring  is  a  situation  where

we  only  have  partial  information about  our  response  variable.

That's  exactly the  situation  we're  in  here.

If  we  have  an  observation  at  the  lower detection  limit,  and  here  I've  denoted  it

DL,

we say that observation is left censored.

We  don't  say  that  Y  is  equal to  that  limit  of  detection.

We  say  that  Y  is  less  than or  equal  to  that  DL  value.

On  the  flip  side,

if we have an upper limit of detection, denoted DU here,

those  observations  are  right  censored.

Because  we're  not saying  that  Y  is  equal  to  that  value.

We're  just  saying  it's at  least  that  value.
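The way censored observations enter the fit can be written as a likelihood. This is the standard form for censored data, given here as a sketch since the talk does not spell out a formula: $f$ and $F$ are the density and CDF of whichever distribution is being fit, $\theta$ is its parameter vector, and $D_L$ and $D_U$ are the lower and upper detection limits (DL and DU above). The index sets are notation introduced for this illustration.

$$
L(\theta) \;=\; \prod_{i:\, D_L < y_i < D_U} f(y_i;\theta)\;
\prod_{i:\, y_i \le D_L} F(D_L;\theta)\;
\prod_{i:\, y_i \ge D_U} \bigl[\,1 - F(D_U;\theta)\,\bigr]
$$

An exactly measured value contributes its density, a left-censored value contributes the probability of being at or below $D_L$, and a right-censored value contributes the probability of being at or above $D_U$.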

If  you're  looking  for  more  information about  how  to  handle  censored  data,

one  of  the  references that  we  suggest  all  the  time  is

Meeker and  Escobar's  book Statistical  Methods  for  Reliability  Data.

That's  a  really  good  overview for  how  you  should  treat  censored  data.

If you've used some of the features in the survival and reliability menu in JMP,

then you're familiar with things like life distribution and fit life by X.

These  are  all  platforms that  accommodate  censoring  in  JMP.

What  we're  excited  about  in  JMP  17  is  now

we have added some features to the distribution platform so that we can handle

this limit of detection problem there as well.

All  you  have  to  do  is  you  add

a  detection  limit  column  property to  your  response  variable,

and you specify what the upper and/or lower detection limit is,

and  you're  good  to  go, there's  nothing  else  you  have  to  do.

In  my  simulated  example, I  had  a  lower  detection  limit  of  9.

I  would  put  9  in  the  lower detection  limit  field  here.

That's  really  all  you  have  to  do.

By  specifying  that  detection  limit,

now  distribution  is  going  to  say,   okay,  I  know  that  values  of  9

are  actually  left  censored,

and  I'm  going  to  do estimation  accounting  for  that.

Now  with  that  same  simulated  example, and  this  lower  detection  limit  specified,

now you'll notice we get a much more reasonable fit for the normal distribution.

Now  our  confidence  interval  for  both the  location  and  scale  parameter

covers  the  truth,

because  we  know,  again,  the location  was  10  and  the  scale  was  1.

Now  our  confidence intervals  cover  the  truth

and that's  a  much  better  situation.
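To see why accounting for censoring recovers the parameters, here is a small standard-library sketch that fits a normal distribution to left-censored data. It uses an EM iteration, which is a technique chosen for this illustration because it needs no optimizer; it is not a description of how JMP computes its estimates. The function name, seed, sample size, and iteration count are all assumptions.

```python
import math
import random
from statistics import NormalDist

def fit_left_censored_normal(values, lower_limit, iterations=200):
    """EM estimate of (mu, sigma) when values at lower_limit are left censored."""
    std = NormalDist()
    observed = [v for v in values if v > lower_limit]
    n = len(values)
    n_cens = n - len(observed)
    sum_obs = sum(observed)
    sum_sq_obs = sum(v * v for v in observed)

    # Start from the naive fit that treats censored values as exact.
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)

    for _ in range(iterations):
        # E-step: moments of Y given Y <= lower_limit under the current fit.
        alpha = (lower_limit - mu) / sigma
        ratio = std.pdf(alpha) / std.cdf(alpha)
        cens_mean = mu - sigma * ratio
        cens_var = sigma ** 2 * (1.0 - alpha * ratio - ratio ** 2)
        cens_second = cens_var + cens_mean ** 2

        # M-step: ordinary normal MLE on the completed sufficient statistics.
        mu = (sum_obs + n_cens * cens_mean) / n
        sigma = math.sqrt((sum_sq_obs + n_cens * cens_second) / n - mu ** 2)
    return mu, sigma

# Simulated normal(10, 1) data with values below 9 recorded as 9.
rng = random.Random(2022)
LOWER_LIMIT = 9.0
recorded = [max(rng.gauss(10.0, 1.0), LOWER_LIMIT) for _ in range(10_000)]

mu_hat, sigma_hat = fit_left_censored_normal(recorded, LOWER_LIMIT)
print(f"censored fit: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
```

With the 9s treated as "at most 9" rather than "exactly 9", both estimates come back close to the true values of 10 and 1.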

If  you  look  at  the  CDF  plot  here,

this  is  a  really  good  way  to  compare our  fitted  distribution  to  our  data.

The red line is the empirical CDF,

and the green line is the fitted normal CDF.

As you can tell, they're really close up until 9.

And that  makes  sense,  because that's  where  we  have  censoring.

We're  doing a  much  better  job  fitting  these  data

because  we're  properly handling  that  detection  limit.

I  just  wanted  to  point  out  that  when you've  specified  the  detection  limit,

the  report  makes  it  really clear  that  we've  used  it.

As  you  can  see  here,

it  says  the  fitted  normal  distribution with  detection  limits,

and  it  lets  you  know  exactly which  detection  limits  it  used.

Now, because we're doing a better job estimating our parameters,

inference about those parameters is more trustworthy.

If we look at something like the distribution profiler,

we can trust the inference based on our fitted distribution.

With the simulated example, if we use our fitted normal distribution,

because we properly handled censoring,

we know that about 16% of the observations

are falling below that lower detection limit.
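The 16% figure is just the normal CDF evaluated at the detection limit. A quick check, using the true simulation parameters (mean 10, standard deviation 1) as a stand-in for the fitted values, which the talk does not list exactly:

```python
from statistics import NormalDist

# Stand-in for the fitted distribution: the true simulation parameters.
fitted = NormalDist(mu=10.0, sigma=1.0)

# Probability that an observation falls below the lower detection limit of 9.
share_below_limit = fitted.cdf(9.0)
print(f"{share_below_limit:.1%}")   # prints 15.9%
```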

I  also  wanted  to  point  out  that

when  you  have detection  limits  in  distribution,

now  we're  only  able  to  fit a  subset  of  the  distributions

that  you  would  normally see  in  the  distribution  platform.

We can fit the normal, exponential, gamma, lognormal, Weibull, and beta.

All  of  those  distributions support  censoring

or limits of detection in distribution.

But  if  you  were  using  something  like the  mixture  of  normals,

well, that doesn't extend well to censored data.

You're  not  going  to  be  able  to  fit  that   when  you  have  a  limit  of  detection.

I also wanted to point out if you have JMP Pro

and  you're  used  to  using the  generalized  regression  platform,

generalized  regression  recognizes that  exact  same  column  property.

The  detection  limit  column  property

is recognized by both distribution and generalized regression.

One  of  the  really  nice  things  about

this new feature is that it gets carried over to the capability platform.

If you do your fit in distribution, and you launch capability,

now  we're  going  to  get  more trustworthy  capability  results.

Let's  say  that  we're manufacturing  a  new  drug,

and we want to measure the amount of some impurity in the drug.

Our  data  might  look like  what  we  have  here.

We  have  a  bunch  of  small  values,  and  we have  a  lower  detection  limit  of  1  mg.

These values of 1 that are highlighted here,

we don't really think those are 1.

We know that they're something less than 1.

We  have  an  upper  specification limit  of  2.5  milligrams.

This is a situation where we have both spec limits and detection limits.

It's  really  easy  to  specify those  in  the  column  properties.

Here we've defined our upper spec limit as 2.5

and our lower detection limit as 1.

Now  all  you  have  to  do  is  just

give  distribution the  column  that  you  want  to  analyze.

It  knows  exactly  how  to  handle  that response  variable.

Let's  look  at  the  capability  results.

Now,  because  we've  properly handled  that  limit  of  detection,

we trust that our lognormal fit is good.

We see that our Ppk value here is 0.546.

That's  not  very  good.

Usually  you  would  want  a  Ppk  above  1.

This  is  telling  us  that our  system  is  not  very  capable.

We've  got  some  problems  that  we might  need  to  sort  out.

Once  again,  what  would  have  happened  if we  had  ignored  that  limit  of  detection

and  we  had  just  treated  all  those 1s  as  if  they  truly  were  1s.

Well,  let's  take  a  look.

We do our fit, ignoring the limit of detection, and we get a Ppk above 1.

Based on  this  fit,

we  would  say  that  we  actually  have a  decently  capable  system,

because  a  Ppk  of  1  is  not  too  bad.

It  might  be  an  acceptable  value.

By  ignoring  that  limit  of  detection,

we've  tricked  ourselves  into  thinking  our system  is  more  capable  than  it  really  is.

I  think  this  is  a  cool  example,

because  we  have  a  lower  detection  limit, which  may  lead  you  to  believe,

well, maybe ignoring the limit of detection would be conservative,

because  I'm  overestimating the  location  parameter.

That's  true, when  we  ignore  the  limit  of  detection,

we're  overestimating that  location  parameter.

But  the  problem  is  we're  grossly underestimating  the  scale  parameter.

That's  what  makes  us  make  bad decisions  out  in  the  tail

of  that  distribution.

By  ignoring  that  limit  of  detection,

we've  really  gotten  ourselves into  a  bad  situation.
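A rough sketch of why the scale estimate dominates here. It uses the percentile (quantile) convention for a one-sided capability index on a lognormal fit, Ppk = (USL - median) / (P99.865 - median); whether that matches the method and settings behind the numbers in the talk is an assumption. The two parameter pairs are hypothetical values picked to mimic "censoring-aware" and "naive" fits, not the estimates from the impurity data. Only the upper spec limit of 2.5 comes from the example.

```python
import math
from statistics import NormalDist

def lognormal_ppk_upper(mu_log, sigma_log, usl):
    """One-sided Ppk for a lognormal fit using the percentile convention."""
    z = NormalDist().inv_cdf(0.99865)            # close to 3
    median = math.exp(mu_log)
    upper_tail = math.exp(mu_log + z * sigma_log)  # 99.865th percentile
    return (usl - median) / (upper_tail - median)

USL = 2.5  # upper spec limit from the example, in mg

# Hypothetical fit that accounts for censoring: lower location, larger scale.
ppk_censored = lognormal_ppk_upper(mu_log=0.0, sigma_log=0.50, usl=USL)

# Hypothetical naive fit: slightly higher location, much smaller scale.
ppk_naive = lognormal_ppk_upper(mu_log=0.1, sigma_log=0.25, usl=USL)

print(f"Ppk with larger scale  = {ppk_censored:.2f}")   # about 0.43
print(f"Ppk with smaller scale = {ppk_naive:.2f}")      # about 1.13
```

Even though the second fit has the higher location, its smaller scale pulls the upper tail well inside the spec limit, so the index looks far better than it should.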

Just  to  summarize,

it's  really  important  to  recognize  when our  data  feature  a  limit  of  detection.

I think it's easy to think of data sets where

maybe  we've  analyzed the  response  as  is  in  the  past,

when  really,  maybe  we  should  have adjusted  for  a  limit  of  detection.

Because  like  we  just  saw,  when  we  ignore those  limits,  we  get  misleading  fits.

Like  we  saw  in  our  example,

we got bad estimates of the location and scale parameters,

and  our  Ppk  estimate  was almost  double  what  it  should  have  been.

But  what  we're  excited  about  in  JMP  17

is  that  the  distribution  platform  makes  it really  easy  to  avoid  these  pitfalls

and  to  analyze  this  kind of data  properly.

All  you  have  to  do  is  specify that  detection  limit  column  property,

and  distribution  knows exactly  what  to  do  with  that.

Today  we  only  looked  at lower  detection  limits,

but you can just as well have upper detection limits.

In  fact,  you  can  have  both.

Like I said, there are only six distributions that currently support

censoring inside  of  the  distribution  platform.

But  those  are  also  the  six  most  important distributions  for  these  kinds  of  data.

It  really  is a  good selection  of  distributions.

That's  it.

I just wanted  to  thank  you  for  your  time

and  encourage  you  to  check  out  these enhancements  to  the  distribution  platform.

Thanks.