Well, Not Exactly: An Introduction to Censored Data Analysis - (2023-US-PO-1506)

Michael Crotty, Principal Statistical Writer, JMP


There are many times when an exact measurement is not possible, but a range of values for the measurement is available. Censored data analysis methods enable you to incorporate the information from both types of measurements. This presentation provides an introduction to censored data situations: when they appear, how to handle them, and what happens when you do not handle them appropriately. This presentation includes examples of censoring in univariate and regression settings by using the Life Distribution and Generalized Regression platforms in JMP and JMP Pro 17, including the new Detection Limits column property.



Hi,  my  name  is  Michael  Crotty.


I'm  a  statistical  writer with  the  Stat  Documentation  Team  at  JMP,

and  today, I'm  going  to  talk  about  an  introduction

to  censored  data  analysis in  JMP  and  JMP  Pro.

To  start,  we've  got three  common  types  of  censoring.

Just  to  back  up  a  bit,  censored  data  occur

when  you  don't  have  an  exact  measurement for  an  observation,

but  you  do  know a  range  for  the  observation,

so  you  know  not  the  exact  value,

but  you  do  know  something about  where  the  value  might  be.

What  we  want  to  do by  using  censoring  in  our  analyses

is  to  use  that  information  that  we  have, even  if  it's  not  exact.

The  three  types  of  censoring that  we'll  talk  about  today

are  right  censoring,  left  censoring, and  interval  censoring.

Right  censoring  is  probably the  most  common  form  of  censoring.

It  occurs  when  the  event  of  interest just  doesn't  have  time  to  occur  yet

by  the  end  of  the  study.

In  a  reliability  test,

you  might  have a  bunch  of  light  bulbs   under test

and  at  the  end  of  the  test  period, some  of  them  have  failed.

Those  are  exact  observations, but  then  some  haven't  failed  yet.

You  know  they're  going  to  fail,

but  your  study  has  ended, so  it's  censored  at  that  point.

Same  thing  in  survival  models

where  a  patient  survives to  the  end  of  the  study.

One  thing  to  note  is  that,  in  JMP, right  censoring  is  the  only  type

that  supports  a  single  response  column alongside  a  binary  censor  column.
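As a rough illustration of that single-response layout, here is a minimal Python sketch (not JMP code). The bulb lifetimes, the `right_censor` helper, and the choice of 1 as the censor code are all assumptions made for this example.

```python
def right_censor(true_times, study_end):
    """Return (observed_time, censor_flag) rows for a right-censored study.

    censor_flag is 1 when the unit was still running at study_end
    (censored) and 0 when the failure time was observed exactly.
    """
    rows = []
    for t in true_times:
        if t > study_end:
            # Still working when the study stopped: record the stop time.
            rows.append((study_end, 1))
        else:
            rows.append((t, 0))
    return rows


# Hypothetical light bulb lifetimes in hours; the test stops at 2000 hours.
bulb_hours = [350.0, 1200.0, 2400.0, 800.0, 3100.0]
table = right_censor(bulb_hours, study_end=2000.0)
for observed, flag in table:
    print(observed, flag)
```

The two bulbs that outlast the test both show up with the study end time and a censor flag of 1, which is the only thing we actually know about them.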

The  next  type  is  left  censoring.

That's  where  the  event  of  interest  occurs before  the  observation  starts.

A  common  example  of  that  would  be where  you  put  a  bunch  of  units  under  test

and  at  the  time that  you  do  the  first  inspection,

some  of  them  have  already  failed.

You  know  that they  started  without  a  failure,

but  by  the  time  you  measured  them, you  checked  on  them,  they  had  failed.

So  they  failed  sometime  before  that  point.

Another  example  of  that is  limited  detection

where  you  have  a  measurement  tool

that  can't  measure below  a  certain  threshold.

The  last  type  we'll  talk  about  today is  interval  censoring.

This  is  where  your  event  of  interest happens  between  observation  times.

If  you  have  a  periodic  inspection  schedule instead  of  continuous  observation,

you  might  see  that  something  fails or  something  happens

between  time  two  and  three.

It  didn't  happen  at  time  two and  it  didn't  happen  at  time  three,

but  it  was  somewhere  in  that  interval.
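To make the periodic inspection idea concrete, here is a small Python sketch that brackets a failure time between the surrounding inspection times. The `inspection_interval` function and the inspection schedule are hypothetical, and `None` stands in for an unknown endpoint.

```python
import bisect


def inspection_interval(failure_time, inspection_times):
    """Bracket a failure between the inspections on either side of it.

    Returns (start, end). start is None if the unit had already failed at
    the first inspection; end is None if it had not failed by the last one.
    """
    times = sorted(inspection_times)
    # Index of the first inspection at or after the failure.
    i = bisect.bisect_left(times, failure_time)
    start = times[i - 1] if i > 0 else None
    end = times[i] if i < len(times) else None
    return (start, end)


schedule = [1, 2, 3, 4]  # hypothetical inspection times
print(inspection_interval(2.4, schedule))  # failed between inspections 2 and 3
print(inspection_interval(0.5, schedule))  # failed before the first inspection
print(inspection_interval(5.0, schedule))  # still running at the last inspection
```

The same function covers all three situations: a failure before the first inspection comes out left censored, and a unit that outlives the schedule comes out right censored.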

Take  a  quick  look at  what  this  looks  like  in  JMP.

Here's  an  example  of  the  right  censoring

with  a  response  column and  a  censor  column.

In  the  platforms  that  support  censoring,

you  always  see  this  censor  role, which  is  for  that  binary  censoring  column.

This  is  the  way  that  you  can  specify  censoring  more  generally,

which  is  with  two  response  columns.

Basically, it's  like  a  start  time  and  an  end  time.

For  left  censoring,

we  don't  know  when  it  happened, so  the  start  time  is  missing,

but  the  end  time, we  know  it  happened  before  time  50,

so  somewhere  before  that.

Reverse  that  for  right  censoring: we  know  that  at  time  25,

it  hadn't  happened  yet, but  it  happened  sometime  after  that.

Then  with  interval,

both  the  start  and  endpoints are  non-missing,

but  we  don't  know  exactly  when  the  event  happened, in  this  case  somewhere  between  80  and  150.

It's  not  shown  in  the  table  up  here,

but  down  here,  we've  got  some  rows where  the  observation  is  exact.

To  specify  that,

you  just  use  the  same  value in  both  columns.

That  means  essentially it's  like  an  interval  with  zero  width.

It  happened  at  that  exact  time.
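The two-column rules just described can be summarized in a short Python sketch. This is only an illustration of the encoding, not how JMP reads the table. The `censoring_type` function is hypothetical, `None` plays the role of a missing value, and the exact row at 100 is a made-up value.

```python
def censoring_type(start, end):
    """Classify one (start, end) row of a two-column censored response."""
    if start is None and end is None:
        raise ValueError("at least one endpoint must be known")
    if start is None:
        return "left"      # happened some time before `end`
    if end is None:
        return "right"     # had not happened yet at `start`
    if start == end:
        return "exact"     # an interval with zero width
    if start < end:
        return "interval"  # happened somewhere between the two
    raise ValueError("start must not exceed end")


rows = [(None, 50), (25, None), (80, 150), (100, 100)]
for start, end in rows:
    print(start, end, censoring_type(start, end))
```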

Next,  we're  going  to  talk about  two  examples  of  censoring.

The  first  is if  you  have  censoring  in  your  data,

but  maybe  you  don't  know  how  to  handle  it,

and  so  you  just  think, "I'll  just  ignore  it."

We're  going  to  look  at  what  can possibly  happen  when  you  do  that.

In  this  example,

we've  got  simulated  data from  a  lognormal  distribution

and  the  observed  data

that  we'll  use  for  analysis in  our  different  cases

is  where  all  the  values  from  the  true  data that  are  over  1,900  are  set  to  1,900,

because  that's  the  censoring  time; this  is  right  censoring.

There  are  a  few  possible  things you  could  do

if  you're  trying  to  estimate this  mean  failure  time.

You  could  do  nothing.

You  could  just  use  this  observed  data with  a  whole  bunch  of  values  set  to  1,900,

act  like  that's  when  it  happened.

You  could  treat  those  as  missing  values, just  drop  them  from  your  data,

or  you  could  use  the  censoring  information that  you  have  in  your  analysis.

For  right  censoring, these  first  two  approaches

are  going  to  tend  to  underestimate the  mean  failure  time

because  you're  dropping  information from  the  data  at  that  far  end.
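A small simulation shows why the first two approaches come out low. This Python sketch only mimics the setup. The lognormal parameters (7.0 and 0.5), the sample size, and the seed are assumptions chosen for illustration, not the values behind the data in the talk.

```python
import random
import statistics

random.seed(2023)
CENSOR_TIME = 1900.0

# Assumed "true" failure times; in practice these are never fully observed.
true_y = [random.lognormvariate(7.0, 0.5) for _ in range(1000)]

# Approach 1: do nothing, so every censored value is recorded as 1,900.
observed_y = [min(y, CENSOR_TIME) for y in true_y]

# Approach 2: treat the censored values as missing and drop them.
missing_y = [y for y in true_y if y <= CENSOR_TIME]

mean_true = statistics.fmean(true_y)
mean_observed = statistics.fmean(observed_y)
mean_missing = statistics.fmean(missing_y)

print("true mean:             ", round(mean_true, 1))
print("mean with 1,900s kept: ", round(mean_observed, 1))
print("mean with rows dropped:", round(mean_missing, 1))
```

Capping values at 1,900 can only pull the mean down, and dropping those rows removes the largest values entirely, so the dropped-row mean is the lowest of the three.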

Looking  more  closely  at  this, because  this  is  simulated  data,

we  have  the  true  distribution here  in  this  first  column.

That's  just  for  comparison, but  in  general,  you  wouldn't  have  that

because  all  you'd  know  about  the  censored  values is  that  they're  above  1,900.

You  don't  know  where  these  fall.

In  our  observed  Y,

this  is  where  we  just  use all  the  1,900s  as  values  of  1,900.

We  have  no  missing  values,

but  a  big  point  mass at  the  top  of  our  distribution  here.

You  can  see  that  the  mean is  a  lot  smaller  than  the  true  mean.

In  this  missing  Y  column,  this  is where  instead  of  treating  them  as  1,900,

we  drop  them.

We  set  them  to  missing and  analyze  the  distribution  without  them.

Here  you  can  see  that now  our  maximum  of  the  non-missing  values

is  less  than  1,900, which  really  doesn't  make  any  sense

because  we  know  that  a  bunch  of  them, 21  observations,  in  fact,

are  some  value  greater  than  1,900.

So this  underestimates  the  mean  even  more.

Then  on  the  right  here,

we've  got  an  analysis  in  Life  Distribution in  JMP.

This  is  where  we're  using the  observed  Y  column.

It's  got  those  1,900s,

but  we're  also  using  a  censoring  column alongside  it.

For  the  rows  where  observed  Y  is  1,900,

our  censor  column  is  going  to  say that  it's  a  censored  observation.

Here  we  can  see  that  our  mean,

it  actually  ends  up being  a  little  higher  than  the  true  mean,

but  our  lognormal  parameter  estimates are  much  closer  to  the  true  values

and  we're  incorporating all  the  information  that  we  have.
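For readers who want to see what using the censor column can look like computationally, here is a Python sketch of a lognormal maximum likelihood fit with right-censored rows. It is not the Life Distribution implementation. The simulated parameters, the seed, and the simple compass-search optimizer are all assumptions made so that the example needs only the standard library.

```python
import math
import random
import statistics
from statistics import NormalDist

STD_NORMAL = NormalDist()


def neg_log_likelihood(mu, sigma, times, censored):
    """Negative lognormal log-likelihood with right-censored rows."""
    total = 0.0
    for t, is_censored in zip(times, censored):
        z = (math.log(t) - mu) / sigma
        if is_censored:
            # Censored row: all we know is T > t, so use log P(T > t).
            total += math.log(max(STD_NORMAL.cdf(-z), 1e-300))
        else:
            # Exact row: log of the lognormal density at t.
            total += -0.5 * z * z - math.log(sigma * t * math.sqrt(2 * math.pi))
    return -total


def fit_lognormal(times, censored):
    """Compass search over (mu, log sigma); a simple stand-in optimizer."""
    logs = [math.log(t) for t in times]
    mu, log_sigma = statistics.fmean(logs), math.log(statistics.pstdev(logs))
    best = neg_log_likelihood(mu, math.exp(log_sigma), times, censored)
    step = 0.5
    while step > 1e-5:
        improved = False
        for d_mu, d_ls in ((step, 0), (-step, 0), (0, step), (0, -step)):
            trial = neg_log_likelihood(
                mu + d_mu, math.exp(log_sigma + d_ls), times, censored
            )
            if trial < best:
                mu, log_sigma, best = mu + d_mu, log_sigma + d_ls, trial
                improved = True
        if not improved:
            step /= 2  # no direction helped, so refine the search
    return mu, math.exp(log_sigma)


# Simulated data under assumed parameters, censored at 1,900.
random.seed(1506)
TRUE_MU, TRUE_SIGMA, CENSOR_TIME = 7.0, 0.5, 1900.0
true_times = [random.lognormvariate(TRUE_MU, TRUE_SIGMA) for _ in range(1000)]
observed = [min(t, CENSOR_TIME) for t in true_times]
censored = [t > CENSOR_TIME for t in true_times]

mu_hat, sigma_hat = fit_lognormal(observed, censored)
mean_hat = math.exp(mu_hat + sigma_hat ** 2 / 2)  # lognormal mean
print("mu:", round(mu_hat, 3), "sigma:", round(sigma_hat, 3), "mean:", round(mean_hat, 1))
```

The key design point is that a censored row contributes the probability of exceeding 1,900 rather than a density at 1,900, so the fit knows those values lie somewhere in the upper tail.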

For  our  next  example, we're  going  to  look  at  detection  limits.

This  is  a  limit  of  detection  problem

where  we  have  data on  the  yield  of  a  pesticide

called  Metacrate that's  based  on  levels of  some  other  regression  variables.

In  this  situation,

the  measurement  system  that  we  have has  a  lower  limit  of  detection

where  it  can't  measure any  yields  that  are  less  than  1%.

So  in  the  data, they're  just  coded  as  zeros,

but  it  really  just  means it's  some  yield  below  1%.

There  are  two  ways you  could  analyze  this

incorporating  that  information  in  JMP.

The  first, you  could  treat  it  as  left  censoring,

use  two  response  columns,  where  the  left  column  has  a  missing  value

and  the  right  column  is  a  one,

or  you  can  use the  detection  limits  column  property

that's  new  in  JMP  and  JMP  Pro.

We'll  take  a  look  at  this.

Here's  a  subset  of  the  data.

This  Metacrate  Reading  column  is the  same  as  the  original  reading  column,

but  it's  got a  detection  limits  column  property.

Because  this  is  a  lower  detection  limit

where  we  can't  measure any  lower  than  that  limit,

we're  going  to  set the  lower  detection  limit  to  one.

The  other  way  you  could  do  this is  with  the  two  columns.

In  this  case, we  know  that  it's  left  censoring,

so  the  left  side  is  missing and  the  upper  side  of  that  is  one,

just  means  that  the  value is  somewhere  less  than  one.

That's  all  we  know.
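Here is a short Python sketch of that two-column recoding for a lower detection limit. The readings and the `to_two_column` helper are hypothetical, and `None` stands in for a missing value.

```python
def to_two_column(readings, lower_limit):
    """Recode readings below a lower detection limit as left censored.

    Each result row is (left, right): a measured value appears in both
    columns, and a value below the limit becomes (None, lower_limit).
    """
    rows = []
    for y in readings:
        if y < lower_limit:
            # Coded as zero in the data, but really "somewhere below the limit".
            rows.append((None, lower_limit))
        else:
            rows.append((y, y))
    return rows


# Hypothetical yield readings in percent, with a detection limit of 1.
readings = [0.0, 12.5, 3.2, 0.0, 45.0]
two_column = to_two_column(readings, lower_limit=1.0)
for left, right in two_column:
    print(left, right)
```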

But  as  you  can  see  from the  column  information  window  down  here,

the  detection  limits  column  property is  recognized  by  the  Distribution

and  Generalized  Regression  platforms.

So  this  is  a  regression  problem.

We'll  use  Generalized  Regression in  JMP  Pro.

Here  we  fit   a  lognormal  response  distribution,

and  it's  able  to  do  that on  this  Metacrate  reading  column,

even  with  the  zeros  in  there,

because   GenReg's  not  treating those  observations  as  zeros,

it's  treating  them as  values  censored  at  one.
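One way to write down what treating those values as censored at one means is the standard censored-data likelihood. The talk does not show this formula, so take it as a sketch that assumes a lognormal regression in which $\log Y_i$ is normal with mean $x_i^\top \beta$ and standard deviation $\sigma$, with detection limit $d = 1$. Here $\phi$ and $\Phi$ are the standard normal density and distribution function.

```latex
L(\beta, \sigma)
  = \prod_{i:\, y_i \ge d}
      \frac{1}{\sigma y_i}\,
      \phi\!\left( \frac{\log y_i - x_i^\top \beta}{\sigma} \right)
    \;\times\;
    \prod_{i:\, y_i < d}
      \Phi\!\left( \frac{\log d - x_i^\top \beta}{\sigma} \right)
```

Each measured yield contributes a density term, while each reading below the limit contributes only the probability of falling below $d$. That is why the zeros cause no trouble for the lognormal fit: $\log 0$ never appears.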

Now,  we  were  able  to  use all  the  information

and  get  a  regression  model.

In  conclusion,  probably, the  most  important  thing  is

when  you  have  censoring  information,

it's  better  to  use  it  in  your  analysis than  to  ignore  it.

Censoring  occurs  often for  time  responses,

but  it  can  also  occur  for  other  responses.

A  good  example  of  that is  these  limited  detection  problems.

Finally,  you  can  use the  following  approaches

to  specify  censoring  in  JMP.

There's  the  two-column  approach that's  probably  the  most  flexible

because  that  allows  you to  do  right  censoring,  left  censoring,

interval  censoring, as  well  as  a  mix  of  all  three  of  those.

For  right  censoring, you  can  use  the  one  column  response

paired  with  a  binary  indicator  column for  censoring.

There's  also  this  new  column  property in  JMP  for  detection  limits

where  you  can  set  a  limit  of  detection either  on  the  low  side  or  the  high  side.

We've  got  a  few  references  here if  you're  interested  in  more  information.

One  of  those is  a  Discovery  talk  I  did  in  2017

that's  got  more  of  the  background of  how  the  censoring  information  is  used

in  the  calculations  of  these  analyses.

That's  it.  Thank  you.