Well, Not Exactly: An Introduction to Censored Data Analysis - (2023-US-PO-1506)

Hi, my name is Michael Crotty.

I'm a statistical writer with the Stat Documentation Team at JMP,

and today, I'm going to give an introduction

to censored data analysis in JMP and JMP Pro.

To start, we've got three common types of censoring.

Just to back up a bit, censored data occur

when you don't have an exact measurement for an observation,

but you do know a range for the observation,

so you know not the exact value,

but you do know something about where the value might be.

What we want to do by using censoring in our analyses

is to use that information that we have, even if it's not exact.

The three types of censoring that we'll talk about today

are right censoring, left censoring, and interval censoring.

Right censoring is probably the most common form of censoring.

It occurs when the event of interest hasn't had time to occur

by the end of the study.

In a reliability test,

you might have a bunch of light bulbs under test

and at the end of the test period, some of them have failed.

Those are exact observations, but then some haven't failed yet.

You know they're going to fail,

but your study has ended, so it's censored at that point.

Same thing in survival models

where a patient survives to the end of the study.

One thing to note is that, in JMP, right censoring is the only type that

supports a single response column alongside a binary censor column.

The next type is left censoring.

That's where the event of interest occurs before the observation starts.

A common example of that would be where you put a bunch of units under test

and at the time that you do the first inspection,

some of them have already failed.

You know that they started without a failure,

but by the time you checked on them, they had failed.

So they failed sometime before that point.

Another example of that is a limit of detection,

where you have a measurement tool

that can't measure below a certain threshold.

The last type we'll talk about today is interval censoring.

This is where your event of interest happens between observation times.

If you have a periodic inspection schedule instead of continuous observation,

you might see that something fails or something happens

between time two and three.

It didn't happen at time two and it didn't happen at time three,

but it was somewhere in that interval.

Take a quick look at what this looks like in JMP.

Here's an example of the right censoring

with a response column and a censor column.

In the platforms that support censoring,

you always see this censor role, that's for that binary censoring column.

This is the way that you can specify censoring more generally,

which is with two response columns.

Basically, it's like a start time and an end time.

For left censoring,

we don't know when it happened, so the start time is missing,

but the end time, we know it happened before time 50,

so somewhere before that.

Reverse that for right censoring: we know that at time 25,

it hadn't happened yet, but it happened sometime after that.

Then with interval,

both the start and endpoints are non-missing,

but we don't know exactly when the event happened, in this case somewhere between 80 and 150.

It's not shown in the table up here,

but down here, we've got some rows where the observation is exact.

To specify that,

you just use the same value in both columns.

That means essentially it's like an interval with zero width.

It happened at that exact time.
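
As a rough sketch of that two-column encoding, here is how the start and end values map to a censoring type. This is plain Python, not JMP code: the function names are made up for illustration, `None` stands in for a missing value, and the exact row (100, 100) is a hypothetical value, since the talk only gives 50, 25, and 80 to 150.

```python
def classify_censoring(start, end):
    """Classify one observation given as a (start, end) pair.

    None stands in for a missing value (an assumption of this sketch).
    """
    if start is None and end is None:
        return "missing"
    if start is None:
        return "left"       # event happened sometime before `end`
    if end is None:
        return "right"      # event had not happened yet at `start`
    if start == end:
        return "exact"      # interval with zero width
    if start < end:
        return "interval"   # event happened between start and end
    raise ValueError("start must not exceed end")


def from_censor_column(response, censored):
    """Convert a response plus a binary censor flag to a (start, end) pair."""
    # A censored response is right censored; otherwise it is exact.
    return (response, None) if censored else (response, response)


rows = [(None, 50), (25, None), (80, 150), (100, 100)]
labels = [classify_censoring(start, end) for start, end in rows]
print(labels)
```

The second helper shows why the single response column plus censor column is a special case: every such row can be rewritten as either a right-censored pair or an exact pair.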

Next, we're going to talk about two examples of censoring.

The first is if you have censoring in your data,

but maybe you don't know how to handle it,

and so you just think, "I'll just ignore it."

We're going to look at what can possibly happen when you do that.

In this example,

we've got simulated data from a lognormal distribution

and the observed data

that we'll use for analysis in our different cases

is where all the values from the true data that are over 1,900 are set to 1,900,

since that's the censoring time. This is right censoring.

There are a few possible things you could do

if you're trying to estimate this mean failure time.

You could do nothing.

You could just use this observed data with a whole bunch of values set to 1,900,

act like that's when it happened.

You could treat those as missing values, just drop them from your data,

or you could use the censoring information that you have in your analysis.

For right censoring, these first two approaches

are going to tend to underestimate the mean failure time

because you're dropping information from the data at that far end.

Looking more closely at this, because this is simulated data,

we have the true distribution here in this first column.

That's just for comparison, but in general, you wouldn't have that

because for all the values above 1,900,

you don't know where they fall.

In our observed Y,

this is where we just use all the 1,900s as values of 1,900.

We have no missing values,

but a big point mass at the top of our distribution here.

You can see that the mean is a lot smaller than the true mean.

In this missing Y column, this is where instead of treating them as 1,900,

we drop them.

We set them to missing and analyze the distribution without them.

Here you can see that now our maximum of the non-missing values

is less than 1,900, which really doesn't make any sense

because we know that a bunch of them, 21 observations, in fact,

are some value greater than 1,900.

So this underestimates the mean even more.
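
To make the comparison of those first two approaches concrete, here is a small simulation sketch in Python. It is not the data from the talk: the lognormal parameters, the sample size, and the seed are all assumptions chosen for illustration. Only the censoring time of 1,900 comes from the example.

```python
import random
from statistics import mean

random.seed(1506)

MU, SIGMA = 7.0, 0.6   # assumed lognormal parameters on the log scale
CENSOR_TIME = 1900     # censoring time from the example
N = 200                # assumed sample size

# "True" failure times, which you would not have in practice.
true_y = [random.lognormvariate(MU, SIGMA) for _ in range(N)]

# Approach 1: do nothing, so censored rows sit at the censoring time.
observed_y = [min(y, CENSOR_TIME) for y in true_y]

# Approach 2: treat censored rows as missing and drop them.
missing_y = [y for y in true_y if y < CENSOR_TIME]

n_censored = N - len(missing_y)
print("censored rows:        ", n_censored)
print("mean of true values:  ", round(mean(true_y), 1))
print("mean, censored = 1900:", round(mean(observed_y), 1))
print("mean, censored dropped:", round(mean(missing_y), 1))
```

Whenever at least one value is censored, the ordering is guaranteed: clamping pulls large values down to 1,900, and dropping them removes even those 1,900s, so the dropped mean is the smallest of the three.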

Then on the right here,

we've got an analysis in the Life Distribution platform in JMP.

This is where we're using the observed Y column.

It's got those 1,900s,

but we're also using a censoring column alongside it.

For the rows where observed Y is 1,900,

our censor column is going to say that it's a censored observation.

Here we can see that our mean,

it actually ends up being a little higher than the true mean,

but our lognormal parameter estimates are much closer to the true values

and we're incorporating all the information that we have.
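
Here is a minimal sketch of what using the censoring information can look like numerically. It fits a lognormal to right-censored data with an EM algorithm on the log scale, which converges to the censored maximum likelihood estimates. This is a technique swapped in for illustration and is not necessarily how JMP computes its estimates. The function name, the simulated data, and the parameter values are all assumptions.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def fit_right_censored_lognormal(times, censored, iterations=500, tol=1e-10):
    """Estimate lognormal (mu, sigma) by EM; flagged rows are right censored."""
    logs = [math.log(t) for t in times]
    n = len(logs)
    # Start from the naive estimates that treat censored values as exact.
    mu = sum(logs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / n)
    for _ in range(iterations):
        sum_x = 0.0
        sum_x2 = 0.0
        for x, is_censored in zip(logs, censored):
            if is_censored:
                # E-step: moments of a normal truncated to values above x.
                z = (x - mu) / sigma
                hazard = STD_NORMAL.pdf(z) / (1.0 - STD_NORMAL.cdf(z))
                cond_mean = mu + sigma * hazard
                cond_var = sigma ** 2 * (1.0 + z * hazard - hazard ** 2)
                sum_x += cond_mean
                sum_x2 += cond_var + cond_mean ** 2
            else:
                sum_x += x
                sum_x2 += x * x
        # M-step: update the mean and standard deviation of the log times.
        new_mu = sum_x / n
        new_sigma = math.sqrt(sum_x2 / n - new_mu ** 2)
        converged = abs(new_mu - mu) < tol and abs(new_sigma - sigma) < tol
        mu, sigma = new_mu, new_sigma
        if converged:
            break
    return mu, sigma


random.seed(1506)
MU, SIGMA, CENSOR_TIME, N = 7.0, 0.6, 1900, 2000   # assumed values

true_y = [random.lognormvariate(MU, SIGMA) for _ in range(N)]
observed_y = [min(y, CENSOR_TIME) for y in true_y]
censor = [y > CENSOR_TIME for y in true_y]

# Naive estimates: treat every 1,900 as an exact failure time.
naive_logs = [math.log(y) for y in observed_y]
naive_mu = sum(naive_logs) / N
naive_sigma = math.sqrt(sum((x - naive_mu) ** 2 for x in naive_logs) / N)

mu_hat, sigma_hat = fit_right_censored_lognormal(observed_y, censor)
print("naive:   ", round(naive_mu, 3), round(naive_sigma, 3))
print("censored:", round(mu_hat, 3), round(sigma_hat, 3))
```

The censored fit uses the same observed column as the naive fit. The only extra input is the censor flag, which tells the fit that each 1,900 means "some value greater than 1,900".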

For our next example, we're going to look at detection limits.

This is a limit of detection problem

where we have data on the yield of a pesticide

called Metacrate that's based on levels of some other regression variables.

In this situation,

the measurement system that we have has a lower limit of detection

where it can't measure any yields that are less than 1%.

So in the data, they're just coded as zeros,

but it really just means it's some yield below 1%.

There are two ways you could analyze this

incorporating that information in JMP.

The first, you could treat it as left censoring,

use two response columns, where the first (left) column has a missing value

and the right column would be a one,

or you can use the detection limits column property

that's new in JMP and JMP Pro.

We'll take a look at this.

Here's a subset of the data.

This Metacrate Reading column is the same as the original reading column,

but it's got a detection limits column property.

Because this is a lower detection limit

where we can't measure any lower than that limit,

we're going to set the lower detection limit to one.

The other way you could do this is with the two columns.

In this case, we know that it's left censoring,

so the lower side is missing and the upper side is one,

which just means that the value is somewhere less than one.

That's all we know.

But as you can see from the column information window down here,

the detection limits column property is recognized by the Distribution

and Generalized Regression platforms.

So this is a regression problem.

We'll use Generalized Regression in JMP Pro.

Here we fit a lognormal response distribution,

and it's able to do that on this Metacrate Reading column,

even with the zeros in there,

because GenReg's not treating those observations as zeros,

it's treating them as values censored at one.

Now, we were able to use all the information

and get a regression model.
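
To see why treating the zeros as censored makes a lognormal fit possible, here is a small Python sketch of a lognormal log-likelihood with a lower detection limit. It covers only an intercept-only model, not the regression, and it is not JMP's implementation. The function name, the parameter values, and the readings are made up for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def lognormal_loglik(readings, lower_limit, mu, sigma):
    """Log-likelihood where readings below lower_limit are left censored at it."""
    total = 0.0
    for y in readings:
        if y < lower_limit:
            # Left censored: all we know is that the value is below the limit.
            z = (math.log(lower_limit) - mu) / sigma
            total += math.log(STD_NORMAL.cdf(z))
        else:
            # Exact reading: lognormal log-density at y.
            z = (math.log(y) - mu) / sigma
            total += math.log(STD_NORMAL.pdf(z)) - math.log(sigma * y)
    return total


# Hypothetical yields in percent; 0 is the code for "below the 1% limit".
readings = [0, 0, 1.8, 3.5, 12.0, 27.4]

# Taking the log of a zero directly fails, which is why a plain
# lognormal fit cannot use these rows as they are coded.
try:
    math.log(0)
    naive_fails = False
except ValueError:
    naive_fails = True

loglik = lognormal_loglik(readings, lower_limit=1.0, mu=1.0, sigma=1.5)
print(naive_fails, round(loglik, 3))
```

The censored rows contribute the probability of falling below the limit rather than a density at zero, so every row still adds information to the fit.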

In conclusion, probably, the most important thing is

when you have censoring information,

it's better to use it in your analysis than to ignore it.

Censoring occurs often for time responses,

but it can also occur for other responses.

A good example of that is these limit of detection problems.

Finally, you can use the following approaches

to specify censoring in JMP.

There's the two-column approach that's probably the most flexible

because that allows you to do right censoring, left censoring,

interval censoring, as well as a mix of all three of those.

For right censoring, you can use the one-column response

paired with a binary indicator column for censoring.

There's also this new column property in JMP for detection limits

where you can set a limit of detection either on the low side or the high side.

We've got a few references here if you're interested in more information.

One of those is a Discovery talk I did in 2017

that's got more of the background of how the censoring information is used

in the calculations of these analyses.

That's it. Thank you.