Sepsis Prediction from Clinical Data Using JMP Pro 16 (2022-EU-45MP-1010)

Stanley Siranovich, Principal Analyst, Crucial Connection LLC

Sepsis is a life-threatening condition which occurs when the body's response to infection causes tissue damage, organ failure, or death. In fact, sepsis costs U.S. hospitals more than any other health condition, and a majority of these costs is for sepsis patients who were not diagnosed at admission. Thus, early detection and treatment are critical for improving outcomes.

This presentation examines an actual clinical data set, obtained from two U.S. hospitals and recently published on Kaggle. In particular, a number of predictors, drawn from a combination of vital signs, demographic groups, and clinical laboratory data are examined. Using JMP, such issues as missing values, outliers, and a highly unbalanced, categorical outcome variable are dealt with. In addition, this presentation shows how visualization, interactivity, and analytical flow can lead to a more compact and integrated analysis — and a shorter time to discovery.

Good morning. Good afternoon.

Good evening, everyone.

My name is Stan Siranovich,

and I am the principal analyst at Crucial Connection, LLC.

And I am located in Jeffersonville, Indiana, right across the river

from Louisville, Kentucky, United States of America.

And today I'm going to talk about sepsis

predictions from clinical data using JMP Pro 16.

But almost all of what I am going to do

here will be available in the standard version of JMP.

Now let's talk about sepsis for a minute.

To start off, sepsis is a life-threatening condition which occurs when the body's

response to infection causes tissue damage, organ failure, or even death.

In fact, sepsis costs United States

hospitals more than any other health condition.

So if we could predict sepsis and detect it early, we could improve the outcome

of critical care patients and also lower the cost of health care.

So we're going to look at a data set

today, and this is an actual data set of clinical data.

It was collected from two hospitals

in the Boston, Massachusetts, area in the United States.

And these two data tables were published as a contest on Kaggle by a Cardiology

group, and the results were eventually published in the Cardiology Journal.

Now, there were three units involved

in this study, and in this context, I have the data for what we'll call unit

one and unit two, which is data from two ICUs (intensive care units).

The third group of data was not made publicly available and was held back

for the contest, and it's still not publicly available.

So what we'll do now is examine the data and see if we can predict

sepsis and what variables we should be following to avoid sepsis, which,

of course, can be a life-threatening condition.

Now I have a data set in front of me,

and this was downloaded from the Kaggle site and imported into JMP.

And let's take a closer look at it.

Usually I like to jump right into the data analysis,

but in this particular case, that's not going to be a good idea.

And after we look over the data, you'll see why.

First of all, let's look over at the left

of the JMP data table, and we have the columns window.

I prefer to do a lot of the work from here

and here we see a number of variables.

So let's start with heart rate

right here.

That's going to be an important predictor, probably, but we don't know that.

And we have the rest of the predictors over here.

And there are actually 40 of them.

Well, no, 38, if you don't count the units: O2 saturation, et cetera.

And we could go through the list right here: temperature.

But what you'll notice, and for this I'm going to have to scroll,

is that the first six or eight columns are just clinical data.

We have systolic blood pressure.

We have respirations, diastolic blood pressure.

And as we go across our data table, we see some lab data.

We're talking about glucose and lactate

levels, magnesium, phosphate, potassium, bilirubin, et cetera, et cetera.

And finally, we have a set of columns,

which I guess we could call demographic data.

We have the age right here, gender, and what unit they belong to.

Now, while we're on units, let's take a look here.

We have unit one and unit two.

And we know from doing our background

research (and we all do background research,

right, before we start the analysis?)

that one is a cardiology unit and the other one is a surgery unit.

So if they're in unit two, we have a one, standard practice,

and if they're in unit one, which they're not, we have a zero.

Also standard practice.

But notice something else here.

We have a lot of rows where there is no unit.

Now, we don't know

where that data came from, not unit two.

And the background that was published on the Kaggle site tells us that the third

unit is going to be held back to score the model.

So it wasn't made public.

So we don't know where that data came from.

And I'll discuss that in further detail in a little bit.

Now let's scroll back over and look at some other things about this data table.

We have a whole lot of missing data. Let's take a look at some

of the columns here, and we can just scroll down.

Here's the bilirubin direct.

There's one we had to go down 63 rows before we found one set of data.

Here's another one at row 113.

So there's not a whole lot of data in there.

As a matter of fact, that column is only

3% populated, and there's a whole lot of other columns

that are populated at a similar rate.

That was the worst example, but there are a whole lot at five and ten percent.

And we also have the problem with the missing unit assignment.
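The "percent populated" check described above can be sketched outside JMP in a few lines of Python. This is only an illustration: the rows, the column names, and the helper names `fill_rate` and `sparse_columns` are hypothetical stand-ins, not the actual Kaggle table or anything JMP provides.

```python
# Hypothetical rows standing in for the sepsis table; None marks a missing value.
rows = [
    {"HR": 97.0, "Temp": None, "Bilirubin_direct": None},
    {"HR": 89.0, "Temp": 36.9, "Bilirubin_direct": None},
    {"HR": None, "Temp": 37.2, "Bilirubin_direct": 0.3},
    {"HR": 103.0, "Temp": None, "Bilirubin_direct": None},
]


def fill_rate(rows):
    """Return {column: fraction of rows that hold a non-missing value}."""
    columns = rows[0].keys()
    n = len(rows)
    return {c: sum(r[c] is not None for r in rows) / n for c in columns}


def sparse_columns(rows, threshold):
    """List the columns whose fill rate falls below the threshold."""
    return [c for c, rate in fill_rate(rows).items() if rate < threshold]


rates = fill_rate(rows)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%} populated")

# Candidates to hide and exclude, using an assumed 50% cutoff.
print(sparse_columns(rows, 0.5))
```

The cutoff is a judgment call; the talk does not state the threshold that was used to hide and exclude columns.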

So let me close

that data table and I will open another one.

There we are.

Now, I made some modifications

to that first table and I decided to save some time and not make you sit through

and just watch me clicking on columns.

Notice over here in the columns area, we have these two symbols right here.

One hides columns and the other one excludes them from the analysis.

And if you'll notice this one particular

one, EtCO2, it was right about here in the first

data set and it's missing.

So we hid them and we're going to exclude them from the analysis.

I also did two things, which are just personal to me.

Number one, I moved our target variable.

What we want to predict is sepsis

to the left, so that when I view the data table, I can just scan across the rows and

see some relationships if there's something I want to see.

And I also moved this one over because I knew,

just from, for lack of a better term,

general intuition, that this was going to be an important variable, and that's

ICULOS, which is intensive care unit length of stay.

Now let's look at some other things around here.

I want to note one other thing I excluded.

Where was it?

Right here. You can't see it: hospital admission time.

And what that is, we think, is the time between when they were admitted

to the hospital and the time they were admitted to the ICU.

And a lot of those numbers are negative, but there's nothing in the documentation

that tells us how you can have a negative time.

So I excluded those also.

And the other rows that I excluded, like the bridge in there right above it.

It's the same deal with those.

There's a lot of missing data,

so I just excluded those.

And let's see,

is there anything else I need to do here?

No, it's time to do the analysis.

But before we do that, let me tell you what the overall plan is going to be.

And that is after we examine the data

and clean it and prep it, which we've already done,

we're going to look at the individual units.

We're going to look at is sepsis,

in other words, whether or not the people develop sepsis,

and we're going to do some database management.

So let's get started.

The first thing I'd like to do

is go up here to Tables

and then go to Subset, and we get the pop up window from JMP.

It says create a new data table, et cetera, et cetera.

So let me click on that.

And of course, I want to check this box here that says Subset by.

And I'll scroll down because we knew there were two units and there was also,

in effect, a third unit, which was neither unit one nor unit two.

So let me just click on unit one,

and I want to pull in all the rows from that one.

And I'd like to keep a By column just as

a safety check, and you'll find out why in just a minute here.

And we could keep the dialogue open.

But I've run through this analysis before,

so right now, hopefully I won't make a mistake.

I won't change my mind.

And there'll be no need to keep the dialogue box open.

And

for right now, we'll skip the output table name,

and I don't want to link it to the original data table,

but I'll come down here to this box and save the script to the source table

and take one last look over this and everything looks okay.

And I'll click OK.

Now let me separate these a little bit.

Remember, we had unit one, unit two,

and missing, which was neither unit one nor unit two.

And I've got three data tables here.
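For readers who want to see the same split outside JMP, here is a minimal Python sketch of subsetting by one column so that missing values form their own group. The rows and the `subset_by` helper are hypothetical; only the idea (one table per distinct value, with the By column kept as a check) comes from the talk.

```python
from collections import defaultdict

# Hypothetical rows; Unit1 is 1, 0, or None when no unit was recorded.
rows = [
    {"patient": "a", "Unit1": 1},
    {"patient": "b", "Unit1": 0},
    {"patient": "c", "Unit1": None},
    {"patient": "d", "Unit1": 1},
    {"patient": "e", "Unit1": None},
]


def subset_by(rows, column):
    """Split rows into one list per distinct value of `column`, missing included."""
    groups = defaultdict(list)
    for row in rows:
        # The By column stays in every row, so each subset can be checked later.
        groups[row[column]].append(row)
    return dict(groups)


subsets = subset_by(rows, "Unit1")
for value, group in subsets.items():
    label = "missing" if value is None else f"Unit1={value}"
    print(label, [r["patient"] for r in group])
```

Keeping the By column in each subset is what makes it possible to confirm that a table labeled as one unit really holds the missing-unit rows.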

And at the titles,

it says unit one, JMP Pro.

Well, let's just look at that.

It says here unit one.

And if I scroll over,

it says unit one and unit two.

And this is the reason I kept the By columns right here.

It's all missing data.

And we scroll down a little bit just to double check.

And yeah, it looks like they're missing.

So it says unit one.

So what we want to do is

relabel that: right-click, hit Edit, and we can change that to missing.

And I'll type it in and we'll hit OK,

and we now know what that is.

It says unit one equals...

I mistyped; it should be missing,

but I'll let it go for right now because we can't analyze that because we don't

know where it came from or rather, we don't want to analyze it.

So I'll just close that out and I don't want to save the changes.

Now, this one says unit one equals zero.

And I'll come over here and yeah, sure enough, there's zero in unit one,

which means it does not belong to unit one.

And over here it says unit two and there's one in the columns.

Scroll down a little bit just to make sure.

And yeah, it says unit two.

Okay, so what we'll want to do is go up

here and we can click that and we could edit one, edit it.

And now I want to change that

and we'll do that.

And here where it says subset script.

We can go up here

and I want to change this.

I can edit that and I'll change that to unit two, but I won't do it right now.

And it says right here, if we want to check, we're looking at unit

and right here it says keep by column one, unit one, and that's

the column it is.

So I'll just cancel out of that for now.

And it's the same with this data table over here.

We can go to source code, or rather the source edit.

It says keep by columns one, columns one,

and we'll cancel out of that.

And that's it for the units for right now.

Now, what we could do is go up here

and do another

table platform.

We've got Summary and Subset and whatnot. We could click on Summary,

and we could drag

what we want to summarize in here and we could pick our statistics,

whether we want the count, the mean, standard deviation, the median,

excuse me, and a whole lot of other stuff, but we'll skip that for now

and let me cancel that out

and I'll close these two tables and I won't save them because I have

a couple of tables down here where I already did this.

So let me come up here and I'll select unit one

and unit two and I'll open them.

Now, if we wanted to see the difference in outcomes, for example,

between unit one and unit two, we could analyze both of these data sets

separately, but in the interest of time, we will not do that.

What we will do is combine them

and analyze them together.

Now, let's show another feature of JMP

that makes the data handling part of our job easy.

And let me go up to tables

and what we want to do here is concatenate and the helper window shows up.

It says combines rows from several data tables.

We have a number of other selections we

could have made to avoid us having to write some SQL code.

But here we want to concatenate it.

And unit one showed up on top and that's good.

And what we want to do is concatenate unit two.

So I'll click on that and I'll click that and let's give this a name.

Let's call this.

How about both

this or not? Check it.

I'll just check it here now.

And we could create a source column and again as a check just to make sure

everything is proceeding like we want it to proceed.

I'll create the source column and I could keep the dialogue box open.

That's right here.

But I will close it for now since hopefully I didn't make any mistakes.

Won't have to go back

and we'll click the run button

and let's see. It didn't come up.

Let me try that again.

Let me close that window.

We'll leave it like that and let's see what happens.

There we go.

Must have fat fingers this morning.

And this is a combined data table and as I mentioned, I'd like to keep that open.

It's our source table.

Normally what I do is drag that somewhere off the edge of the table.

But for right now, we'll leave it there and we'll just scroll down a little bit

and we note that we have unit one, unit one.

There we go. Unit two.

So that serves as a check.

So we have that there.
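The concatenate-with-source-column step can be sketched in plain Python as follows. The table contents, the `isSepsis` and `HR` column names, and the `concatenate` helper are assumptions made for illustration; the source column plays the same role as the check column described above.

```python
def concatenate(tables, source_column="Source Table"):
    """Stack rows from several named tables, tagging each row with its origin."""
    combined = []
    for name, rows in tables.items():
        for row in rows:
            # Copy the row and record which table it came from.
            combined.append({**row, source_column: name})
    return combined


# Hypothetical per-unit tables.
unit1 = [{"HR": 97.0, "isSepsis": 0}, {"HR": 110.0, "isSepsis": 1}]
unit2 = [{"HR": 82.0, "isSepsis": 0}]

both = concatenate({"Unit 1": unit1, "Unit 2": unit2})
for row in both:
    print(row)
```

Scrolling the combined table and watching the source column change from one unit to the other is the same sanity check performed in the demo.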

Now it's time finally to start the analysis.

We can do a number of things here.

First of all, we note that most of our variables are continuous

except for our target variable, which is binary on or off, yes or no, et cetera.

But let's just take a look at the distributions.

I always like to point this out in the analysis.

This is one of my favorite features of JMP.

You can do a quick inspection right here

to see if there's anything weird and let's see.

Okay, scroll back over.

I don't see anything that pops out at me.

It looks like unit one and unit two.

We're right in there.

One thing to note right here is sepsis equals one.

There are a lot fewer rows where the patient went into sepsis.

And let's see, I think it was about 7%.

So we're looking at 93% here and 7% here.
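A quick way to quantify that imbalance is to count the outcome labels. The sketch below uses a made-up list of 100 labels that mirrors the rough 93/7 split mentioned in the talk; the real row counts are not given there.

```python
from collections import Counter

# Hypothetical labels matching the approximate 93% / 7% split from the talk.
labels = [0] * 93 + [1] * 7


def class_shares(labels):
    """Return {label: share of rows carrying that label}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


shares = class_shares(labels)
print(shares)

# A model that always predicts the majority class is right this often,
# which is why raw accuracy is a weak yardstick on unbalanced outcomes.
baseline_accuracy = max(shares.values())
print(baseline_accuracy)
```

With a split like this, a model that always answers "no sepsis" already scores 93% accuracy, so any real model has to be judged on how well it finds the 7%.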

So let me get rid of that.

That's what I like to do.

Now let's go up to the analyze menu, and we're going to make use of this

pretty much exclusively from now until the end of the presentation.

So I'm going to go down here,

I'm going to choose multivariate methods

from the drop down, and I get another drop down.

And what I'm going to do, come on, is choose Multivariate.

And this window pops up and wants to know the Y columns.

By the way, this is another reason I like

to put the target variable over on the left.

It's right here,

and we can pop it into the Y column and we don't have to scroll and hunt for it.

So let's see, we've got everything in there.

We don't want unit one or unit two.

What else do we have?

Gender, age?

I'll tell you what, let's

put them all in there

and click Y, Columns,

and hit OK.

And here's what we get.

We get a correlation matrix.

Now let's take a little closer look at this.

It's a little confusing because we put a whole lot of variables in.

But again, that's one of the advantages to JMP.

You can pop them all in there and don't have to write extra code for it.

So we have our diagonal here.

Our matrix is reflected along the diagonal.

So it's the same data top and bottom and some different colors here.

One means a high correlation.

It's statistically significant, which is what we expect.

The blood pressure, for example, should be correlated fairly well with itself.

But we also note that right here, right under it,

it's correlated with something called MAP, which is mean arterial pressure.

So that's a mean between the systolic and the diastolic.

So that makes sense.

And we have DBP, which is diastolic blood pressure.

So that makes sense.

And there's not a whole lot we can see here.

Here's another correlation.

Okay.

That's BUN for the urea content.

Blood urea nitrogen is what it stands for.

That's the urea content of the blood,

which is a byproduct of cellular physiology.

And that looks like it's correlated

with something over here, potassium, but pretty much that's about it.

Here's hematocrit, which looks like it's correlated with, right over here, the HGB.

If we read over,

everybody see that square here?

Take a second to look at it.

And HCT is hematocrit, and HGB is hemoglobin.

And HGB is a direct measure of hematocrit, or excuse me, the hemoglobin.

And the hematocrit is, I believe, the volume fraction of red blood cells.

So you would expect them to be correlated.
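To make the correlation matrix concrete, here is a small Python sketch that computes Pearson correlations among three blood pressure columns. The readings are invented, and MAP is derived with the common clinical approximation MAP ≈ DBP + (SBP − DBP) / 3, which is an assumption here; the talk does not say how the MAP column in the data set was obtained.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Return a nested dict holding the correlation of every column pair."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical readings in mmHg.
sbp = [120.0, 95.0, 140.0, 110.0, 88.0]
dbp = [80.0, 60.0, 85.0, 70.0, 55.0]
# Assumed approximation: MAP = DBP + (SBP - DBP) / 3.
map_ = [d + (s - d) / 3 for s, d in zip(sbp, dbp)]

matrix = correlation_matrix({"SBP": sbp, "DBP": dbp, "MAP": map_})
for name, row in matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

Because MAP is built from SBP and DBP, its strong correlation with both is expected by construction, which is the same reasoning applied to hematocrit and hemoglobin above.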

So

not a whole lot to see here outside of that.

May as well close that.

Next, we're going to go back up to the analyze menu.

We're going to go to analyze.

And from the drop down, we're going to go to screening.

And again, we have another drop down.

And let's hover over this.

It's called Predictor Screening.

It screens many predictors for their ability to predict an outcome.

So we want to be able to predict

whether or not a particular patient is

going to develop sepsis, so that looks like a good choice.

So we click it and this is the window that we get.

And again, we want to know is sepsis.

One is yes, zero is no.

So we click that and we're presented

with the range again, the same range we had before.

And let's do what we did before.

Start here.

We're going to go down here to gender,

ignoring the units again, hit the shift button,

click and we select all of them and we're going to hit the X button.

And there's nothing else there for us to take note of.

Doesn't look like anything else to click.

So do that. We'll click.

Okay.

And there we go.

And JMP tells us that it's doing a bootstrap forest.

We could do a whole presentation on bootstrap forest, but we don't have time.

In fact, we could probably do two or three or four.

And we are getting there.

It's scoring the results and we just have to wait.

It's taking a while for some reason.

And there we are.

Let's look at what we have here.

We have the contribution, which is the net contribution, not scaled

to the model.

And, I have to use the word again, the portion

that contributes to the model.

And you can think of this as a weight

fraction or if you prefer, multiply by 100 in your head to make it a percent.

And if we just take a quick look at that, we see it at 0.61, and three.

So we're at 0.74, six.

Looks like it takes us up to 0.8, or 80%

explanation.

So that's what that is.

So all those make sense.

Now let's look at that ICULOS.

Before, I talked about excluding that. Well, intensive care unit length of stay

predicts what looks like more than the others combined, probably.

However,

if we use that, that would be a little bit of circular reasoning.

If people develop sepsis, they're almost certainly going to end up in the ICU.

If they are really sick, they may be at higher risk of developing sepsis.

So they are going to end up in the ICU.

So if they're in the ICU, they're probably pretty sick to begin with.

They're already developing sepsis and they're going to be in there for a while.

So

let's go up here and exclude that.

That really

doesn't help us too much because it's not

something we can measure like blood pressure.

I mean, they're already in there or they're not in there.

So let's exclude that and we'll go up here to Analyze,

back down to Screening,

Predictor Screening.

Hit "is sepsis" for the Y response.

And let's leave that one out.

Let's leave the ICULOS out and we'll do everything exactly the same as before.

Hit Shift, gender, select everything.

Hit the X button.

Nothing else for us to do there.

It doesn't look like it. Hit OK.

And we'll just wait for a little while.

Again,

looks like it's running a little bit faster this time.

And here we go.

Now we had the ICULOS completely out

of the picture and we see something else here.

The BUN, blood urea nitrogen, looks like it's in the running for a significant

predictor. After that, temperature, which makes sense.

If you develop sepsis, you have an infection.

So you're probably going to have a temperature.

Creatinine is a byproduct of muscle breakdown.

So that makes sense.

And remember, we did our research before

we started the analysis. And after that, respirations; that makes sense.

Shallow, rapid breathing. Hemoglobin content, hematocrit.

Okay, they're highly correlated. Blood pressure.

And this WBC I didn't point out before, but that's white blood cell count.

So that makes sense too.

Now we have a decision to make.

This is obviously

the most prominent. I don't want to say important.

We're not sure of that yet, but it's the most prominent.

After that comes temperature and creatinine.

And then there's a large drop

in the rankings and the importance of the rankings in the portion here.

So we have a decision to make.

So

let's start up there with the blood urea nitrogen and let's go down.

Let's pull in as much as we can because JMP is going to make this,

all the repetitive tasks,

all the calculations, easy for us by taking them away from us.

So Shift click.

Let's go down to systolic blood pressure.

That's probably going to play a role

because if you have sepsis, you tend to have very low blood pressure.

Dangerous.

And we have some other measurements down here, but we'll just skip those.

By the way, these two are going to be correlated.

This one right here is the partial pressure of the carbon dioxide

in the blood and this is the bicarbonate content of the blood.

So those are going to be related.

So it doesn't look like there's anything else of importance.

And JMP puts in a handy link there.

It says copy selected.

So let's do that.

I copied the selected and I'll just leave

that open for right now and we'll go up here to analyze once again.

And what we want to do now is fit the model.

It says fits a linear regression.

So let's go there.

And since we copied the selection

in the previous window, JMP remembered that for us.

And what we want to do is click the add button

and we'd like them to construct the model effects.

That means we want to use them as a modeling variable.

And what else we have here.

Notice this upper right hand corner we have something called personality.

So focus on that right hand corner for just the next 30 seconds or so.

And I'll go over here and hit "is sepsis" and we'll put that in the Y.

I'll look at the upper right hand corner personality.

I click the Y and I get a choice here.

I get some choices in the drop down menu

and I get an emphasis window and I'll click the drop down triangle here

and I have a whole lot of choices here and we won't go over them right now.

But probably what I want is a generalized linear model.

And if I hover over it, let me try that again.

It gives us a pop up window.

It says fits a generalized model, and try it once again.

And I get to select the distribution

and the link and all that sort of thing, which I'll go over in a couple of seconds

here, and I get another drop down here a link function.

So let's start with distribution.

And

remember, we expanded that top column and we got our distributions.

We took a look at it didn't look like

there's anything weird there, at least on a macro scale.

So let's just pick normal.

And I want logit

for the link function

because we've got a binary variable that we're trying to predict.

And let's see,

take a closer look.

Nothing else left for me to do.

It doesn't look like.

So I'll hit the run button

and here we go.
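As a rough illustration of what a logit link does with a binary outcome, the sketch below fits a plain logistic regression by gradient ascent in pure Python. Note the swap: the demo selects the Normal distribution with a logit link, whereas this sketch uses the more conventional pairing for a 0/1 response, a binomial likelihood with a logit link. It is not JMP's fitting algorithm, and the single standardized predictor, the labels, the learning rate, and the step count are all hypothetical.

```python
import math


def sigmoid(z):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Predicted probability of the positive class; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit intercept and slopes by gradient ascent on the binomial log-likelihood."""
    weights = [0.0] * (len(xs[0]) + 1)
    n = len(xs)
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for x, y in zip(xs, ys):
            error = y - predict(weights, x)
            gradient[0] += error
            for j, xj in enumerate(x):
                gradient[j + 1] += error * xj
        weights = [w + learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights


# Hypothetical standardized lab value (for example BUN) and 0/1 sepsis labels.
xs = [[-1.5], [-1.0], [-0.5], [0.0], [0.5], [1.0], [1.5], [2.0]]
ys = [0, 0, 0, 1, 0, 0, 1, 1]

weights = fit_logistic(xs, ys)
print("intercept and slope:", [round(w, 3) for w in weights])
print("P(sepsis | x = 2.0):", round(predict(weights, [2.0]), 3))
```

A positive slope means the modeled probability of sepsis rises with the predictor, which is the kind of reading one would look for in the parameter estimates of the JMP report.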

Here's our generalized linear model fit, and it gives us a summary up

here of what we looked at.

We got a chi-square.

I won't go over that in any great detail, but let's scroll down a little bit more,

cut that off.

Here we have an effect summary.

And just by the way, if we click on these triangles,

we can hide them or make them appear again.

So depending on what we want to present,

I'll just leave them all open for right now.

And what we have up here is the source.

And there it is.

Blood urea nitrogen.

The BUN is up there very high again, followed by temp, creatinine,

white blood cell count, heart rate, and some blood pressure measurements.

And all this makes sense from what we know

about sepsis. And LogWorth is the contribution.

And over here we have P value and we see that they're all very highly significant

up to right about here, the white blood cell count.

And we see a blue line here.

And what that is, is the LogWorth of two.

And the reason we take the LogWorth

is because we'd like to be able to show things on the graph.

So we put it on a log scale and that just makes it easier.

Otherwise the bar up here would be off the edge of my screen.

And this blue line here is

a significance level,

because it's the LogWorth,

and the 0.01 significance level is a log of negative two.

Excuse me, it's a LogWorth value of two, since the log of 0.01 is negative two.

Get rid of the negative sign.

And that's what this blue line is.
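The arithmetic behind that reference line is short enough to check by hand: LogWorth is the negative base-10 logarithm of the p-value, so a p-value of 0.01 sits at a LogWorth of 2. A minimal Python sketch, with a helper name chosen here for illustration:

```python
import math


def logworth(p_value):
    """LogWorth is the negative base-10 logarithm of a p-value."""
    return -math.log10(p_value)


# Smaller p-values give larger LogWorth, so long bars mean strong evidence.
for p in (0.05, 0.01, 0.0001):
    print(f"p = {p}: LogWorth = {logworth(p):.2f}")

# An effect clears the reference line when its LogWorth exceeds that of p = 0.01.
threshold = logworth(0.01)
print(logworth(0.0001) > threshold)
```

Putting p-values on this scale keeps very small values readable on a bar chart, which is the point made about the bar running off the edge of the screen.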

So these are all significant up to HR.

And if we come down here,

we have the chi-square results and we see some significance levels here, too.

And basically, we're looking at the respiration,

the BUN and the creatinine and also the temperature.

They're all highly... I almost forgot the white blood cell count down here.

And I'm running out of time, so I won't explain a whole lot about that.

But if we scroll down a little bit more, we can get an estimate

for our predicted variables here.

This is an estimate of the exponent and it gives us more statistical data on that.

And I'm starting to run out of time.

So let me just minimize those windows

and we'll get rid of all the highlights and let's recap

what we had here.

Here is our original, highly cleaned and rearranged data table.

We want to predict sepsis, which is binary.

And we ruled out the length of stay in the ICU, that is right here, because

it didn't help us and it was kind of circular logic.

And we've got our variables in three separate groups up here.

We start off with the clinical and then we come over here and we had all

our blood tests and then we had the demographic data and we had two units

and we excluded all 25 or 30% of the data because

the data wasn't assigned to either unit and we don't know where it came from.

So we got rid of that and then we subsetted everything.

Remember, we got two, well, actually, we got three separate subsets because

the third subset was the missing data "unit," unit being in quotes.

And we went from there and we went

to the multivariate to look for some correlations.

Then we went to the analyze screening, predictor screening,

and we got what we

figured were going to be our most valuable predictors to predict sepsis.

And finally we went to fit model and let me

reiterate on that.

We went up to analyze

fit model

and we clicked that and we got this window

and we dumped everything in there except for what we wanted to exclude.

And we put sepsis right here in this y variable.

And remember, this is the area that we had

to focus on up here in the upper right hand corner.

We had the personality and a couple

of other selections to make, and we selected from that.

And we got our results, which I

went over about a minute ago, and that is the end of the presentation.

I hope everybody enjoyed it and even learned a little something from it.

Thank you for watching, listening, and giving it your attention.