Good morning. Good afternoon.
Good evening, everyone.
My name is Stan Saranovich,
and I am the principal analyst at Crucial Connection, LLC.
And I am located in Jeffersonville, Indiana, right across the river
from Louisville, Kentucky, United States of America.
And today I'm going to talk about sepsis
predictions from clinical data using JMP Pro 16.
But almost all of what I am going to do
here will be available in the standard version of JMP.
Now let's talk about sepsis for a minute.
To start off, sepsis is a life-threatening condition which occurs when the body's
response to infection causes tissue damage, organ failure, or even death.
In fact, sepsis costs United States
hospitals more than any other health condition.
So if we could predict sepsis and detect it early, we could improve the outcome
of critical care patients and also lower the cost of health care.
So we're going to look at a data set
today, and this is an actual data set of clinical data.
It was collected from two hospitals
in the Boston, Massachusetts, area in the United States.
And these two data tables were published as a contest on Kaggle by a Cardiology
group, and the results were eventually published in the Cardiology Journal.
Now, there were three units involved
in this study, and in this context, I have the data for what we'll call unit
one and unit two, which is data from two ICUs, intensive care units.
The third group of data was not made publicly available and was held back
for the contest, and it's still not publicly available.
So what we'll do now is examine the data and see if we can predict
sepsis and what variables we should be following to avoid sepsis, which,
of course, can be a life threatening condition.
Now I have a data set in front of me,
and this was downloaded from the Kaggle site and imported into JMP.
And let's take a closer look at it.
Usually I like to jump right in to the data analysis,
but in this particular case, that's not going to be a good idea.
And after we look over the data, you'll see why.
First of all, let's look over at the left
of the JMP data table, and we have the columns window.
I prefer to do a lot of the work from here
and here we see a number of variables.
So let's start with heart rate
right here.
That's going to be an important predictor, probably, but we don't know that.
And we have the rest of the predictors over here.
And there are actually 40 of them.
Well, no, 38, if you don't count the units: O2 saturation, et cetera.
And we could go through the list right here, temperature.
But what you'll notice and for this, I'm going to have to scroll
is that the first six or eight columns are just clinical data.
We have systolic blood pressure.
We have respirations, diastolic blood pressure.
And as we go across our data table, we see some lab data.
We're talking about glucose and lactate
levels, magnesium, phosphate, potassium, bilirubin, et cetera, et cetera.
And finally, we have a set of columns,
which I guess we could call demographic data.
We have the age right here, gender, and what unit they belong to.
Now, while we're on units, let's take a look here.
We have unit one and unit two.
And we know from doing our background
research, and we all do background research,
right, before we start the analysis,
we know that one is a cardiology unit and the other one is a surgery unit.
So if they're in unit two, we have a one, standard practice,
and if they're in unit one, which they're not, we have a zero.
Also standard practice.
But notice something else here.
We have a lot of rows where there is no unit.
Now, we don't know
where that data came from, not unit two.
And the background that was published on the Kaggle site tells us that the third
unit is going to be held back to score the model.
So it wasn't made public.
So we don't know where that data came from.
And I'll discuss that in further detail in a little bit.
Now let's scroll back over and look at some other things about this data table.
We have a whole lot of missing data. Let's take a look at some
of the columns here, and we can just scroll down.
Here's the bilirubin direct.
There's one where we had to go down 63 rows before we found one set of data.
Here's another one at row 113.
So there's not a whole lot of data in there.
As a matter of fact, that column is only
3% populated, and there's a whole lot of other columns
that are populated at a similar rate.
That was the worst example, but there are a whole lot at five and ten percent.
And we also have the problem with the missing unit assignment.
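For anyone who wants to reproduce the "percent populated" check outside JMP, here is a minimal Python sketch. The tiny table, the column names, and the 50% cutoff are all made up for illustration; they are not taken from the Kaggle file.

```python
# Hypothetical miniature table; None marks a missing measurement.
rows = [
    {"HR": 80.0, "Temp": None, "Bilirubin_direct": None},
    {"HR": 85.0, "Temp": 36.9, "Bilirubin_direct": None},
    {"HR": None, "Temp": None, "Bilirubin_direct": 0.3},
    {"HR": 90.0, "Temp": 37.2, "Bilirubin_direct": None},
]


def population_rate(table, column):
    """Fraction of rows in which `column` holds a non-missing value."""
    filled = sum(1 for row in table if row.get(column) is not None)
    return filled / len(table)


# Population rate for every column in the table.
rates = {column: population_rate(rows, column) for column in rows[0]}

# Columns below an (assumed) 50% cutoff are candidates to hide and exclude.
sparse = [column for column, rate in rates.items() if rate < 0.5]

for column, rate in rates.items():
    print(f"{column}: {rate:.0%} populated")
```

The cutoff is a judgment call; the talk excludes columns by eye rather than by a fixed threshold.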
So let me close
that data table and I will open another one.
There we are.
Now, I made some modifications
to that first table and I decided to save some time and not make you sit through
and just watch me clicking on columns.
Notice over here in the columns area, we have these two symbols right here.
One hides columns and the other one excludes them from the analysis.
And if you'll notice this one particular
one, EtCO2, it was right about here in the first
data set and it's missing.
So we hid them and we're going to exclude them from the analysis.
I also did two things which are just personal to me.
Number one, I moved our target variable,
what we want to predict, which is sepsis,
to the left, so that when I view the data table, I can just scan across the rows and
see some relationships if there's something I want to see.
And I also moved this one over because I knew,
just from, for lack of a better term,
general intuition, that this was going to be an important variable, and that's
ICULOS, which is intensive care unit length of stay.
Now let's look at some other things around here.
I want to note one other thing I excluded.
Where was it?
Right here, you can't see it: hospital admission time.
And what that is, we think, is the time between when they were admitted
to the hospital and the time they were admitted to the ICU.
And a lot of those numbers are negative, but there's nothing in the documentation
that tells us how you can have a negative time.
So I excluded those also.
And the other columns that I excluded, like the fibrinogen there right above it.
It's the same deal with those.
There's a lot of missing data,
so I just excluded those.
And let's see,
is there anything else I need to do here?
No, it's time to do the analysis.
But before we do that, let me tell you what the overall plan is going to be.
And that is after we examine the data
and clean it and prep it, which we've already done,
we're going to look at the individual units.
We're going to look at is sepsis,
in other words, whether or not the people develop sepsis,
and we're going to do some database management.
So let's get started.
The first thing I'd like to do
is go up here to Tables
and then go to Subset, and we get the pop up window from JMP.
It says create a new data table, et cetera, et cetera.
So let me click on that.
And of course, I want to check this box here that says Subset By.
And I'll scroll down because we knew there were two units and there was also,
in effect, a third unit, which was neither unit one nor unit two.
So let me just click on unit one,
and I want to pull in all the rows from that one.
And I'd like to keep the By columns just as
a safety check, and you'll find out why in just a minute here.
And we could keep the dialogue open.
But I've run through this analysis before,
so right now, hopefully I won't make a mistake
and I won't change my mind.
And there'll be no need to keep the dialogue box open.
And
for right now, we'll skip the output table name,
and I don't want to link it to the original data table,
but I'll come down here to this box and save the script to the source table
and take one last look over this and everything looks okay.
And I'll click OK.
Now let me separate these a little bit.
Remember, we had unit one, unit two,
and missing, which was neither unit one or unit two.
And I've got three data tables here.
And at the titles,
it says unit one, JMP Pro.
Well, let's just look at that.
It says here unit one.
And if I scroll over,
it says unit one and unit two.
And this is the reason I kept the By columns right here.
It's all missing data.
And we scroll down a little bit just to double check.
And yeah, it looks like they're missing.
So it says unit one.
So what we want to do is
relabel that: right click, hit Edit, and we can change that to missing.
And I'll type it in and we'll hit OK,
and we now know what that is.
It says unit one equals...
I mistyped; it should be missing,
but I'll let it go for right now because we can't analyze that because we don't
know where it came from or rather, we don't want to analyze it.
So I'll just close that out and I don't want to save the changes.
Now, this one says unit one equals zero.
And I'll come over here and yeah, sure enough, there's zero in unit one,
which means it does not belong to unit one.
And over here it says unit two and there's one in the columns.
Scroll down a little bit just to make sure.
And yeah, it says unit two.
Okay, so what we'll want to do is go up
here and we can click that and we could edit it.
And now I want to change that
and we'll do that.
And here where it says subset script.
We can go up here
and I want to change this.
I can edit that and I'll change that to unit two, but I won't do it right now.
And it says right here, if we want to check, we're looking at unit
and right here it says keep by column one, unit one, and that's
the column it is.
So I'll just cancel out of that for now.
And it's the same with this data table over here.
We can go to source code, or rather the source edit.
It says keep by columns one, columns one,
and we'll cancel out of that.
And that's it for the units for right now.
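The Subset By step above splits the table into one table per value of the chosen column, including a table for the rows where the value is missing. A rough Python equivalent, using a made-up four-row table and assuming the unit flags are stored as 1, 0, or missing (`None`), might look like this:

```python
from collections import defaultdict

# Hypothetical rows; the real table has many more columns.
rows = [
    {"id": 1, "Unit1": 1, "Unit2": 0},
    {"id": 2, "Unit1": 0, "Unit2": 1},
    {"id": 3, "Unit1": None, "Unit2": None},  # no unit assignment
    {"id": 4, "Unit1": 0, "Unit2": 1},
]


def subset_by(table, column):
    """Split `table` into one list of rows per distinct value of `column`.

    The By column is kept in every row, which makes it easy to check
    afterwards which subset is which.
    """
    groups = defaultdict(list)
    for row in table:
        groups[row[column]].append(row)
    return dict(groups)


subsets = subset_by(rows, "Unit1")

for value, group in subsets.items():
    label = "missing" if value is None else f"Unit1 = {value}"
    print(label, [row["id"] for row in group])
```

Keeping the By column in each subset plays the same role as the safety check in the talk: the group keyed by `None` is the one with no unit assignment.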
Now, what we could do is go up here
and do another
table platform.
We've got summary and subset and whatnot we could click on summary,
and we could drag
what we want to summarize in here and we could pick our statistics,
whether we want the count, the mean, standard deviation, the median,
excuse me, and a whole lot of other stuff, but we'll skip that for now
and let me cancel that out
and I'll close these two tables and I won't save them because I have
a couple of tables down here where I already did this.
So let me come up here and I'll select unit one
and unit two and I'll open them.
Now, if we wanted to see the difference in outcomes, for example,
between unit one and unit two, we could analyze both of these data sets
separately, but in the interest of time, we will not do that.
What we will do is combine them
and analyze them together.
Now, let's show another feature of JMP
that makes the data handling part of our job easy.
And let me go up to tables
and what we want to do here is concatenate and the helper window shows up.
It says combines rows from several data tables.
We have a number of other selections we
could have made to avoid us having to write some SQL code.
But here we want to concatenate it.
And unit one showed up on top and that's good.
And what we want to do is concatenate unit two.
So I'll click on that and I'll click that and let's give this a name.
Let's call this,
how about, Both.
Check this or not?
I'll just check it here now.
And we could create a source column and again as a check just to make sure
everything is proceeding like we want it to proceed.
I'll create the source column and I could keep the dialogue box open.
That's right here.
But I will close it for now since hopefully I didn't make any mistakes.
Won't have to go back
and we'll click the run button
and let's see. It didn't come up.
Let me try that again.
Let me close that window.
We'll leave it like that and let's see what happens.
There we go.
Must have fat fingers this morning.
And this is a combined data table and as I mentioned, I'd like to keep that open.
It's our source table.
Normally what I do is drag that somewhere off the edge of the table.
But for right now, we'll leave it there and we'll just scroll down a little bit
and we note that we have unit one, unit one.
There we go. Unit two.
So that serves as a check.
So we have that there.
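The concatenate step, with its source column as a check, can be sketched in a few lines of Python. The table names, the `Source Table` column name, and the `isSepsis` and `HR` columns here are placeholders for illustration, not the actual names in the JMP tables.

```python
def concatenate(tables, source_column="Source Table"):
    """Stack the rows of several tables and tag each row with its origin.

    `tables` maps a table name to a list of row dictionaries. The input
    rows are copied, so the original tables are left unchanged.
    """
    combined = []
    for name, rows in tables.items():
        for row in rows:
            combined.append({**row, source_column: name})
    return combined


# Hypothetical stand-ins for the two unit tables.
unit1 = [{"HR": 80, "isSepsis": 0}, {"HR": 110, "isSepsis": 1}]
unit2 = [{"HR": 75, "isSepsis": 0}]

both = concatenate({"unit one": unit1, "unit two": unit2})

for row in both:
    print(row)
```

Scrolling down the combined table and watching the source column change from the first table to the second is the same check done by eye in the talk.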
Now it's time finally to start the analysis.
We can do a number of things here.
First of all, we note that most of our variables are continuous
except for our target variable, which is binary on or off, yes or no, et cetera.
But let's just take a look at the distributions.
I always like to point this out in the analysis.
This is one of my favorite features of JMP.
You can do a quick inspection right here
to see if there's anything weird and let's see.
Okay, scroll back over.
I don't see anything that pops out at me.
It looks like unit one and unit two
are right in there.
One thing to note right here is sepsis equals one.
There are a lot fewer rows where the patient went into sepsis.
And let's see, I think it was about 7%.
So we're looking at 93% here and 7% here.
So let me get rid of that.
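The class balance read off the distribution can be double-checked with a quick count. This Python sketch uses a made-up label list built to match the roughly 93% / 7% split mentioned above; it is not the real data.

```python
from collections import Counter

# Hypothetical labels: 0 = no sepsis, 1 = sepsis.
labels = [0] * 93 + [1] * 7

counts = Counter(labels)
shares = {label: count / len(labels) for label, count in counts.items()}

for label in sorted(shares):
    print(f"sepsis = {label}: {counts[label]} rows, {shares[label]:.0%}")
```

An imbalance like this is worth keeping in mind later, because a model can look accurate simply by predicting "no sepsis" for everyone.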
That's what I like to do.
Now let's go up to the analyze menu, and we're going to make use of this
pretty much exclusively from now until the end of the presentation.
So I'm going to go down here,
I'm going to choose multivariate methods
from the drop down, and I get another drop down.
And what I'm going to do, come on, is choose multivariate.
And this window pops up and wants to know the Y column.
By the way, this is another reason I like
to put the target variable over on the left.
It's right here,
and we can pop it into the Y column and we don't have to scroll and hunt for it.
So let's see, we've got everything in there.
We don't want unit one or unit two.
What else do we have?
Gender, age?
I'll tell you what, let's
put them all in there
and click Y, Columns,
hit OK.
And here's what we get.
We get a correlation matrix.
Now let's take a little closer look at this.
It's a little confusing because we put a whole lot of variables in.
But again, that's one of the advantages to JMP.
You can pop them all in there and don't have to write extra code for it.
So we have our diagonal here.
Our matrix is reflected along the diagonal.
So it's the same data top and bottom and some different colors here.
One means a perfect correlation,
which is what we expect on the diagonal.
The blood pressure, for example, should be correlated perfectly with itself.
But we also note that right here, right under it,
it's correlated with something called MAP, which is mean arterial pressure.
So that's a mean between the systolic and the diastolic.
So that makes sense.
And we have DBP, which is diastolic blood pressure.
So that makes sense.
And there's not a whole lot we can see here.
Here's another correlation.
Okay.
That's BUN, for the urea content.
Blood urea nitrogen is what it stands for.
That's the urea content of the blood,
which is a byproduct of protein metabolism.
And that looks like it's correlated
with something over here, potassium, but pretty much that's about it.
Here's hematocrit, which looks like it's correlated with, right over here, the HGB.
If we read over,
everybody see that square here?
Take a second to look at it.
And HCT is hematocrit, and HGB is hemoglobin.
And hemoglobin is a direct measure of, excuse me, the hemoglobin.
And the hematocrit is, I believe, the volume fraction of red blood cells.
So you would expect them to be correlated.
So
not a whole lot to see here outside of that.
May as well close that.
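To make the correlation matrix entries concrete, here is a small Python sketch of the Pearson correlation for a single pair of columns. The blood pressure readings are invented, and MAP is approximated with the common rule of thumb DBP + (SBP - DBP) / 3, which is an assumption here; the data set's own MAP column may have been measured differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical readings in mmHg.
sbp = [120.0, 100.0, 140.0, 90.0, 110.0]  # systolic
dbp = [80.0, 60.0, 85.0, 55.0, 70.0]      # diastolic

# Assumed approximation of mean arterial pressure.
map_ = [d + (s - d) / 3 for s, d in zip(sbp, dbp)]

r_self = pearson(sbp, sbp)  # a column against itself: the diagonal
r_map = pearson(sbp, map_)  # systolic against MAP

print(f"SBP vs SBP: {r_self:.3f}")
print(f"SBP vs MAP: {r_map:.3f}")
```

Because MAP is computed from the two pressures in this sketch, a strong correlation with systolic pressure is built in, which is the same reason the square in the matrix lights up.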
Next, we're going to go back up to the analyze menu.
We're going to go to analyze.
And from the drop down, we're going to go to screening.
And again, we have another drop down.
And let's hover over this.
It's called predictor screening.
It screens many predictors for their ability to predict an outcome.
So we want to be able to predict
whether or not a particular patient is
going to develop sepsis, so that looks like a good choice.
So we click it and this is the window that we get.
And again, we want to know is sepsis.
One is yes, zero is no.
So we click that and we're presented
with the range again, the same range we had before.
And let's do what we did before.
Start here.
We're going to go down here to gender,
ignoring the units again, hit the shift button,
click and we select all of them and we're going to hit the X button.
And there's nothing else there for us to take note of.
Doesn't look like anything else to click.
So we'll click OK.
And there we go.
And JMP tells us that it's doing a bootstrap forest.
We could do a whole presentation on bootstrap forest, but we don't have time.
In fact, we could probably do two or three or four.
And we are getting there.
It's scoring the results and we just have to wait.
It's taking a while for some reason.
And there we are.
Let's look at what we have here.
We have the contribution, which is the net contribution, not scaled,
to the model.
I have to use the word again, portion,
that contributes to the model.
And you can think of this as a weight
fraction or if you prefer, multiply by 100 in your head to make it a percent.
And if we just take a quick look at that, we see ICU at 0.61, and three,
so we're at 0.74, six,
looks like it takes us up to 0.8, or 80%,
explanation.
So that's what that is.
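The relationship between the unscaled contribution and the portion, and the running total added up out loud above, can be sketched in Python. The predictor names and contribution values are invented for illustration and are not the numbers from the JMP report.

```python
from itertools import accumulate

# Hypothetical unscaled contributions from a screening run.
contributions = {
    "predictor_a": 50.0,
    "predictor_b": 30.0,
    "predictor_c": 15.0,
    "predictor_d": 5.0,
}

total = sum(contributions.values())

# Rank predictors from largest to smallest contribution.
ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)

# Portion = contribution divided by the total, so the portions sum to one.
portions = [(name, value / total) for name, value in ranked]

# Running total of the portions, in rank order.
running = list(accumulate(portion for _, portion in portions))

for (name, portion), cumulative in zip(portions, running):
    print(f"{name}: portion={portion:.2f}, cumulative={cumulative:.2f}")
```

Reading down the cumulative column shows how many of the top predictors are needed to reach a given share, such as the 80% mentioned above.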
So all those make sense.
Now let's look at that ICULOS.
Before, I talked about excluding that. Well, intensive care unit length of stay
predicts what looks like more than the others combined, probably.
However,
if we use that, that would be a little bit of circular reasoning.
If people develop sepsis, they're almost certainly going to end up in the ICU.
If they are really sick, they may be at higher risk of developing sepsis.
So they are going to end up in the ICU.
So if they're in their ICU, they're probably pretty sick to begin with.
They're already developing sepsis and they're going to be in there for a while.
So
let's go up here and exclude it.
That really
doesn't help us too much because it's not
something we can measure like blood pressure.
I mean, they're already in there or they're not in there.
So let's exclude that and we'll go up here to analyze
back down to screening,
predictor screening.
Hit is sepsis for the Y response.
And let's leave that one out.
Let's leave the ICULOS out and we'll do everything exactly the same as before.
Hit shift, gender, select everything.
Hit the X button.
Nothing else for us to do there,
it doesn't look like. Hit
OK.
And we'll just wait for a little while.
Again,
looks like it's running a little bit faster this time.
And here we go.
Now we have the ICULOS completely out
of the picture and we see something else here.
The BUN, blood urea nitrogen, looks like it's in the running for a significant
predictor. After that, temperature, which makes sense.
If you develop sepsis, you have an infection.
So you're probably going to have a temperature.
Creatinine is a byproduct of muscle breakdown.
So that makes sense.
And remember, we did our research before
we started the analysis. And after that, respirations. That makes sense:
shallow, rapid breathing. Hemoglobin content, hematocrit.
Okay, they're highly correlated. Blood pressure.
And this WBC I didn't point out before, but that's white blood cell count.
So that makes sense too.
Now we have a decision to make.
This is obviously
the most prominent; I don't want to say important.
We're not sure of that yet, but it's the most prominent.
After that comes temperature and creatinine.
And then there's a large drop
in the rankings and the importance of the rankings in the portion here.
So we have a decision to make.
So
let's start up there with the blood urea nitrogen and let's go down.
Let's pull in as much as we can because JMP is going to make this,
all the repetitive tasks,
all the calculations, easy for us by taking them away from us.
So Shift click.
Let's go down to systolic blood pressure.
That's probably going to play a role
because if you have sepsis, you tend to have very low blood pressure.
Dangerous.
And we have some other measurements down here, but we'll just skip those.
By the way, these two are going to be correlated.
This one right here is the partial pressure of the carbon dioxide
in the blood and this is the bicarbonate content of the blood.
So those are going to be related.
So it doesn't look like there's anything else of importance.
And JMP puts in a handy link there.
It says copy selected.
So let's do that.
I copied the selected and I'll just leave
that open for right now and we'll go up here to analyze once again.
And what we want to do now is fit the model.
It says fits a linear regression.
So let's go there.
And since we copied selection
in the previous window, JMP remembered that for us.
And what we want to do is click the add button
and we'd like them to construct the model effects.
That means we want to use them as a modeling variable.
And what else we have here.
Notice this upper right hand corner we have something called personality.
So focus on that right hand corner for just the next 30 seconds or so.
And I'll go over here and hit is sepsis and we'll put that in the Y.
I'll look at the upper right hand corner personality.
I click the Y and I get a choice here.
I get some choices in the drop down menu
and I get an emphasis window and I'll click the drop down triangle here
and I have a whole lot of choices here and we won't go over them right now.
But probably what I want is a generalized linear model.
And if I hover over it, let me try that again.
It gives us a pop up window.
It says fits a generalized linear model, and try it once again.
And I get to select the distribution
and the link and all that sort of thing, which I'll go over in a couple of seconds
here, and I get another drop down here a link function.
So let's start with distribution.
And
remember, we expanded that top column and we got our distributions.
We took a look at it didn't look like
there's anything weird there, at least on a macro scale.
So let's just pick normal.
And I want logit
for the link function
because we've got a binary variable that we're trying to predict.
And let's see,
take a closer look.
Nothing else left for me to do.
It doesn't look like.
So I'll hit the run button
and here we go.
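To show what a logit link does with a binary target, here is a minimal pure-Python sketch of a one-predictor logistic fit by gradient descent. Two things are assumptions: the data are made up, and the sketch uses the Bernoulli (binomial) likelihood, which is the usual pairing with a logit link for a 0/1 response, rather than the Normal distribution picked in the dialog above. It is not JMP's fitting algorithm.

```python
import math


def sigmoid(z):
    """Inverse of the logit link, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, learning_rate=0.1, epochs=2000):
    """Fit intercept and weights by gradient descent on the log-loss."""
    n, k = len(X), len(X[0])
    weights = [0.0] * k
    intercept = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * k
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(intercept + sum(w * x for w, x in zip(weights, xi)))
            error = p - yi  # gradient of the log-loss w.r.t. the linear term
            grad_b += error
            for j in range(k):
                grad_w[j] += error * xi[j]
        intercept -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return intercept, weights


# Hypothetical standardized predictor (for example, a scaled lab value)
# and a 0/1 outcome that is more often 1 at higher values.
X = [[-2.0], [-1.5], [-1.0], [-0.5], [0.0], [0.5], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1]

intercept, weights = fit_logistic(X, y)
p_low = sigmoid(intercept + weights[0] * -2.0)
p_high = sigmoid(intercept + weights[0] * 2.0)

print(f"intercept={intercept:.3f}, weight={weights[0]:.3f}")
print(f"P(y=1 | x=-2)={p_low:.3f}, P(y=1 | x=2)={p_high:.3f}")
```

The fitted weight is on the log-odds scale, so a positive weight means the predicted probability of the outcome rises as the predictor rises.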
Here's our generalized linear model fit, and it gives us a summary up
here of what we looked at.
We got a chi-square.
I won't go over that in any great detail, but let's scroll down a little bit more,
cut that off.
Here we have an effect summary.
And just by the way, if we click on these triangles,
we can hide them or make them appear again.
So depending on what we want to present,
I'll just leave them all open for right now.
And what we have up here is the source.
And there it is.
Blood urea nitrogen.
The BUN is up there very high again, followed by temp, creatinine,
white blood cell count, heart rate, and some blood pressure measurements.
And all this makes sense from what we know
about sepsis, and LogWorth is the contribution.
And over here we have the p-value and we see that they're all very highly significant
up to right about here, the white blood cell count.
And we see a blue line here.
And what that is, is the LogWorth of two.
And the reason we take the LogWorth
is because we'd like to be able to do things on the graph.
So we put it on a log scale and that just makes it easier.
Otherwise the bar up here would be off the edge of my screen.
And this blue line here is
a significance level,
because it's the LogWorth,
and the 0.01 significance level is a log of negative two.
Excuse me, is the value of 0.01, which is a LogWorth of two.
Get rid of the negative sign.
And that's what this blue line is.
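The LogWorth is the negative base-10 logarithm of the p-value, which is why a p-value of 0.01 lands on the line at two. A short Python sketch, with made-up p-values and placeholder effect names, shows the conversion:

```python
import math


def logworth(p_value):
    """LogWorth: the negative base-10 logarithm of a p-value."""
    return -math.log10(p_value)


# The reference line sits at the LogWorth of a 0.01 significance level.
threshold = logworth(0.01)

# Hypothetical p-values for a few effects.
p_values = {"effect_a": 1e-12, "effect_b": 0.004, "effect_c": 0.2}

for name, p in p_values.items():
    worth = logworth(p)
    flag = "above the line" if worth > threshold else "below the line"
    print(f"{name}: p={p:g}, LogWorth={worth:.2f} ({flag})")
```

Smaller p-values give larger LogWorth values, so the most significant effects have the longest bars.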
So these are all significant up to HR.
And if we come down here,
we have the chi-square results and we see some significance levels here, too.
And basically, we're looking at the respiration,
the BUN and the creatinine and also the temperature.
They're all highly significant. Almost forgot the white blood cell count down here.
And I'm running out of time, so I won't explain a whole lot about that.
But if we scroll down a little bit more, we can get an estimate
for our predictor variables here.
This is an estimate of the exponent and it gives us more statistical data on that.
And I'm starting to run out of time.
So let me just minimize those windows
and we'll get rid of all the highlights and let's recap
what we had here.
Here is our original, highly cleaned and rearranged data table.
We want to predict sepsis, which is binary.
And we ruled out the length of stay in the ICU, that is right here, because
it didn't help us and it was kind of circular logic.
And we've got our variables in three separate groups up here.
We start off with the clinical and then we come over here and we had all
our blood tests and then we had the demographic data and we had two units
and we excluded all 25 or 30% of the data because
the data wasn't assigned to either unit and we don't know where it came from.
So we got rid of that and then we subsetted everything.
Remember, we got two, well, actually, we got three separate subsets because
the third subset was the missing data "unit," unit being in quotes.
And we went from there and we went
to the multivariate to look for some correlations.
Then we went to analyze, screening, predictor screening,
and we got what we
figured were going to be our most valuable predictors to predict sepsis.
And finally we went to fit model and let me
reiterate on that.
We went up to analyze
fit model
and we clicked that and we got this window
and we dumped everything in there except for what we wanted to exclude.
And we put sepsis right here in this y variable.
And remember, this is the area that we had
to focus on up here in the upper right hand corner.
We had the personality and a couple
of other selections to make, and we selected from that.
And we got our results, which I
went over about a minute ago, and that is the end of the presentation.
I hope everybody enjoyed it and even learned a little something from it.
Thank you for watching, listening, and giving it your attention.