Hospital Admissions Data for Sepsis Prediction
Level III

Sepsis is a life-threatening condition that occurs when the body's response to infection causes tissue damage, organ failure, or death. Sepsis costs U.S. hospitals more than any other health condition, and a majority of these costs are for sepsis patients who were not diagnosed at admission. Early detection and treatment are therefore critical for improving outcomes. In this session, we will examine an actual clinical data set, obtained from two U.S. hospitals and recently published on Kaggle. In particular, we will examine a number of predictors drawn from a combination of vital signs, demographic groups, and clinical laboratory data. We will use JMP to deal with issues such as missing values, outliers, and a highly unbalanced categorical outcome variable. In addition, we will show how visualization, interactivity, and analytical flow can lead to a more compact and integrated analysis, and to a shorter time to discovery.


The attached tables contain the same raw data, once as a CSV file and once as a JMP data table. The following links provide additional information:


Critical Care Medicine 

Kaggle Data



I would advise against using this data set. I've explored it in relation to the much more complex data it was derived from. In the original data, there was a file for each ICU patient with hourly data recorded during their stay in the ICU. The data set submitted here contains one row for each patient.

I've reverse-engineered how this data set was created. For each non-septic patient, a randomly selected hour from that patient's file was reported (some patients had hundreds of hours in the ICU, others far fewer). For each septic patient, the row shows the hourly readings from 7 hours after sepsis was first flagged in that patient's file. However, the Challenge data that it was derived from explains that the sepsis variable is already set to 1 starting 6 hours before sepsis is found, so that the hourly readings at that time can be used to predict whether someone will develop sepsis. Because this file uses the hourly readings from 7 hours after the original file first reported sepsis=1, it essentially undoes the time shift in the original data.

In other words, when the file submitted here shows sepsis=1, we are looking at hourly readings taken while the person is already septic, not 6 hours before. Those readings are therefore not useful for early prediction of sepsis, because they are concurrent with sepsis that has already occurred. Of course, you will find that prediction models built on this data set perform very well (AUC > 0.9), but they are not at all useful. It is, however, a good example of target leakage.
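
To make the timing argument concrete, here is a minimal sketch in Python. It is not the code used to build either data set. The 48-hour stay, the onset at hour 30, and all function names are hypothetical; the only numbers taken from the comment above are the 6-hour label lead and the 7-hour snapshot offset.

```python
# Hypothetical illustration of the time shift described above.
LABEL_LEAD_HOURS = 6       # assumption: label becomes 1 six hours before onset
SNAPSHOT_OFFSET_HOURS = 7  # assumption: row taken 7 hours after the label first equals 1


def shifted_labels(n_hours, onset_hour, lead=LABEL_LEAD_HOURS):
    """Hourly labels for one patient: 1 from `lead` hours before onset onward."""
    return [1 if t >= onset_hour - lead else 0 for t in range(n_hours)]


def first_positive_hour(labels):
    """Index of the first hour labelled 1, or None for a non-septic patient."""
    return labels.index(1) if 1 in labels else None


def snapshot_hour(labels, offset):
    """Hour whose readings end up in the one-row-per-patient table."""
    first = first_positive_hour(labels)
    if first is None:
        return None
    # Clamp so the snapshot never falls past the end of the stay.
    return min(first + offset, len(labels) - 1)


onset = 30  # made-up hour of clinical sepsis onset
labels = shifted_labels(n_hours=48, onset_hour=onset)

leaky = snapshot_hour(labels, SNAPSHOT_OFFSET_HOURS)  # how the flattened file appears to be built
early = snapshot_hour(labels, 0)                      # first hour the shifted label is 1

print("leaky snapshot, hours relative to onset:", leaky - onset)  # 1
print("early snapshot, hours relative to onset:", early - onset)  # -6
```

Under these assumptions the snapshot in the flattened file lands one hour after onset, while a row taken at the first labelled hour would land six hours before it. Only the second kind of row carries information that would be available in time for an early warning.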