Automated Data Imputation: A Versatile Tool in JMP® Pro 14 for Handling Missing Values (US 2018 130)

Level: Intermediate
Milo Page, JMP Research Statistician Developer, SAS

JMP Pro 14 includes a new Automated Data Imputation (ADI) utility, a versatile, empirically tuned, streaming, missing data imputation method. We recommend it for handling missing data as a pre-processing step to predictive model fitting. It empirically tunes to your data set to extract the underlying structure, even in the presence of missing data. It also respects training and validation partitions and interfaces seamlessly with predictive models. It is developed using powerful matrix completion methods with some added extensions for robustness and flexibility. This talk will focus on when ADI is appropriate and how to use it in JMP Pro 14. I will also outline a recommended workflow for processing data with missing values and demonstrate ADI’s performance on some examples.
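For readers curious what "matrix completion" looks like mechanically, below is a minimal, purely illustrative sketch in Python/numpy of the classic iterative-SVD imputation loop that this family of methods builds on. It is not JMP's ADI implementation, just the textbook idea: start from column means, repeatedly reconstruct the matrix at low rank, and refill only the missing cells.

```python
import numpy as np

def iterative_svd_impute(X, rank, n_iter=100, tol=1e-6):
    """Fill the NaN cells of X with a rank-`rank` SVD reconstruction.
    Textbook matrix completion; a sketch, not JMP's ADI code."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    X_hat = np.where(missing, np.nanmean(X, axis=0), X)  # mean-impute start
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        delta = np.linalg.norm(X_hat[missing] - low_rank[missing])
        X_hat[missing] = low_rank[missing]  # refill only the missing cells
        if delta < tol:
            break
    return X_hat
```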

(Presentation slides: Milo_Page_01.jpg through Milo_Page_17.jpg)

Comments
FR60

Ciao Milo

I found your PDF file very interesting.

Could you give me more details about the process reported in this slide below?

If I understood well, you are saying to split the data into training and test sets before imputing the missing values. Once you have divided the data, do you apply the ADI method only to the training data, correct?

Then what do you do with the validation and test sets?

What do you mean by "streaming imputed validation/test set"?

 

Thanks for your help.  Felice 

 

(attached slide image)

@Milo has left SAS. Perhaps @chris_gotwalt1 can answer here.

Hello Felice,

I can help with your questions; I was Milo's Ph.D. advisor for the research that led to the creation of ADI in JMP Pro.

 

If I understood well, you are saying to split the data into training and test sets before imputing the missing values. Once you have divided the data, do you apply the ADI method only to the training data?

The workflow is to use the Make Validation Column platform to set up the data table column that partitions the rows into training and validation sets. When you launch Explore Missing Values, assign that column to the validation role. The ADI algorithm fits PCA-type models to the training set and uses the validation set to tune the model, for example to identify the best rank for the dimension reduction.
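As a rough illustration of that fit-and-tune step (a sketch under simplifying assumptions, not the actual ADI estimator): fit column means and a PCA basis on the training rows only, impute each validation row by regressing its observed entries on the loadings, and pick the rank that best recovers validation cells that were deliberately hidden. The function names here (fit_subspace, impute_row, tune_rank) are hypothetical.

```python
import numpy as np

def fit_subspace(X_train, rank):
    """Column means and a rank-`rank` PCA basis from the training rows
    (mean-imputing their missing cells first). A simplified sketch."""
    X = np.array(X_train, dtype=float)
    mu = np.nanmean(X, axis=0)
    Xc = np.where(np.isnan(X), mu, X) - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return mu, Vt[:rank].T                     # loadings W: (p, rank)

def impute_row(x, mu, W):
    """Impute one row: regress its observed entries on the loadings,
    then reconstruct the missing ones from the fitted scores."""
    x = np.array(x, dtype=float)
    obs = ~np.isnan(x)
    scores, *_ = np.linalg.lstsq(W[obs], (x - mu)[obs], rcond=None)
    return np.where(obs, x, mu + W @ scores)

def tune_rank(X_train, X_valid, ranks=range(1, 6), hide_frac=0.2, seed=1):
    """Choose the rank that best recovers validation cells we hide on
    purpose, echoing how ADI uses the validation rows to tune."""
    rng = np.random.default_rng(seed)
    Xv = np.array(X_valid, dtype=float)
    hide = ~np.isnan(Xv) & (rng.random(Xv.shape) < hide_frac)
    Xv_masked = np.where(hide, np.nan, Xv)
    best_rank, best_err = None, np.inf
    for r in ranks:
        mu, W = fit_subspace(X_train, r)
        Xv_hat = np.vstack([impute_row(row, mu, W) for row in Xv_masked])
        err = np.sqrt(np.mean((Xv_hat[hide] - Xv[hide]) ** 2))
        if err < best_err:
            best_rank, best_err = r, err
    return best_rank
```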

 

Then what do you do with the validation and test sets?

I recommend using the same Training/Validation or Training/Validation/Test partitioning of the data when moving on to fitting supervised learning algorithms, such as Neural Networks or Bootstrap Forests, at the next step in the modeling process. Imputation is part of the modeling process, just as the predictive modeling component is. Using the same partitioning of the rows into Training/Validation/Test sets for the supervised learning step ensures the same degree of independent assessment of goodness of fit for the entire modeling exercise.
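In code terms, the point is simply that one partition, created once, drives both stages. Here is a toy end-to-end sketch on synthetic data, reusing the hypothetical fit_subspace / impute_row / tune_rank helpers from the previous sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: 200 rows, 5 features with latent low-rank structure,
# a linear response, and ~15% of feature cells knocked out
n, p = 200, 5
Z = rng.normal(size=(n, 2))
X = Z @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
X[rng.random((n, p)) < 0.15] = np.nan

# one partition, made once, reused for BOTH stages
is_valid = rng.random(n) < 0.3

best_rank = tune_rank(X[~is_valid], X[is_valid])     # tune the imputer
mu, W = fit_subspace(X[~is_valid], best_rank)
X_imp = np.vstack([impute_row(row, mu, W) for row in X])

# ...then fit the predictive model on X_imp[~is_valid], y[~is_valid]
# and assess it on X_imp[is_valid], y[is_valid] -- the SAME rows that
# tuned the imputation, so the holdout's role is consistent end to end
```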

 

What do you mean by "streaming imputed validation/test set"?

Maybe our language isn't clear here. ADI saves a formula column that creates imputed versions of the existing columns. This means you can use Tables << Concatenate to add new observations to the data table: the new rows are imputed immediately, and any prediction formulas you have saved will execute, producing predictions for the responses based on the imputed data, with no further work needed from the user.
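In the terms of the sketches above, "streaming" means that once the training-set quantities (the hypothetical mu and W) are in hand, a brand-new row can be imputed on arrival with no refitting, which is what the saved formula column automates inside JMP:

```python
import numpy as np

# once mu and W are learned from the training rows, new observations are
# imputed on arrival with no refitting -- the "streaming" part
new_row = np.array([0.4, np.nan, -1.2, np.nan, 0.7])
new_row_imputed = impute_row(new_row, mu, W)

# in JMP, concatenating new rows triggers the saved imputation formula and
# then the saved prediction formulas automatically; in this sketch you
# would pass new_row_imputed to your fitted model by hand
```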

 

This imputation happens immediately for the training/validation/test portions of the data when ADI runs, so you can go straight to modeling with the imputed versions of the columns without worrying too much about the missing cells. ADI is pretty smart: it uses the validation rows to determine the rank of the dimension reduction. If the columns are independent, or too messy for a linear rank reduction, it figures that out and simply imputes the training mean of each column.
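A sketch of that fallback logic (loosely, not ADI's literal decision rule): score plain column means as a "rank 0" candidate on the same hidden validation cells, and keep a low-rank model only if it does better. This reuses the hypothetical fit_subspace / impute_row helpers from the earlier sketch.

```python
import numpy as np

def fit_or_fallback(X_train, X_valid, ranks=(1, 2, 3, 4, 5),
                    hide_frac=0.2, seed=1):
    """Return a chosen rank, or 'means' when no low-rank model beats
    plain column-mean imputation on the hidden validation cells."""
    rng = np.random.default_rng(seed)
    Xv = np.array(X_valid, dtype=float)
    hide = ~np.isnan(Xv) & (rng.random(Xv.shape) < hide_frac)
    Xv_masked = np.where(hide, np.nan, Xv)

    def rmse(imputed):
        return np.sqrt(np.mean((imputed[hide] - Xv[hide]) ** 2))

    # baseline: impute every hidden cell with the training column mean
    mu0 = np.nanmean(np.array(X_train, dtype=float), axis=0)
    best, best_err = 'means', rmse(np.broadcast_to(mu0, Xv.shape))
    for r in ranks:
        mu, W = fit_subspace(X_train, r)
        Xv_hat = np.vstack([impute_row(row, mu, W) for row in Xv_masked])
        if rmse(Xv_hat) < best_err:
            best, best_err = r, rmse(Xv_hat)
    return best   # 'means' signals the mean-imputation fallback
```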