c_blanken
Level I

Error message for Neural Net With Missing Validation Rows - Suggest Workaround?

Hello community,

 

I am testing several models as part of a JSL loop in JMP 16, using different validation columns to determine how the R² values change depending on which data are used.  Each validation column contains a subset of the data, with some missing values at the beginning and/or the end of the column. 

 

Other models, such as Bootstrap Forest, run fine and simply ignore the missing values, but the Neural platform keeps throwing an error message (reproduced below on the JMP sample data set, CrimeData).

c_blanken_0-1677162099156.png

 

c_blanken_1-1677162122356.png

 

I can work around the problem by including a filter column, setting all missing values to 0 and not running them.  But this will require a filter column for each validation column (or dynamically creating/deleting a column on each iteration; see the sketch after the script below).  

c_blanken_2-1677162442015.png

 

Neural(
	Y( :Murder Rate ),
	X(
		:Robbery Rate, :"Agg-Aslt Rate"n, :Burglary Rate, :Larceny Rate,
		:MVTheft Rate
	),
	Validation( :Validation ),
	Informative Missing( 0 ),
	Fit( NTanH( 3 ) ),
	Where( :Filter == 1 )
)
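
A rough sketch of the "dynamic column" variant, reusing a single Filter column that is rebuilt and deleted on each pass of the loop (the names in valCols are placeholders for whatever validation columns the loop actually iterates over):

// Hedged sketch, not the only way to do it: one throwaway Filter column per iteration.
dt = Current Data Table();
valCols = {"Validation A", "Validation B"};	// placeholder column names
For( i = 1, i <= N Items( valCols ), i++,
	vcol = Column( dt, valCols[i] );
	// Filter = 1 where the row has a validation assignment, 0 where it is missing
	dt << New Column( "Filter", Numeric, Continuous,
		Set Each Value( !Is Missing( vcol[Row()] ) )
	);
	// ... run the Neural() call shown above here, with Where( :Filter == 1 ) ...
	dt << Delete Columns( "Filter" );
);

This keeps the table down to one temporary column no matter how many validation columns the loop covers.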

 

 

Has anyone else encountered this issue and found a solution?  Thanks!

2 REPLIES
SDF1
Super User

Re: Error message for Neural Net With Missing Validation Rows - Suggest Workaround?

Hi @c_blanken ,

 

  It would appear that your validation column is not filling all the rows because you are stratifying it on some other column(s) that have missing data. My guess is that JMP doesn't know how to assign those rows in the validation column because of the missing data.

 

  Do you plan on modeling an outcome where you have missing data? If not, then you could consider excluding those rows where you have missing data -- this is essentially what you are doing with the filter column, but by excluding those rows, you don't need a new filter column for each validation column you test. You can also easily set the row state as Exclude in JSL so that it's automated and you don't have to do it manually.
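
A minimal sketch of that row-state approach, assuming the gaps come from the response column in your script (:Murder Rate); swap in whichever column actually drives the missing data:

// Exclude every row with a missing response so modeling platforms simply skip it.
dt = Current Data Table();
For Each Row(
	If( Is Missing( :Murder Rate ),
		Excluded( Row State() ) = 1
	)
);
// dt << Clear Row States();	// run later to undo the exclusions if needed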

 

  If you still need to model the rows with missing data, then you might consider not stratifying on any columns and just generating the training/validation columns randomly. JMP will then assign each row to a set even when other columns have missing values.
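
A minimal sketch of such a purely random split (the 70/30 ratio and the new column name are just placeholders):

// 0 = Training, 1 = Validation -- every row gets a label, missing data or not.
dt = Current Data Table();
dt << New Column( "Validation Random", Numeric, Nominal,
	Formula( If( Random Uniform() < 0.7, 0, 1 ) )
);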

 

Hope this helps!

DS

 

Victor_G
Super User

Re: Error message for Neural Net With Missing Validation Rows - Suggest Workaround?

Hi @c_blanken,

 

Just to add some comments/remarks to the great answer from @SDF1.

  • Why do some rows have missing values in the validation column? Were they added after the column was created?
  • I'm not surprised that the "Bootstrap Forest" platform is able to deal with missing values in the validation column, since this model uses bootstrap samples of the training data to fit many decision trees. Neural networks and other methods use the data as-is, which is why you may encounter the warning message you get.

 

Instead of creating several validation columns to assess the influence of training set variability on model performance, there are (at least) two other interesting options for you:

 

1. Create a "formula" Validation Column (jmp.com) instead of a "fixed" validation column:

Victor_G_0-1677225919611.png

This approach has two benefits:

  • Even with stratified/grouped validation, it automatically assigns a Training/Validation (or Test) label to any new rows added, following the stratification/grouping method and the set ratios you used.
  • When modeling, you can also more easily try different resamplings of the training and validation sets: right-click the performance metric you want to evaluate with various training and validation sets, click "Simulate", and you will have the option to switch your validation column in and out:

Victor_G_1-1677226282870.png

By doing this you'll get a data table with all the simulations done on various training and validation sets, and you can plot the performance variability obtained across these simulations (here R² on the toy data set "Boston Housing" with Random Forest):
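
As a rough sketch, once the Simulate output table is the active data table, something like the following would summarize the spread of the simulated values (the :RSquare column name is an assumption; use whatever column the Simulate output actually contains):

// Plot the distribution of the simulated performance metric.
Distribution( Y( :RSquare ) );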

Victor_G_2-1677226431627.png

 

2. Use the Model Screening (jmp.com) platform with only the model you want selected and the validation method you want (K-fold, nested K-fold, with the option to repeat them, or a validation column), and launch it. You'll get access to the results summary as well as the individual fold results:

Victor_G_3-1677226809800.png

When you select one fold to run the model (here I tried fold 3), a new column also appears in your data table so you can see which rows in that fold were used for training or validation (0: Training, 1: Validation):

Victor_G_4-1677226989882.png

 

I think these methods may be useful for you, with less manual work and greater flexibility.

Victor GUILLER
Scientific Expertise Engineer
L'Oréal - Data & Analytics