Re: Using XGBoost Model to Predict on a hold-out test set

sukrit2020 · Jun 8, 2023 5:36 PM

I am using JMP 16. I installed the XGBoost add-in and developed an XGBoost model with it. I saved the model to the data table using 'Save Prediction formulae'. Now I have a separate hold-out test set on which I want to run the formulae to get predictions. When I click on the formulae (see attached figure), I find that they depend on the training and validation set, which seems odd to me. This also prevents me from running the model on the hold-out test set.

statman · Jul 6, 2021 11:59 AM

I'm not sure I know the answer to your question, but it appears you do not have any variables in your model (the column names are validation and training and shrinkage predator validation), so the formula is a function of those column names. Could you either re-write the model in terms of variables or re-name the columns of your hold-out data set?

"All models are wrong, some are useful" G.E.P. Box

sukrit2020 · Jul 6, 2021 01:20 PM

I added the JMP file. If you look at the XGBoost model, you can see the details. I have 17 variables.

statman · Jul 6, 2021 03:55 PM

Your data set is the same as the one used in the help menu for the XGBoost add-in. Did you follow the steps shown in the application of the platform?

sukrit2020 · Jul 6, 2021 06:54 PM

Yes, I did. I was still not able to get the formulae.

statman · Jul 7, 2021 03:22 PM

Sorry, I haven't had time to look at your data. I'm also not very experienced with the platform, but my guess is that you should use the platform to find the significant factors, then re-write your model in Fit Model, run it, and save the prediction formula. But again, I'm not experienced here. Perhaps there is an easier way to get the actual model in terms of the variables in the data table...

sukrit2020 · Jul 8, 2021 10:47 AM

Can you show me over a Webex call?

statman · Jul 9, 2021 4:13 AM

I suggest getting in touch with Russ: XGBoost Add-In for JMP Pro

russ_wolfinger · Jul 9, 2021 07:43 AM

Hi @sukrit2020 ,

A recommended approach is to append your separate hold-out test set to the main data as new rows with the Y target values set to missing. Then the formula will automatically create predictions on those rows.
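To make the append step concrete outside JMP, here is a minimal plain-Python sketch (not JSL, and not the add-in's own code). The rows are stand-ins for a data table, and the column names `x1` and `y` are hypothetical. The point is only that the hold-out rows keep their predictors while the target is blanked out, so they can be scored but do not carry a target for fitting.

```python
# Illustrative only: lists of dicts stand in for JMP data tables.
# The column names "x1" and "y" are hypothetical.
train_rows = [
    {"x1": 0.5, "y": 1.0},
    {"x1": 1.5, "y": 3.0},
]
holdout_rows = [
    {"x1": 2.5, "y": 5.2},
    {"x1": 3.5, "y": 7.1},
]

def append_with_missing_target(train, holdout, target="y"):
    """Return the training rows followed by copies of the hold-out rows
    whose target value is set to missing (None)."""
    masked = [{**row, target: None} for row in holdout]
    return list(train) + masked

combined = append_with_missing_target(train_rows, holdout_rows)

# Rows with a missing target are the ones that only receive predictions.
rows_to_score = [row for row in combined if row["y"] is None]
print(len(combined), len(rows_to_score))
```

The original hold-out rows are left unchanged, so the true target values stay available for comparing against the predictions afterwards.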

Also, it's important to note that XGBoost handles validation differently than other JMP platforms. If the validation column is Nominal, it will automatically do full k-fold with each of the levels. This is why the formula looks like it does. This can be confusing if your validation column has two levels, e.g. "Validation" and "Training". In this case XGBoost would actually do 2-fold, holding out each subset regardless of their values.
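As a rough illustration of that behavior, the following plain-Python sketch (again not JSL; the helper name `folds_from_levels` is made up for this example) shows how a column with the two levels "Training" and "Validation" yields two folds, with each level held out once, whatever the labels say.

```python
def folds_from_levels(validation_column):
    """For each distinct level, hold out the rows with that level
    and train on all remaining rows."""
    levels = sorted(set(validation_column))
    folds = []
    for level in levels:
        held_out = [i for i, v in enumerate(validation_column) if v == level]
        trained_on = [i for i, v in enumerate(validation_column) if v != level]
        folds.append(
            {"held_out_level": level, "train": trained_on, "holdout": held_out}
        )
    return folds

validation = ["Training", "Training", "Validation", "Training", "Validation"]
folds = folds_from_levels(validation)
for fold in folds:
    print(fold["held_out_level"], "held out rows:", fold["holdout"])
```

Every row is held out exactly once across the folds, which is consistent with the saved formula referring to both levels of the validation column.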

A better way to set up your folds is to use the Make K-Fold Columns utility, and then use those as your Validation columns. I feel repeated k-fold is a much better way to validate your model than using a single holdout set. If you really want to only do a single holdout, you must set the type of the Validation column to Continuous, with value 0 corresponding to training and 1 to validation.
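For illustration, here is a small plain-Python sketch of the two setups. It is not how the Make K-Fold Columns utility is implemented; the function names, the balanced-shuffle assignment, and the default values are assumptions made for this example only. The first helper recodes text labels to the numeric 0/1 coding for a single holdout, and the second builds several fold-id columns for repeated k-fold.

```python
import random

def to_numeric_validation(labels, validation_label="Validation"):
    """Map text labels to 0 (training) and 1 (validation)."""
    return [1 if label == validation_label else 0 for label in labels]

def make_fold_columns(n_rows, k=5, repeats=3, seed=0):
    """Return `repeats` columns, each assigning every row a fold id in 1..k.
    Fold sizes are balanced, and each column is shuffled independently."""
    rng = random.Random(seed)
    columns = []
    for _ in range(repeats):
        fold_ids = [(i % k) + 1 for i in range(n_rows)]
        rng.shuffle(fold_ids)
        columns.append(fold_ids)
    return columns

numeric_validation = to_numeric_validation(["Training", "Validation", "Training"])
fold_columns = make_fold_columns(n_rows=10, k=5, repeats=3)
print(numeric_validation)
print(len(fold_columns), "fold columns of length", len(fold_columns[0]))
```

In JMP itself, the 0/1 column would also need its modeling type set to Continuous, as described above.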

sukrit2020 · Jul 10, 2021 12:33 PM

Hi Russ,

Can we set up a Webex call where you can show me the procedure? I am a little new to JMP and thus could not fully follow your instructions.

thanks,

sukrit