<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic model screening in Discussions</title>
    <link>https://community.jmp.com/t5/Discussions/model-screening/m-p/752105#M93347</link>
    <description>&lt;P&gt;Hello.&lt;BR /&gt;I have a dataset, and I want to set aside 30% of it for testing. I want to try different models using Model Screening, but I want all of the models to use exactly the same test dataset for validation.&lt;BR /&gt;What should I do?&lt;/P&gt;</description>
    <pubDate>Sun, 05 May 2024 10:23:40 GMT</pubDate>
    <dc:creator>maryam_nourmand</dc:creator>
    <dc:date>2024-05-05T10:23:40Z</dc:date>
    <item>
      <title>model screening</title>
      <link>https://community.jmp.com/t5/Discussions/model-screening/m-p/752105#M93347</link>
      <description>&lt;P&gt;Hello.&lt;BR /&gt;I have a dataset, and I want to set aside 30% of it for testing. I want to try different models using Model Screening, but I want all of the models to use exactly the same test dataset for validation.&lt;BR /&gt;What should I do?&lt;/P&gt;</description>
      <pubDate>Sun, 05 May 2024 10:23:40 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/model-screening/m-p/752105#M93347</guid>
      <dc:creator>maryam_nourmand</dc:creator>
      <dc:date>2024-05-05T10:23:40Z</dc:date>
    </item>
    <item>
      <title>Re: model screening</title>
      <link>https://community.jmp.com/t5/Discussions/model-screening/m-p/752120#M93348</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.jmp.com/t5/user/viewprofilepage/user-id/56938"&gt;@maryam_nourmand&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The easiest way to have reproducible results on fixed training/validation/test datasets is to&amp;nbsp;&lt;A href="https://www.jmp.com/support/help/en/17.2/#page/jmp/make-validation-column.shtml#" target="_blank" rel="noopener"&gt;Make Validation Column.&lt;/A&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;First, go to the corresponding menu to create a validation column:&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Capture d'écran 2024-05-05 124224.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/63923iB0E4A583913814C3/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Capture d'écran 2024-05-05 124224.png" alt="Capture d'écran 2024-05-05 124224.png" /&gt;&lt;/span&gt;&lt;/LI&gt;
&lt;LI&gt;Then you can specify groups or stratification (if needed):&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Victor_G_0-1714905910194.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/63924iAECDBD636806834E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Victor_G_0-1714905910194.png" alt="Victor_G_0-1714905910194.png" /&gt;&lt;/span&gt;
&lt;P&gt;You have several options for the method. Here I used the default one corresponding to your needs, "Make Validation Column", and I used some variables as stratification columns to make sure the training, validation, and test sets all have a similar distribution of values for those columns.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Then you can specify the ratio of data for each of your sets, and specify a random seed for perfect reproducibility of the data splitting. Here I chose a fixed validation type, but you can also create a formula validation column, which lets you simulate various training, validation, and test sets with the same settings (grouping, stratification, etc.) as specified in step 2. Here my data ratios are 70% for training, 20% for validation, and 10% for test (usual values; for more information and research on this topic, you can read the paper "Optimal Ratio for Data Splitting" by V. Roshan Joseph: &lt;A href="https://arxiv.org/pdf/2202.03326" target="_blank" rel="noopener"&gt;https://arxiv.org/pdf/2202.03326&lt;/A&gt;). It's important to find a good compromise for the ratios depending on the assessment accuracy/fairness and model selection needed, and on the number of rows available. When you're done, click "Go" to create the new validation column in your table:&lt;/P&gt;
&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Victor_G_1-1714906109803.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/63925iF7EE6281F27421F7/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Victor_G_1-1714906109803.png" alt="Victor_G_1-1714906109803.png" /&gt;&lt;/span&gt;&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Finally, launch the Model Screening platform and use your newly created validation column in the "Validation" panel. You can also set a random seed at this stage to have reproducible results across the different models and sets used:&lt;/P&gt;
&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Victor_G_2-1714906346339.png" style="width: 400px;"&gt;&lt;img src="https://community.jmp.com/t5/image/serverpage/image-id/63926i947BC7F72B690667/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Victor_G_2-1714906346339.png" alt="Victor_G_2-1714906346339.png" /&gt;&lt;/span&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/LI&gt;
&lt;/OL&gt;
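&lt;P&gt;If you prefer scripting the split, the steps above can be sketched in a short JSL script equivalent to a fixed 70/20/10 split. This is only an illustration under assumptions: the seed 1234 is an arbitrary placeholder, and it uses the numeric 0/1/2 coding (0 = Training, 1 = Validation, 2 = Test) that JMP Pro platforms recognize for a validation column. Note that a plain random formula does not stratify; stratification as in step 2 needs the Make Validation Column platform itself:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;dt = Current Data Table();

// Fix the random seed so the split is perfectly reproducible
Random Reset( 1234 );

// 0 = Training (70%), 1 = Validation (20%), 2 = Test (10%)
dt &amp;lt;&amp;lt; New Column( "Validation", Numeric, Nominal,
	Formula(
		u = Random Uniform();
		If( u &amp;lt; 0.7, 0, u &amp;lt; 0.9, 1, 2 )
	)
);

// Freeze the assignments so the formula never re-randomizes them
Column( dt, "Validation" ) &amp;lt;&amp;lt; Delete Formula;&lt;/CODE&gt;&lt;/PRE&gt;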
&lt;P&gt;Just some clarifications about the terms validation and test, as they are very often used interchangeably, which causes confusion:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Training set: used for&amp;nbsp;&lt;SPAN&gt;the actual training of the model(s),&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;Validation set: used for model optimization (for example, hyperparameter fine-tuning or feature/threshold selection) and model selection,&lt;/LI&gt;
&lt;LI&gt;Test set: used to assess the generalization and predictive performance of the selected model on new/unseen data.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;So you can evaluate several models with the validation set, but the test set should be used only once, for the model selected in the validation phase.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hope this answer helps,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
</description>
      <pubDate>Sun, 05 May 2024 11:09:22 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/model-screening/m-p/752120#M93348</guid>
      <dc:creator>Victor_G</dc:creator>
      <dc:date>2024-05-05T11:09:22Z</dc:date>
    </item>
    <item>
      <title>Re: model screening</title>
      <link>https://community.jmp.com/t5/Discussions/model-screening/m-p/752121#M93349</link>
      <description>&lt;P&gt;thanks&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 05 May 2024 11:25:23 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/model-screening/m-p/752121#M93349</guid>
      <dc:creator>maryam_nourmand</dc:creator>
      <dc:date>2024-05-05T11:25:23Z</dc:date>
    </item>
    <item>
      <title>Re: model screening</title>
      <link>https://community.jmp.com/t5/Discussions/model-screening/m-p/752122#M93350</link>
      <description>&lt;P&gt;If you have JMP Pro, it has a platform to create a Validation Column, and within the various modeling platforms you can specify the column that indicates which rows are Training rows and which are Validation rows.&lt;/P&gt;
&lt;P&gt;If you have standard JMP, you can easily create a Validation column by using the Random Uniform() function in the formula for the new Validation column. (Make sure you specify the column as a Character column.)&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;If( Random Uniform() &amp;gt;= 0.7, "Validation", "Training" )&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;After the column is created, go to the Column Info window and remove the formula. This makes the column's values static, ensuring that the formula is not re-evaluated later and the values do not change.&lt;/P&gt;
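&lt;P&gt;These two steps (create the formula column, then freeze it) can also be done in one short JSL script. This is only a sketch, assuming the table of interest is the current data table; the 0.7 cutoff assigns roughly 30% of rows to "Validation":&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-jsl"&gt;dt = Current Data Table();

// Character column whose formula assigns ~30% of rows to "Validation"
dt &amp;lt;&amp;lt; New Column( "Validation", Character, Nominal,
	Formula( If( Random Uniform() &amp;gt;= 0.7, "Validation", "Training" ) )
);

// Remove the formula so the values become static and never re-randomize
Column( dt, "Validation" ) &amp;lt;&amp;lt; Delete Formula;&lt;/CODE&gt;&lt;/PRE&gt;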
&lt;P&gt;You can then Exclude and Hide the Validation rows while you build your model. When you want to validate the model, clear Hide and Exclude on the Validation rows, set Hide and Exclude on the Training rows, and rerun the model.&lt;/P&gt;
&lt;P class="N1bullet"&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 05 May 2024 11:29:45 GMT</pubDate>
      <guid>https://community.jmp.com/t5/Discussions/model-screening/m-p/752122#M93350</guid>
      <dc:creator>txnelson</dc:creator>
      <dc:date>2024-05-05T11:29:45Z</dc:date>
    </item>
  </channel>
</rss>

