madhu
Level III

One or more Validation column

In the attached JMP file, I built three separate regression models, one for each response variable (weight, turning circle, and horsepower), with three separate validation columns (Training set = 70%, Validation set = 15%, Test set = 15%).

Is it necessary to create a validation column for each response variable, or is a single validation column, created with any one response, enough to build all the regression models?

1 ACCEPTED SOLUTION

Accepted Solutions
Victor_G
Super User

Re: One or more Validation column

Hi @madhu,

 

Happy New Year 2025!

 

There is no single right or wrong answer to your question.
Since you're building predictive models in a Machine Learning (i.e., data-driven) way for different responses, here is how the different sets are used (a short sketch follows the list):

  • Training set: used for the actual training of the model(s),
  • Validation set: used for model optimization (e.g., hyperparameter fine-tuning, feature/threshold selection) and for model selection,
  • Test set: used to assess the generalization and predictive performance of the selected model on new/unseen data. At this stage you should no longer have several models in competition.
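Outside of JMP, a minimal Python/scikit-learn sketch of the same 70/15/15 split could look like the following (JMP handles this through the validation column itself; the file name and DataFrame below are assumptions, not your attached table):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("cars.csv")  # hypothetical file holding the predictors and responses

# Carve off the 15% test set first, then split the remaining 85% so that
# 0.15 / 0.85 of it becomes the validation set (~ 70/15/15 overall).
train_val, test = train_test_split(df, test_size=0.15, random_state=1)
train, val = train_test_split(train_val, test_size=0.15 / 0.85, random_state=1)

print(len(train) / len(df), len(val) / len(df), len(test) / len(df))  # ~ 0.70, 0.15, 0.15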

For "debugging", I would consider it more useful to use one validation column for all three responses: with the same training and validation data, you can see how much information can be extracted from the data and get a fairer assessment/comparison of how difficult the predictive task is for each response.

If you change the allocation of observations to the training, validation and test sets by using different validation columns, it becomes harder to figure out why a certain response is not predicted correctly/precisely: is it because of the stratification/allocation of the points used to fit the model, or because that response is simply more difficult to predict?
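To illustrate the shared-split idea, continuing the hypothetical Python sketch above, you could fit one regression per response on exactly the same rows, so that differences in validation performance reflect the response rather than the split (the column names are assumptions based on your post):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

responses = ["Weight", "Turning Circle", "Horsepower"]  # assumed column names
predictors = df.select_dtypes("number").columns.difference(responses)

# Same train/val rows for every response, so performance differences
# reflect the difficulty of the response, not the luck of the split.
for response in responses:
    model = LinearRegression().fit(train[predictors], train[response])
    r2 = r2_score(val[response], model.predict(val[predictors]))
    print(f"{response}: validation R2 = {r2:.3f}")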

 

On a side note, even if you use one validation column, I would recommend creating a stratified validation column formula: stratifying on the features helps ensure that the distributions of your training, validation and test sets are similar, preventing data distribution shifts between the sets that could compromise the generalization of your fitted model. Using a formula validation column also enables you to run simulations and assess the robustness of your algorithm across various similar splits of your data.
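In the same hypothetical Python terms, one simple way to stratify on a continuous column is to bin it and pass the bins to the splitter (JMP's stratified validation column does this for you; stratifying on "Weight" here is just an illustrative assumption):

# Bin a key column into quartiles and stratify on the bins so that
# train/validation/test get similar distributions of that column.
strata = pd.qcut(df["Weight"], q=4, labels=False)

train_val, test = train_test_split(df, test_size=0.15, random_state=1, stratify=strata)
train, val = train_test_split(
    train_val,
    test_size=0.15 / 0.85,
    random_state=1,
    stratify=strata.loc[train_val.index],
)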
See Solved: How can I automate and summarize many repeat validations into one output table? - JMP User C... and Solved: Boosted Tree - Tuning TABLE DESIGN - JMP User Community.

You can also use a K-Fold cross-validation column with a fixed random seed (to ensure reproducibility of the model results) to assess the robustness and predictive performance of your algorithm on the several responses.
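As a last hypothetical sketch, a scikit-learn analogue of a K-Fold column with a fixed seed (reusing df, responses and predictors from the sketches above):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5, shuffle=True, random_state=1)  # fixed seed => identical folds on every run

for response in responses:
    scores = cross_val_score(LinearRegression(), df[predictors], df[response], cv=cv, scoring="r2")
    print(f"{response}: mean R2 = {scores.mean():.3f} (+/- {scores.std():.3f})")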

 

I hope this answer helps,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)


2 REPLIES 2
madhu
Level III

Re: One or more Validation column

Hello @Victor_G 

Thank you for the comments on my query. They make sense. Regards