One or more Validation column

madhu · Jan 5, 2025 04:12 PM

In the attached JMP file, I built three separate regression models, one for each response variable (weight, turning circle and horsepower), each with its own validation column (Training set = 70%, Validation set = 15%, Test set = 15%).

Is it necessary to create a separate validation column for each response variable when building the regression models, or is a single validation column enough?

Victor_G · Jan 6, 2025 5:01 AM

Hi @madhu,

Happy New Year 2025!

There is no single right or wrong answer to your question.
Since you're building predictive models in a machine learning (i.e. data-driven) way for different responses, here is how the different sets are used:

Training set: used for the actual training of the model(s).
Validation set: used for model optimization (for example hyperparameter fine-tuning, feature/threshold selection) and model selection.
Test set: used to assess the generalization and predictive performance of the selected model on new/unseen data. At this stage you should no longer have several models in competition.
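To make the three roles concrete, here is a minimal sketch in plain Python (not JSL). Everything in it is an assumption made for illustration: the data are synthetic, the two candidate models (a mean-only baseline and a straight line) are placeholders, and the helper names are hypothetical. It only shows the order of operations: fit on Training, choose on Validation, assess once on Test.

```python
import random

def make_validation_column(n_rows, seed=1234, fractions=(0.70, 0.15, 0.15)):
    """Return one label per row: 'Training', 'Validation' or 'Test'."""
    n_train = round(fractions[0] * n_rows)
    n_valid = round(fractions[1] * n_rows)
    labels = (["Training"] * n_train
              + ["Validation"] * n_valid
              + ["Test"] * (n_rows - n_train - n_valid))
    random.Random(seed).shuffle(labels)  # fixed seed: same split on every run
    return labels

def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Candidate: least-squares line y = intercept + slope * x."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def rmse(model, xs, ys):
    """Root mean squared error of a fitted model on the given rows."""
    return (sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

# Synthetic data, made up for illustration only.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(100)]
y = [3 + 2 * xi + rng.gauss(0, 1) for xi in x]
labels = make_validation_column(len(x))

def rows(role):
    """Return the x and y values of the rows assigned to one role."""
    keep = [i for i, lab in enumerate(labels) if lab == role]
    return [x[i] for i in keep], [y[i] for i in keep]

# 1) Training set: fit every candidate model.
candidates = {"mean": fit_mean(*rows("Training")),
              "line": fit_line(*rows("Training"))}

# 2) Validation set: select one candidate.
best_name = min(candidates,
                key=lambda name: rmse(candidates[name], *rows("Validation")))

# 3) Test set: assess only the selected model, once.
test_rmse = rmse(candidates[best_name], *rows("Test"))
print(best_name, round(test_rmse, 3))
```

The test rows are touched only in the last step, after the competition between candidates is over.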

For "debugging", I would find it more useful to use one validation column for all three responses. With the same training and validation data for every response, you can see how much information can be extracted from the data and compare more fairly how difficult the predictive task is for each response.

If you change the allocation of observations to the training, validation and test sets by using different validation columns, it becomes harder to figure out why a certain response is not predicted correctly or precisely: is it because of the stratification/allocation of the points used to fit the model, or because this response is simply more difficult to predict?
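The point about a shared validation column can be sketched as follows, again in plain Python rather than JSL. The table is synthetic: the column names echo the three responses from the question, but the values, the single predictor `x` and the noise levels are invented for illustration. Because every response is fitted and scored on exactly the same rows, a difference in validation R² can only come from the response, not from the split.

```python
import random

rng = random.Random(0)
# Hypothetical noise level (standard deviation) for each response.
responses = {"weight": 0.5, "turning_circle": 2.0, "horsepower": 6.0}

# Synthetic table: one predictor x and three responses per row.
table = []
for _ in range(200):
    xi = rng.uniform(0, 10)
    row = {"x": xi}
    for name, noise_sd in responses.items():
        row[name] = 2 * xi + rng.gauss(0, noise_sd)
    table.append(row)

# ONE validation column (70/15/15), shared by all three responses.
labels = ["Training"] * 140 + ["Validation"] * 30 + ["Test"] * 30
random.Random(1234).shuffle(labels)
for row, lab in zip(table, labels):
    row["Validation"] = lab

def validation_r2(response):
    """Fit a least-squares line on Training rows, return R^2 on Validation rows."""
    train = [r for r in table if r["Validation"] == "Training"]
    valid = [r for r in table if r["Validation"] == "Validation"]
    mean_x = sum(r["x"] for r in train) / len(train)
    mean_y = sum(r[response] for r in train) / len(train)
    slope = (sum((r["x"] - mean_x) * (r[response] - mean_y) for r in train)
             / sum((r["x"] - mean_x) ** 2 for r in train))
    intercept = mean_y - slope * mean_x
    mean_valid = sum(r[response] for r in valid) / len(valid)
    sse = sum((r[response] - (intercept + slope * r["x"])) ** 2 for r in valid)
    sst = sum((r[response] - mean_valid) ** 2 for r in valid)
    return 1 - sse / sst

r2 = {name: validation_r2(name) for name in responses}
for name, value in r2.items():
    print(f"{name}: validation R^2 = {value:.3f}")
```

With three different validation columns, the same comparison would mix two effects: the difficulty of each response and the luck of each split.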

On a side note, even if you use one validation column, I would recommend creating a stratified validation column with a formula. Stratifying on the features helps ensure that the distributions of your training, validation and test sets are similar, which prevents data distribution shifts between the sets that could compromise how well the fitted model generalizes. Using a formula-type validation column also lets you run simulations and assess the robustness of your algorithm across various similar splits of your data.
See Solved: How can I automate and summarize many repeat validations into one output table? - JMP User C... and Solved: Boosted Tree - Tuning TABLE DESIGN - JMP User Community
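As a rough illustration of the stratification idea, here is a plain-Python sketch (it does not reproduce the JMP formula column itself). It assumes a single categorical feature, here a made-up `car_type` column, as the stratification variable; a continuous feature would first need to be binned. The function name and the data are hypothetical. Changing the seed gives a new split with the same per-stratum proportions, which is the basis for repeating the fit over several similar splits.

```python
import random
from collections import defaultdict

def stratified_validation_column(strata, seed, fractions=(0.70, 0.15, 0.15)):
    """Assign Training/Validation/Test within each stratum separately.

    strata: one stratum key per row. Returns one label per row.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for i, key in enumerate(strata):
        by_stratum[key].append(i)
    labels = [None] * len(strata)
    for key in sorted(by_stratum):
        idx = by_stratum[key]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_valid = round(fractions[1] * len(idx))
        for pos, i in enumerate(idx):
            if pos < n_train:
                labels[i] = "Training"
            elif pos < n_train + n_valid:
                labels[i] = "Validation"
            else:
                labels[i] = "Test"
    return labels

# Hypothetical categorical feature used for stratification.
car_type = ["Small"] * 40 + ["Medium"] * 40 + ["Large"] * 20

# Several similar splits: same proportions per stratum, different rows.
splits = {seed: stratified_validation_column(car_type, seed) for seed in range(5)}

def count(labels, role, stratum):
    """Number of rows of one stratum that were assigned to one role."""
    return sum(1 for lab, s in zip(labels, car_type) if lab == role and s == stratum)

for seed, labels in splits.items():
    summary = {role: count(labels, role, "Large")
               for role in ("Training", "Validation", "Test")}
    print(seed, summary)
```

Each split could then be used to refit the models, and the spread of the validation metric across seeds gives an idea of how sensitive the results are to the split.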

You can also use a K-fold cross-validation column with a fixed random seed (to ensure reproducibility of the model results) to assess the robustness and predictive performance of your algorithm on the different responses.
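A minimal sketch of the K-fold idea in plain Python (not JSL), under the same caveats: synthetic data, invented noise levels, a straight-line model as a stand-in for the real one, and hypothetical helper names. The fold column is created once with a fixed seed and reused for every response, so the results are reproducible and comparable.

```python
import random
import statistics

def kfold_column(n_rows, k=5, seed=1234):
    """Return a fold index (0..k-1) per row; a fixed seed makes it reproducible."""
    folds = [i % k for i in range(n_rows)]
    random.Random(seed).shuffle(folds)
    return folds

def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

# Synthetic data, made up for illustration only.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(100)]
noise_sd = {"weight": 0.5, "turning_circle": 2.0, "horsepower": 6.0}
data = {name: [2 * xi + rng.gauss(0, sd) for xi in x]
        for name, sd in noise_sd.items()}

# One fold column, shared by all responses.
folds = kfold_column(len(x), k=5, seed=1234)

def cv_rmse(y, k=5):
    """Hold out each fold in turn and return the k held-out RMSE values."""
    scores = []
    for held_out in range(k):
        train = [i for i, f in enumerate(folds) if f != held_out]
        test = [i for i, f in enumerate(folds) if f == held_out]
        model = fit_line([x[i] for i in train], [y[i] for i in train])
        mse = sum((model(x[i]) - y[i]) ** 2 for i in test) / len(test)
        scores.append(mse ** 0.5)
    return scores

results = {name: cv_rmse(y) for name, y in data.items()}
for name, scores in results.items():
    print(f"{name}: RMSE {statistics.mean(scores):.2f} "
          f"+/- {statistics.stdev(scores):.2f}")
```

The spread of the RMSE across folds is a simple indicator of robustness: a response whose score changes a lot from fold to fold is more sensitive to which rows are used for fitting.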

Hope this answer will help you,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)

madhu · Jan 7, 2025 07:25 AM

Hello @Victor_G

Thank you for the comments on my query. They make sense. Regards
