In the attached JMP file, I built three separate regression models, one for each response variable (weight, turning circle, and horsepower), each with its own validation column (training set = 70%, validation set = 15%, test set = 15%).
Is it necessary to create a separate validation column for each response variable to build the regression models, or is one validation column, created with any one of the response variables, enough?
Hi @madhu,
Happy New Year 2025!
There is no right or wrong answer to your question.
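Since you're building predictive models in a Machine Learning (i.e., data-driven) way for different responses, here is how the different sets are used: the training set is used to fit the model, the validation set to tune and select models, and the test set to assess the selected model on unseen data.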
For "debugging" purposes, I would find it more useful to use one validation column for all three responses: with the same training and validation data, you can see how much information can be extracted from the data, and better assess and compare how difficult the predictive task is for each response.
If you change how observations are allocated to your training, validation and test sets by using different validation columns, it becomes harder to figure out why a certain response is not predicted accurately: is it because of the way the points used to fit the model were split, or because this response is inherently more difficult to predict?
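To illustrate the idea outside JMP, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for illustration: the data are synthetic, the single predictor `x` and the noise levels are made up, and `make_split`, `fit_line` and `r_squared` are hypothetical helpers, not JMP functions. It only shows that, with one shared split, differences in validation R² between responses reflect the responses themselves rather than the split.

```python
import random


def make_split(n_rows, fractions=(0.70, 0.15, 0.15), seed=2025):
    """Return one label per row, shared by every response."""
    rng = random.Random(seed)          # fixed seed: the split is reproducible
    order = list(range(n_rows))
    rng.shuffle(order)
    n_train = round(fractions[0] * n_rows)
    n_valid = round(fractions[1] * n_rows)
    labels = [None] * n_rows
    for rank, row in enumerate(order):
        if rank < n_train:
            labels[row] = "Training"
        elif rank < n_train + n_valid:
            labels[row] = "Validation"
        else:
            labels[row] = "Test"
    return labels


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """R² of the fitted line on the given rows."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Synthetic data (assumption): one predictor, three responses with
# increasing noise, so they differ in how hard they are to predict.
data_rng = random.Random(0)
n = 200
x = [data_rng.uniform(0, 10) for _ in range(n)]
responses = {
    "weight": [3.0 * xi + data_rng.gauss(0, 1.0) for xi in x],
    "turning_circle": [1.0 * xi + data_rng.gauss(0, 3.0) for xi in x],
    "horsepower": [0.5 * xi + data_rng.gauss(0, 6.0) for xi in x],
}

split = make_split(n)                  # ONE split for all three responses
train = [i for i, s in enumerate(split) if s == "Training"]
valid = [i for i, s in enumerate(split) if s == "Validation"]

valid_r2 = {}
for name, y in responses.items():
    slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
    valid_r2[name] = r_squared(
        [x[i] for i in valid], [y[i] for i in valid], slope, intercept
    )
    print(f"{name}: validation R2 = {valid_r2[name]:.3f}")
```

Because the rows in each set are identical for the three fits, a lower validation R² for one response cannot be blamed on an unlucky split.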
On a side note, even if you use one validation column, I would recommend creating a stratified validation column with a formula: stratifying on the features helps ensure that the distributions of your training, validation and test sets are similar, which prevents distribution shifts between the sets that could compromise how well the fitted model generalizes. Using a formula-type validation column also lets you run simulations and assess the robustness of your algorithm across various similar splits of your data.
See Solved: How can I automate and summarize many repeat validations into one output table? - JMP User C...
and Solved: Boosted Tree - Tuning TABLE DESIGN - JMP User Community
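As a rough illustration of what a stratified 70/15/15 split does, here is a minimal Python sketch (standard library only). It is not how JMP builds its validation column; `stratified_split` and the `strata` labels are hypothetical, and it assumes stratification on a single categorical feature (a continuous feature would need to be binned first). Changing the seed gives a different split with the same proportions in every stratum, which is the idea behind simulating several similar splits.

```python
import random
from collections import defaultdict


def stratified_split(strata, fractions=(0.70, 0.15, 0.15), seed=1):
    """Assign Training/Validation/Test within each stratum separately."""
    rng = random.Random(seed)
    rows_by_stratum = defaultdict(list)
    for row, level in enumerate(strata):
        rows_by_stratum[level].append(row)

    labels = [None] * len(strata)
    for level in sorted(rows_by_stratum):      # sorted: deterministic order
        rows = rows_by_stratum[level]
        rng.shuffle(rows)
        n_train = round(fractions[0] * len(rows))
        n_valid = round(fractions[1] * len(rows))
        for rank, row in enumerate(rows):
            if rank < n_train:
                labels[row] = "Training"
            elif rank < n_train + n_valid:
                labels[row] = "Validation"
            else:
                labels[row] = "Test"
    return labels


def counts_per_stratum(strata, labels):
    """Count rows per (stratum, set) pair to check the proportions."""
    counts = defaultdict(int)
    for level, label in zip(strata, labels):
        counts[(level, label)] += 1
    return dict(counts)


# Hypothetical categorical feature with unbalanced levels.
strata = ["A"] * 100 + ["B"] * 40 + ["C"] * 20

split_1 = stratified_split(strata, seed=1)
split_2 = stratified_split(strata, seed=2)     # another, similar split

for key, value in sorted(counts_per_stratum(strata, split_1).items()):
    print(key, value)
```

Each level keeps the 70/15/15 proportions in both splits, so the three sets have the same distribution of this feature whichever seed is used.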
You can also use a K-Fold cross-validation column with a fixed random seed (to ensure reproducibility of model results) to assess the robustness and predictive performance of your algorithm across the several responses.
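The same idea can be sketched in Python (standard library only). This is only an illustration under assumptions, not JMP's implementation: `kfold_column` is a hypothetical helper, and the row count, `k` and seed are arbitrary. The fixed seed makes the fold assignment reproducible, and the same column can be reused for every response.

```python
import random
from collections import Counter


def kfold_column(n_rows, k=5, seed=2025):
    """Return a fold number (1..k) for each row, reproducible via the seed."""
    rng = random.Random(seed)
    order = list(range(n_rows))
    rng.shuffle(order)
    folds = [0] * n_rows
    for rank, row in enumerate(order):
        folds[row] = rank % k + 1      # deal rows out to folds in turn
    return folds


folds = kfold_column(103, k=5, seed=2025)
print(Counter(folds))                  # fold sizes differ by at most one row

# For each fold, the held-out rows are those with that fold number and the
# remaining rows are used for fitting; the same rows apply to every response.
for fold in range(1, 6):
    held_out = [row for row, f in enumerate(folds) if f == fold]
    fit_rows = [row for row, f in enumerate(folds) if f != fold]
    print(fold, len(fit_rows), len(held_out))
```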
Hope this answer will help you,
Hello @Victor_G
Thank you for the comments on my query. They make sense. Regards