cross validation using k-fold fit quality

daniel_s
Level I

I am using a Lasso fit with leave-one-out or K-fold cross-validation. Please advise the best way to view R-square and other fit quality metrics (e.g., AIC) in the output. It would be helpful to have these for both the training set and the validation set (average of all holdouts).

 


4 REPLIES
Victor_G
Super User


Re: cross validation using k-fold fit quality

Hi @daniel_s,

 

Welcome to the Community!

 

To get performance metrics for your LASSO regression on the individual folds and on average, I think the easiest way is to launch Generalized Regression from the Model Screening platform: check only the options "Generalized Regression" and "Additional Methods", specify the types of terms that can enter the model (for example, interactions and quadratic effects), the number of folds, and a seed for reproducibility (if needed):

[screenshot: Victor_G_0-1743008501964.png]

Once the platform is launched, a new window opens with all the information about the individual folds and a summary:

[screenshot: Victor_G_1-1743008541354.png]
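
For readers who want to double-check the same bookkeeping outside JMP, here is a minimal sketch in Python with scikit-learn: fit a Lasso with K-fold cross-validation and report the R-square per held-out fold, on average, and on the training folds. This is only an illustration, not JMP's implementation; the data, penalty value, and number of folds are placeholder assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_validate

# Placeholder data standing in for your own data table
X, y = make_regression(n_samples=100, n_features=8, noise=5.0, random_state=1)

folds = KFold(n_splits=5, shuffle=True, random_state=1)  # fixed seed for reproducibility
model = Lasso(alpha=0.1)                                 # assumed penalty value

# R^2 on each held-out fold plus the corresponding training R^2
cv = cross_validate(model, X, y, cv=folds, scoring="r2", return_train_score=True)

print("Validation R^2 per fold:", np.round(cv["test_score"], 3))
print(f"Mean validation R^2: {cv['test_score'].mean():.3f}")
print(f"Mean training R^2:   {cv['train_score'].mean():.3f}")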

 

Hope this answer helps you,

 

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)
daniel_s
Level I


Re: cross validation using k-fold fit quality

Thank you. That is helpful. I gather that I would then select the best fold and use "Run Selected", which would give me the Lasso fit results with the best fold used as validation.

Victor_G
Super User


Re: cross validation using k-fold fit quality (Accepted Solution)

Hi @daniel_s,

 

After launching your LASSO model with K-fold cross-validation, there are indeed several ways to proceed with the results:

  1. Choose the best-performing LASSO model based on the "best" validation fold: Not recommended, as this would look like "cherry picking" rather than an honest assessment and selection procedure. It amounts to selecting the right data for the model instead of fitting the right model to your data, so you may end up overfitting your validation data.
  2. After assessing the consistency and robustness of the results, retrain the model on all data: This approach can seem logical; once you have confirmed that your model is robust and gives similar results across all folds, you could be tempted to use all the data to further improve it. It can be a viable option if you are sure that the model's parameters (for example, the terms included and the penalty value) can be kept the same between the fit with cross-validation and the fit on all the data, so that the model does not overfit the whole dataset. The drawback is that you lose sight of model validation, so if anything goes wrong on the test data, it is hard to debug the model without validation data.
  3. Create a model average of your K models: This is my preferred approach (when possible); a scripted sketch of this averaging step follows the list. Once you have your K models, you can run each of them and save their prediction formulas using "Publish Prediction Formula" to store them in the Formula Depot.
    [screenshot: Victor_G_0-1743062832794.png]

    Once the model formulas are in the Formula Depot, you can click the red triangle next to "Formula Depot" and select "Model Comparison". This creates a short summary of your models' performance, and if you click the red triangle next to "Model Comparison", you can create a Model Averaging:

    [screenshot: Victor_G_1-1743063059364.png]

    This option creates a new formula in your data table that corresponds to the average equation of your K models (in my case, the average of my 5 individual cross-validated models):

    [screenshot: Victor_G_2-1743063269688.png]

    You can then compare the performance of your K individual models and your averaged model once again using the same Model Comparison platform.
    Note that this approach may not be easy or feasible if you have a large number of folds and/or if the models are complex (like Neural Networks). For simple models such as regressions and Machine Learning "base models" (Decision Tree, SVM, kNN, ...), it helps avoid overfitting and ensures robustness and generalization, without "losing" any data to validation.
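
Below is a minimal sketch of the averaging step in option 3, written in Python with scikit-learn rather than JSL, just to make the mechanics concrete: one Lasso is fitted per training fold, and the "averaged model" is simply the mean of the K individual predictions, which is the role the averaged formula plays in the data table. The data and penalty value are placeholder assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Placeholder data and penalty value
X, y = make_regression(n_samples=100, n_features=8, noise=5.0, random_state=1)
folds = KFold(n_splits=5, shuffle=True, random_state=1)

fold_models = []
for k, (train_idx, val_idx) in enumerate(folds.split(X), start=1):
    m = Lasso(alpha=0.1).fit(X[train_idx], y[train_idx])  # one model per training fold
    fold_models.append(m)
    r2_val = r2_score(y[val_idx], m.predict(X[val_idx]))  # honest per-fold assessment
    print(f"fold {k}: validation R^2 = {r2_val:.3f}")

def predict_averaged(X_new):
    """Average of the K individual predictions (the 'Model Averaging' formula)."""
    return np.mean([m.predict(X_new) for m in fold_models], axis=0)

# Compare the averaged model against the individual fold models on the full table
print(f"Averaged model, full-data R^2: {r2_score(y, predict_averaged(X)):.3f}")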

 

You can read more about cross-validation in the following posts:
CROSS VALIDATION - VALIDATION COLUMN METHOD 

k-fold r2 

I also highly recommend the playlist "Making Friends with Machine Learning" by Cassie Kozyrkov to learn more about model training, validation, and testing: Making Friends with Machine Learning - YouTube

 

Hope this response helps you and answers your questions,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)
daniel_s
Level I


Re: cross validation using k-fold fit quality

Thank you! Your insights were just right for the machine learning trajectory I am on.