cross validation using k-fold fit quality

daniel_s
Level I

I am using a Lasso fit with leave-one-out or K-fold cross-validation. Please advise the best way to view R-square and other fit quality metrics (e.g., AIC) in the output. It would be helpful to have these for both the training set and the validation set (average of all holdouts).

 


4 REPLIES
Victor_G
Super User


Re: cross validation using k-fold fit quality

Hi @daniel_s,

 

Welcome to the Community!

 

To get performance metrics for your LASSO regression on the individual folds and on average, I think the easiest way is to launch Generalized Regression from the Model Screening platform: check only the options "Generalized Regression" and "Additional Methods", specify the types of terms that can enter the model (for example, interactions and quadratic effects), the number of folds, and a seed for reproducibility (if needed):

[screenshot: Victor_G_0-1743008501964.png]

Once the platform is launched, a new window opens with all the information about the individual folds and a summary:

[screenshot: Victor_G_1-1743008541354.png]
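
For readers who want to double-check the same bookkeeping outside JMP, here is a minimal sketch in Python with scikit-learn: fit a Lasso with K-fold cross-validation and report the R-square per held-out fold, on average, and on the training folds. This is only an illustration, not JMP's implementation; the data, penalty value, and number of folds are placeholder assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_validate

# Placeholder data standing in for your own data table
X, y = make_regression(n_samples=100, n_features=8, noise=5.0, random_state=1)

folds = KFold(n_splits=5, shuffle=True, random_state=1)  # fixed seed for reproducibility
model = Lasso(alpha=0.1)                                 # assumed penalty value

# R^2 on each held-out fold plus the corresponding training R^2
cv = cross_validate(model, X, y, cv=folds, scoring="r2", return_train_score=True)

print("Validation R^2 per fold:", np.round(cv["test_score"], 3))
print(f"Mean validation R^2: {cv['test_score'].mean():.3f}")
print(f"Mean training R^2:   {cv['train_score'].mean():.3f}")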

 

Hope this answer helps you,

 

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)
daniel_s
Level I


Re: cross validation using k-fold fit quality

Thank you. That is helpful. I gather that I would then select the best fold and use "Run Selected", which would give me the Lasso fit results with the best fold used as validation.

Victor_G
Super User


Re: cross validation using k-fold fit quality (Accepted Solution)

Hi @daniel_s,

 

After launching your LASSO model with K-fold cross-validation, there are indeed several ways to proceed with the results:

  1. Choose the best-performing LASSO model based on the "best" validation fold: Not recommended, as this would look like "cherry picking" rather than an honest assessment and selection procedure. It amounts to selecting the right data for the model instead of fitting the right model to your data, so you may end up overfitting your validation data.
  2. After assessing the consistency and robustness of the results, retrain the model on all data: This approach can seem logical; once you have confirmed that your model is robust and gives similar results across all folds, you could be tempted to use all the data to further improve it. It can be a viable option if you are sure that the model's parameters (for example, the terms included and the penalty value) can be kept the same between the fit with cross-validation and the fit on all the data, so that the model does not overfit the whole dataset. The drawback is that you lose sight of model validation, so if anything goes wrong on the test data, it is hard to debug the model without validation data.
  3. Create a model average of your K models: This is my preferred approach (when possible); a scripted sketch of this averaging step follows the list. Once you have your K models, you can run each of them and save their prediction formulas using "Publish Prediction Formula" to store them in the Formula Depot.
    [screenshot: Victor_G_0-1743062832794.png]

    Once the model formulas are in the Formula Depot, you can click the red triangle next to "Formula Depot" and select "Model Comparison". This creates a short summary of your models' performance, and if you click the red triangle next to "Model Comparison", you can create a Model Averaging:

    [screenshot: Victor_G_1-1743063059364.png]

    This option creates a new formula in your data table that corresponds to the average equation of your K models (in my case, the average of my 5 individual cross-validated models):

    [screenshot: Victor_G_2-1743063269688.png]

    You can then compare the performance of your K individual models and your averaged model once again using the same Model Comparison platform.
    Note that this approach may not be easy or feasible if you have a large number of folds and/or if the models are complex (like Neural Networks). For simple models such as regressions and Machine Learning "base models" (Decision Tree, SVM, kNN, ...), it helps avoid overfitting and ensures robustness and generalization, without "losing" any data to validation.
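
Below is a minimal sketch of the averaging step in option 3, written in Python with scikit-learn rather than JSL, just to make the mechanics concrete: one Lasso is fitted per training fold, and the "averaged model" is simply the mean of the K individual predictions, which is the role the averaged formula plays in the data table. The data and penalty value are placeholder assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Placeholder data and penalty value
X, y = make_regression(n_samples=100, n_features=8, noise=5.0, random_state=1)
folds = KFold(n_splits=5, shuffle=True, random_state=1)

fold_models = []
for k, (train_idx, val_idx) in enumerate(folds.split(X), start=1):
    m = Lasso(alpha=0.1).fit(X[train_idx], y[train_idx])  # one model per training fold
    fold_models.append(m)
    r2_val = r2_score(y[val_idx], m.predict(X[val_idx]))  # honest per-fold assessment
    print(f"fold {k}: validation R^2 = {r2_val:.3f}")

def predict_averaged(X_new):
    """Average of the K individual predictions (the 'Model Averaging' formula)."""
    return np.mean([m.predict(X_new) for m in fold_models], axis=0)

# Compare the averaged model against the individual fold models on the full table
print(f"Averaged model, full-data R^2: {r2_score(y, predict_averaged(X)):.3f}")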

 

You can read more about cross-validation in the following posts:
CROSS VALIDATION - VALIDATION COLUMN METHOD 

k-fold r2 

I also highly recommend the playlist "Making Friends with Machine Learning" by Cassie Kozyrkov to learn more about model training, validation, and testing: Making Friends with Machine Learning - YouTube

 

Hope this response helps you and answers your questions,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)
daniel_s
Level I


Re: cross validation using k-fold fit quality

Thank you! Your insights were just right for the machine learning trajectory I am on.