Hello @Moukanni,
- I am not sure what your objective is behind masking your data manually and/or choosing cross-validation.
If you want a test set (a set not used for training and not seen during validation) to assess how the PLS model performs on "new"/unseen data, and provided you have a large dataset, then yes: you can manually hide a portion of your dataset (hide & exclude the rows, run the model, save the prediction formula, and compare predicted vs. actual responses on the hidden rows). Or, if you have JMP Pro, create a validation column (in "Analyze", "Predictive Modeling", "Make Validation Column") where you specify the proportion of rows in your training, validation and/or test sets. The first sketch after this paragraph illustrates the hold-out idea.
If you want to validate your model through K-fold cross-validation, JMP will automatically split your dataset into K parts (folds), train the PLS model on K-1 folds, validate it on the remaining fold, and repeat so that each fold serves once as the validation set and K-1 times as part of the training set. This is a good validation technique for assessing the robustness of your model (the same model compared across different training and validation sets) when the dataset is small; the second sketch below shows the mechanics.
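To make the hold-out idea concrete outside JMP, here is a minimal sketch in Python with scikit-learn. The dataset and all variable names are invented for illustration; this is not how JMP does it internally, just the same train/hide/compare logic:

```python
# Minimal hold-out sketch (illustrative, not JMP): fit PLS on a training
# portion, then compare predicted vs. actual on rows the model never saw.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                                  # hypothetical predictors
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=100)  # hypothetical response

# "Hide" 20% of the rows as a test set (analogous to hide & exclude in JMP)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pls = PLSRegression(n_components=2).fit(X_train, y_train)
y_pred = pls.predict(X_test).ravel()
print(f"Test R^2 on hidden rows: {r2_score(y_test, y_pred):.3f}")
```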
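And here is a minimal K-fold sketch under the same assumptions (again purely illustrative, not JMP's implementation), showing how each fold is used once for validation:

```python
# Minimal K-fold sketch (illustrative): with K=5, each fold is used once
# for validation and four times for training, so you can check how stable
# the model's performance is across different splits.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                                  # hypothetical predictors
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=100)  # hypothetical response

cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_r2 = cross_val_score(PLSRegression(n_components=2), X, y, cv=cv, scoring="r2")
print("Per-fold R^2:", np.round(fold_r2, 3))
print("Mean:", round(fold_r2.mean(), 3), " SD:", round(fold_r2.std(), 3))
```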
- I am also not sure about your second question, but if you want to know which factors are the most important in the PLS model, you can have a look at the Variable Importance Plot and the computed VIP scores. See: Variable Importance Plot (jmp.com) and VIP vs Coefficients Plots (jmp.com). The sketch below shows what the VIP computation looks like.
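JMP Pro reports VIP scores for you; purely to show what they measure, here is a sketch of the standard VIP formula applied to a fitted scikit-learn PLSRegression (synthetic data, hypothetical names; the common rule of thumb treats VIP > 1 as influential):

```python
# VIP (Variable Importance in Projection) sketch for a fitted
# sklearn PLSRegression.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls):
    """Standard VIP formula: per-variable importance, weighted by the
    share of Y-variance each PLS component explains."""
    t = pls.x_scores_    # (n, A) X scores
    w = pls.x_weights_   # (p, A) X weights
    q = pls.y_loadings_  # (q, A) Y loadings
    p, _ = w.shape
    # Y-variance explained by component a: q_a^2 * (t_a' t_a)
    ss = np.sum(q ** 2, axis=0) * np.sum(t ** 2, axis=0)  # (A,)
    w_norm = w / np.linalg.norm(w, axis=0)                # unit-norm weights
    return np.sqrt(p * (w_norm ** 2 @ ss) / ss.sum())     # (p,)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=100)
pls = PLSRegression(n_components=2).fit(X, y)
print(np.round(vip_scores(pls), 2))  # variables 0 and 1 should stand out
```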
I hope this helps!
Victor GUILLER
L'Oréal Data & Analytics
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)