Hi @utkcito,
Now I understand what you originally meant by "bootstrap" -- running the bootstrap platform is a good approach, but you'll also want to run predictive models with the neural net, GenReg, and boosted trees, and even get the XGBoost add-in, to compare the different prediction formulas on a held-out validation set. When you do that, be sure to tune your settings so that you get the best fit -- the default fits are not always that good.
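If it helps, here's a rough JSL sketch of launching a few of those platforms side by side. I'm assuming JMP Pro and the Boston Housing sample table with placeholder columns just for illustration, so swap in your own table and variables; option names can also shift a bit between JMP versions, and the most reliable reference is always a saved script from each platform's red triangle menu.

```jsl
// Hedged JSL sketch: fit several predictive platforms on the same data
// so their prediction formulas can be compared on held-out rows.
// Assumes JMP Pro and the Boston Housing sample table -- swap in your
// own table and columns.
dt = Open( "$SAMPLE_DATA/Boston Housing.jmp" );

// Simple random holdback column (0 = training, 1 = validation)
dt << New Column( "Validation", Numeric, "Nominal",
	Formula( If( Random Uniform() < 0.75, 0, 1 ) )
);

bf = dt << Bootstrap Forest( Y( :mvalue ), X( :rooms, :lstat, :crim ),
	Validation( :Validation ) );
bt = dt << Boosted Tree( Y( :mvalue ), X( :rooms, :lstat, :crim ),
	Validation( :Validation ) );
nn = dt << Neural( Y( :mvalue ), X( :rooms, :lstat, :crim ),
	Validation( :Validation ) );
// Remember to tune each platform's settings; the defaults are rarely best.
```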
But one nice thing about the trees approach is that it is an aggregated approach: it uses many different decision trees to come up with a model. If factors don't contribute much, they automatically get pushed down and won't show a high contribution to the overall model. Still, because of the random starts and the decision process, the bootstrap forest and boosted tree approaches are not particularly stable -- if you run the same settings hundreds of times, you'll see that you actually get slightly different answers and formulas. That is why it's good to bootstrap the "SS" or "Portion" column in the output report (in bootstrap forest, for example) to get a better estimate of how strong that factor really is in the model.
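If you want to see that instability directly, here's a quick hedged sketch (reusing dt from above): refit under different seeds and compare the Column Contributions report each time. One caveat I'll flag as an assumption -- some platforms expose their own seed option rather than honoring the global JSL seed, so check how your version behaves.

```jsl
// Hedged sketch: refit the forest under different random seeds and
// compare the Column Contributions (SS / Portion) across the fits.
// Random Reset sets the global JSL random seed, which may or may not
// drive a given platform's internal random draws.
For( i = 1, i <= 3, i++,
	Random Reset( i );
	dt << Bootstrap Forest( Y( :mvalue ), X( :rooms, :lstat, :crim ) );
);
```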
That being said, you will want to use cross-validation methods to make sure you don't overfit. And with a relatively short data table (not that many measurements), it'll be hard to do that without a leave-one-out approach, unless you use the autovalidation method I mentioned in the previous reply. K-fold cross-validation might also be a good approach for you.
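Here's a hedged sketch of building a K-fold style indicator column by hand. I believe some platforms will treat a validation column with many levels as folds, but support varies, so treat this as an assumption and check your platform's validation options (or use Analyze > Predictive Modeling > Make Validation Column).

```jsl
// Hedged sketch: a K-fold style column (here K = 5); fold assignment
// cycles through the rows. For leave-one-out, K would equal the number
// of rows. Whether a platform accepts a multi-level fold column varies.
k = 5;
dt << New Column( "Fold", Numeric, "Nominal",
	Formula( Modulo( Row(), k ) )
);
```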
I'm not sure what your input data looks like, but it almost sounds like what JMP calls "functional data" -- spectra or other regularly spaced data where each wavelength (or x-value) is correlated with its neighbor. You might want to look into the Functional Data Explorer to generate a better subset of predictors. Or PCA is also a good approach: reduce the dimensionality of your predictors and use a smaller set of orthogonal vectors to model your data.
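For the PCA route, a minimal sketch on the same sample table; I believe the Save Principal Components message matches the red triangle option of the same name, but double-check on your version:

```jsl
// Hedged sketch: collapse correlated predictors into a few orthogonal
// scores, then model the response on the saved components instead.
pca = dt << Principal Components( Y( :crim, :indus, :nox, :rooms, :lstat ) );
pca << Save Principal Components( 3 );  // adds Prin1..Prin3 to the table
```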
As to your points:
1) Is there a theoretical (biological, chemical, physical) reason to include so many different cross terms from the degree-2 factorial approach? If you don't have a theoretical reason to include them, they aren't very meaningful and you could be introducing false predictors (see the Fit Model sketch after point 3).
2) In the PLS platform, it's important to also look at the cumulative Y response under "Percent Variation Explained" in the report. If the linear combinations of your predictors are not explaining much of the response, your % variation explained will be quite low, and a different platform will probably work out better. With so few response measurements, trying to fit such a large number of predictors will probably lead to overfitting because you don't have enough degrees of freedom. Along with PLS, you might want to review the Factor Analysis platform under Multivariate Analysis. I suspect your cytokine variables are simply too many, and you lose your degrees of freedom. Imagine fitting a line through two points, or a parabola through three: you get a perfect fit because that's exactly the minimum number of points those polynomials need, but you have no degrees of freedom left to estimate errors or fit quality. I'm thinking this is what's happening in the PLS platform (see the minimal PLS sketch after point 3).
3) The community pages are the best. I'm always checking them, especially for JSL scripts, since I code in JMP. Almost always there's a snippet of code where someone has already solved exactly what I need for my work. I also enjoy running through the questions and seeing what I can help with, especially when I get to learn something new.
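On point 1, here's the kind of lean model I mean -- a hedged Fit Model sketch with one deliberately chosen interaction instead of every degree-2 cross term (columns are placeholders from the sample table above):

```jsl
// Hedged sketch: keep only the cross terms you can justify on
// theoretical grounds, rather than the full degree-2 factorial.
fm = dt << Fit Model(
	Y( :mvalue ),
	Effects( :rooms, :lstat, :rooms * :lstat ),  // one justified interaction
	Personality( "Standard Least Squares" ),
	Run
);
```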
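And on point 2, a minimal PLS launch to go with the advice above; once it runs, open the Percent Variation Explained table in the report and check the cumulative Y column:

```jsl
// Hedged sketch: minimal PLS launch on the same placeholder columns.
// After fitting, check "Percent Variation Explained" (cumulative Y).
pls = dt << Partial Least Squares(
	Y( :mvalue ),
	X( :crim, :indus, :nox, :rooms, :lstat )
);
```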
Best of luck!
DS