mjz5448
Level III

Do people ever use PLS on historical data for variable selection to find important variables as inputs for DOE?

I've seen several JMP webinars where they use Bootstrap Forest to find the VIP variables, typically followed by Elastic Net or Stepwise regression to further whittle down the terms that might be important for future DOEs.

 

I'm curious if anyone uses PLS regression in the process - maybe after Bootstrap Forest, to further identify important or correlated variables before doing Elastic Net, etc. - or does the fact that Elastic Net keeps/removes sets of correlated variables negate the need for this?

 

Just wondering if anyone uses this in their process? I've seen a few webinars where PLS is used to mine biological process data.

4 REPLIES

Re: Do people ever use PLS on historical data for variable selection to find important variables as inputs for DOE?

Hi @mjz5448 ,

 

As with all statistical approaches, the simplest answer is 'it depends'. You are right that Elastic Net already performs variable selection efficiently, so you may not need or want to apply PLS on top of it - the other advantage is that the model coefficients in an Elastic Net are directly interpretable, so you can relate your predictors directly to your response variable. You also need to consider the purpose of PLS, which is primarily to produce a predictive model, not an explanatory one (you could use it that way, albeit with an added level of complexity due to the loadings and the linear combinations of all of the factors). For selecting DOE factors, you want to use an approach better suited to explanatory rather than predictive modelling.
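To make the variable-selection point concrete, here is a minimal sketch in Python/scikit-learn rather than JMP (the data file and the "Yield" column name are hypothetical) of how a cross-validated Elastic Net zeroes out unimportant terms while leaving directly interpretable coefficients for the rest:

```python
# Hedged sketch: Elastic Net variable selection on historical data.
# Assumes a hypothetical table "historical_process_data.csv" with a "Yield" response.
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("historical_process_data.csv")
X_cols = df.columns.drop("Yield")
X = StandardScaler().fit_transform(df[X_cols])   # standardize so coefficients are comparable
y = df["Yield"].to_numpy()

# Cross-validated Elastic Net: l1_ratio mixes Lasso (1.0) and Ridge (0.0) penalties
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 1.0], cv=5).fit(X, y)

# Non-zero standardized coefficients point to candidate factors for a follow-up DOE
coefs = pd.Series(enet.coef_, index=X_cols)
print(coefs[coefs != 0].sort_values(key=abs, ascending=False))
```

The surviving non-zero terms are natural candidates to carry into a screening design.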

 

Hope that helps, I'm sure others have some valuable discussion to add.

 

Thanks,

Ben

“All models are wrong, but some are useful”
P_Bartell
Level VIII

Re: Do people ever use PLS on historical data for variable selection to find important variables as inputs for DOE?

Years ago when I worked at a large chemical company that was very good at coating materials on a substrate (think photographic film and paper) we used PLS to help inform designed experiments all the time for the following reasons:

 

1. Very small number of runs in manufacturing with lots of input factors. Think wide and short data sets - tailor-made for PLS (see the sketch after this list).

2. Lots of multicollinearity between the factors.

3. True process understanding has been lost to the ages...some of these products we've been making for scores of years and the usual mantra in manufacturing is 'just make it like last time'...because we really don't know how it all works.

4. Penalized regression and tree-based variable selection methods were not yet in popular use in software.

5. And the thing that motivates the whole endeavor is "The magic of physics, chemistry or biology didn't come together last night when we tried to make (fill in the blank). And we have no idea of root cause(s)."
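As an illustration of points 1 and 2, here is a minimal sketch (Python/scikit-learn rather than JMP, with purely synthetic data) of PLS being fit to a wide-and-short, highly collinear data set where ordinary least squares would not be usable:

```python
# Hedged sketch: PLS on a "wide and short" data set (far fewer runs than factors).
# All data here are synthetic; no real process data are implied.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n_runs, n_factors = 15, 40                      # 15 manufacturing runs, 40 correlated factors
latent = rng.normal(size=(n_runs, 3))           # a few hidden drivers
X = latent @ rng.normal(size=(3, n_factors)) + 0.1 * rng.normal(size=(n_runs, n_factors))
y = latent @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n_runs)

# PLS compresses the collinear factors into a few latent components, so it can
# still be estimated when n < p
pls = PLSRegression(n_components=3).fit(X, y)
print("Training R^2:", round(pls.score(X, y), 3))
```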

 

So we used PLS to try to identify key factors, then go into pilot/small-scale facilities to run designed experiments as a means of gaining process understanding so we could identify root causes of systemic failure modes. Two of the engineers I worked with had some sayings that resonated with me (as the statistician on many of these problem-solving teams): "Sometimes knowing an 'is not' is just as valuable as knowing an 'is'." And the other was, "Until you can turn a failure mode on and off... you don't know root cause."

 

In short it was the combination of PLS and DOE that got us what we were after. Hope this helps?

Victor_G
Super User

Re: Do people ever use PLS on historical data for variable selection to find important variables as inputs for DOE?

Hi @mjz5448,

I completely agree with @P_Bartell about the use of PLS, mostly in situations involving collinearity/correlation between inputs and outputs, where the number of input variables is higher than the number of experiments. Partial Least Squares is used to develop models using the correlations between the Ys and the Xs.
PLS is a model often used in chemometrics for analyzing spectral data, where you have a very high number of correlated Xs (the intensity at each wavelength, for example) and few observations/experiments with which to predict some responses. In this case, you leverage the correlations between the Xs to obtain a good predictive model through PLS (with an appropriate validation and testing methodology). Explanation/interpretability is also possible with this model type, and in the chemometrics context you could use the Variable Importance Plot and other graphs to identify important wavelength bands/areas related to specific chemical functions/groups/molecules. The Profiler available in this platform also helps you interactively visualize the importance of the variables, and you can calculate Sobol sensitivity indices in the Profiler through the option "Assess Variable Importance".
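For readers outside JMP, here is a minimal sketch of the same idea in Python/scikit-learn: the VIP (Variable Importance in Projection) scores behind a variable importance plot can be computed from a fitted PLS model, with the usual VIP > 1 rule of thumb flagging candidate variables. The data below are synthetic.

```python
# Hedged sketch: VIP scores from a fitted scikit-learn PLSRegression (synthetic data).
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls):
    """Variable Importance in Projection for each X column of a fitted PLSRegression."""
    t, w, q = pls.x_scores_, pls.x_weights_, pls.y_loadings_
    p = w.shape[0]
    ssy = np.sum(t ** 2, axis=0) * np.sum(q ** 2, axis=0)   # Y variance credited to each component
    w_norm = w / np.linalg.norm(w, axis=0)
    return np.sqrt(p * (w_norm ** 2 @ ssy) / ssy.sum())

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 12))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=30)         # only columns 0 and 1 are truly active

pls = PLSRegression(n_components=2).fit(X, y)
vip = vip_scores(pls)
print("Columns with VIP > 1:", np.where(vip > 1.0)[0])
```

Variables with VIP above 1 would be the ones to examine further before committing them to a design.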

The use of any modeling method should be linked to your level of prior knowledge, your objectives, and your experimental context.
In very early screening stages with little prior knowledge and few experimental runs, the objective is often to identify important effects, no matter the precision of the effect estimates (the effects can then be confirmed and refined by collecting informative data through DoE). In such situations, penalized regression (Lasso, Ridge, Elastic Net, Dantzig selector) is often useful for identifying important effects, at the price of increased bias in the effect estimates. Note that the choice of penalization method depends on your assumptions about effect sparsity: Lasso does variable selection by forcing the estimates of some correlated variables to 0, which can be good if you expect a small number of important predictors among a large number of candidates. A less "aggressive" approach is Ridge, which shrinks the size of correlated effects without setting any estimate to 0 (appropriate when multiple correlated predictors have similar importance), or Elastic Net estimation, which combines the Lasso and Ridge penalizations.
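As a small illustration of those differences (Python/scikit-learn rather than JMP, with synthetic data), the sketch below fits all three penalties to a group of three highly correlated predictors plus noise columns:

```python
# Hedged sketch: how Lasso, Ridge and Elastic Net handle correlated predictors (synthetic data).
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(2)
z = rng.normal(size=(100, 1))
X = np.hstack([z + 0.05 * rng.normal(size=(100, 3)),   # three highly correlated columns
               rng.normal(size=(100, 5))])             # five pure-noise columns
y = z[:, 0] + 0.1 * rng.normal(size=100)

for name, model in [("Lasso", Lasso(alpha=0.1)),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("ElasticNet", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    coefs = model.fit(X, y).coef_
    # Lasso tends to keep one of the correlated trio and zero the others;
    # Ridge shrinks all three without zeroing; Elastic Net sits in between.
    print(f"{name:>10}:", np.round(coefs, 2))
```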

If you have more observations/experiments, you could try other feature selection methods that do not rely on bias/penalization, such as Random Forest (or the Predictor Screening platform based on it), using the feature importances (column contributions) to identify the most important/active features. You could also try the Predictor Selection Assistant add-in, which helps select variables using an upgraded version of Predictor Screening that combines the Boruta feature selection method with Random Forest.
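Outside JMP, the same idea looks roughly like this (a hedged Python/scikit-learn sketch on synthetic data; Boruta itself is not implemented here, only plain Random Forest importances):

```python
# Hedged sketch: Random Forest feature importances as a rough analogue of
# Predictor Screening / column contributions (synthetic data, hypothetical column names).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(200, 10)), columns=[f"X{i+1}" for i in range(10)])
y = 2 * X["X1"] - X["X3"] + 0.5 * rng.normal(size=200)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))   # X1 and X3 should rank near the top
```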

You can also check the related conversations in these topics: 

Multiple regression with correlated variables

What is the best JMP platform for feature selection?
Is there an equivalent to multicollinearity for categorical variables in JMP? 
My recommendation would be to test, compare and (possibly) combine the different modeling approaches above (and more), and compare the obtained models to see which terms are included and how similar/different the models are. This could greatly help you in understanding what matters most.

Hope this answer will help you,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)
statman
Super User

Re: Do people ever use PLS on historical data for variable selection to find important variables as inputs for DOE?

There is no single correct tool, nor one correct sequence of tools, that will provide insight in every situation. No doubt each of us has our own bias as to which tool is most effective or efficient in a given situation. Regardless, the answer does not lie in the tool. The key is critical thinking, which requires iteration.

 

One might start with science- or engineering-based hypotheses (e.g., thermodynamics, entropy). If you have no hypotheses, then perhaps "data mining" can be useful. I use the term quite liberally: basically, looking at data to see patterns (or the lack thereof). This, of course, is best done graphically.

There are many forms of regression that attempt to quantify relationships (e.g., OLS, stepwise, PCR, ridge, PCA, PLS, etc.). If they prompt the investigator to explain the results, this can be quite useful (e.g., why does this factor seem significant, and why does that one not?). This is the development of hypotheses (perhaps to be evaluated in designed experiments).

 

Just some words of caution when using historical or observational data:

1. If there is a factor that has been deemed critical at some time in the past (by whatever means), there might be controls on this factor that do not permit it to vary much. Analysis of said factor might show it to be insignificant in any regression analysis (because it didn't vary much); the small simulation after this list illustrates the point.

2. Hidden confounding may be present (x's that were not identified or labeled in the data set).

3. The ability to model what happened does not mean we can predict what will happen.

4. Extrapolation of the model outside the range of collected x’s, without understanding, could be hazardous.

5. Regression used with historical data does not consider the context under which the data were acquired (e.g., do you know the measurement uncertainty?).

6. Results can be easily affected by "bad" data points.
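To illustrate caution 1, here is a minimal sketch (Python/statsmodels, purely synthetic data, not from any real process) in which a genuinely critical factor is held almost constant by process controls and consequently looks statistically insignificant in a regression on the historical records:

```python
# Hedged sketch: a tightly controlled factor with a large true effect can look
# insignificant in historical data simply because it was never allowed to vary.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
x_controlled = rng.normal(0, 0.01, n)   # critical factor, held nearly constant by process controls
x_free = rng.normal(0, 1.0, n)          # factor allowed to vary normally
y = 5.0 * x_controlled + 1.0 * x_free + rng.normal(0, 1.0, n)

X = sm.add_constant(np.column_stack([x_controlled, x_free]))
fit = sm.OLS(y, X).fit()
print(fit.summary())   # the x_controlled coefficient carries a huge standard error / large p-value
```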

"All models are wrong, some are useful" G.E.P. Box
