Hi All,
I need some help with regards to multiple regression and would really appreciate any help! Thank you in advance :)
Here is a problem in a nutshell –
Looking forward to hearing from you.
Best Regards,
Vipul
Hello @vipul,
The "Stepwise" personality in the "Fit Model" platform may correspond to your need, since JMP will select variables that tend to create a model with minimum BIC/AICc, or that meet a p-value threshold. By clicking the red triangle of the platform, there is also the option to test "All Possible Models", where you specify the maximum number of terms (factors) in your models and the number of best models to display. JMP will then provide a table of all models tested, with the number of terms in each model, R², RMSE, AICc, and BIC, and you can run any of the tested models that look interesting enough.
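Outside JMP, the same forward-stepwise idea (adding the term that most lowers AICc until nothing improves) can be sketched in a few lines of Python; the data below is synthetic and the AICc formula is the usual small-sample correction of AIC, so treat it as an illustration rather than a replica of the platform:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(80, 4)), columns=["X1", "X2", "X3", "X4"])
    df["Y"] = 3 * df["X1"] - 2 * df["X3"] + rng.normal(size=80)

    def aicc(fit, n_params, n_obs):
        # small-sample correction added to statsmodels' AIC
        return fit.aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)

    selected, remaining, best = [], ["X1", "X2", "X3", "X4"], np.inf
    while remaining:
        trials = []
        for x in remaining:
            X = sm.add_constant(df[selected + [x]])
            fit = sm.OLS(df["Y"], X).fit()
            trials.append((aicc(fit, X.shape[1] + 1, len(df)), x))
        score, x = min(trials)
        if score >= best:
            break  # no remaining term lowers AICc, so stop
        best, selected = score, selected + [x]
        remaining.remove(x)
    print(selected)  # should recover X1 and X3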
However, I don't think this is a good idea given the multicollinearity in your dataset. If you have JMP and a dataset with highly correlated variables, you can try these options, depending on whether you are interested in explainability and/or predictability:
I think these options may be more robust/reliable than fitting all possible models and choosing the best one(s) based on R² or other metrics, since the approach you originally mentioned will drop some correlated X's that may still contain valuable information.
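(A side note, not a JMP step: you can first quantify how severe the collinearity is, for example with variance inflation factors. A minimal Python sketch on synthetic data:)

    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    rng = np.random.default_rng(0)
    n = 200
    x1 = rng.normal(size=n)
    df = pd.DataFrame({
        "X1": x1,
        "X2": x1 + 0.1 * rng.normal(size=n),  # nearly a copy of X1
        "X3": rng.normal(size=n),             # independent of the others
    })

    X = add_constant(df)
    vif = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )
    print(vif)  # rule of thumb: values above ~5-10 flag collinearity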
There are many other possible options, but they are available only in JMP Pro (penalized regression techniques, for example).
I think these first three options may already be helpful for you :)
Hi @Victor_G
Thank you for taking the time to answer. I'll be sure to use the predictor screening module right away, hoping it gives some leads.
However, do you think this will solve the multicollinearity issue? Won't (say) X1 and X2 both come up as the most likely predictors again?
Hi @vipul,
Different models take different approaches to dealing with multicollinearity.
Considering X1 and X2 to be a pair of highly correlated predictors among the others, the models will react differently:
The method you choose depends on the performance you can get from the models but, more importantly, on the domain knowledge you have on the topic. Different modeling choices may lead to different explanations and different hypotheses about the significance or importance of the predictors, so domain knowledge should guide you to the most appropriate/reasonable model.
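To make that concrete, here is a small Python sketch (synthetic data, not your dataset) of what a nearly collinear X1/X2 pair does to plain least squares, and how regressing on a principal component of the pair sidesteps it:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    n = 100
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)      # X2 nearly collinear with X1
    y = 2 * x1 + 2 * x2 + rng.normal(size=n)

    X = sm.add_constant(pd.DataFrame({"X1": x1, "X2": x2}))
    ols = sm.OLS(y, X).fit()
    print(ols.params)     # the X1/X2 estimates are individually unstable...
    print(ols.rsquared)   # ...while the overall fit stays high

    # One remedy: regress on the first principal component of the X block
    pc1 = PCA(n_components=1).fit_transform(np.column_stack([x1, x2]))
    pcr = sm.OLS(y, sm.add_constant(pc1)).fit()
    print(pcr.params)     # one stable coefficient for the shared direction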
Hope it helps you,
Thank you again, Victor. This was very helpful feedback. I tried the Stepwise Regression and PCA approach and that already helped us a lot.
Thank you once again.
Best Regards,
Vipul
Here are my thoughts: There are no statistical procedures to pick which factors to include in your study when multicollinearity is present. The decision must be one made by the SME (Subject Matter Expert). Things that may influence this include:
1. Ease or cost of managing the X's
2. Scientific basis (hypotheses) for inclusion or not (I suggest these be stated in advance rather than developed after the data is collected)
3. Impact on other Y's
Now, there are other things you can do:
1. Combine correlated X's into one X (see the sketch after this list)
2. Develop alternate response variables
3. Try different transformations (e.g., PCA), although this may cloud the scientific relationships
4. Run experiments on the factors
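On point 1, one simple way to fold two correlated X's into a single column is to average the standardized pair (a sketch with made-up data; taking their first principal component is another reasonable choice):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    n = 50
    x1 = rng.normal(10, 2, size=n)                # e.g. stretch length
    x2 = 0.5 * x1 + rng.normal(0, 0.3, size=n)    # a correlated second X
    df = pd.DataFrame({"X1": x1, "X2": x2})

    z = (df - df.mean()) / df.std()    # put both X's on a common scale
    df["X_combined"] = z.mean(axis=1)  # one column carrying their shared signal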
Hi @statman,
Thank you for taking the time to share your perspective. Could you please elaborate on how you combine multiple correlated X's?
Thank you again in advance.
Best Regards,
Vipul
David has given some platforms to try. The combining is not a statistical procedure, but an engineering/scientific procedure.
Let me give you a hypothetical example:
Imagine you are trying to determine what factors have an effect on the Distance a projectile can be launched using a rubber band (Y=distance). Two of the many variables of interest (mass of projectile, geometry of projectile, angle of launch, ambient air currents, release technique, etc.) are:
1. X1 = Length the rubber band is stretched and
2. X2 = "Spring Constant" which is varied by changing the width of the rubber band
These two variables appear correlated in the data set, and both have a direct effect on the amount of energy supplied to the system (hypothesis). Let's consider them collinear. Instead of treating them as independent variables in an experiment, one might combine them into one factor whose levels consist of:
- = short length of stretch and narrow width band
+ = long length of stretch and wide width band
This "combined factor" is then compared with many others in an experiment. Should it appear significant, subsequent experiments can be run to further understand how to most effectively manage the energy supplied to the system (this becomes a new response variable).
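For what it's worth, a minimal sketch of how such a combined two-level factor could be coded into a design table (the names and levels are the hypothetical ones above):

    import pandas as pd

    # hypothetical two-level coding of the combined factor
    levels = {
        -1: {"stretch": "short", "band_width": "narrow"},
         1: {"stretch": "long",  "band_width": "wide"},
    }
    runs = pd.DataFrame({"combined": [-1, 1, 1, -1]})   # part of a design
    settings = runs["combined"].map(levels).apply(pd.Series)
    print(pd.concat([runs, settings], axis=1))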
Thank you @statman for sharing your thoughts. This was very helpful. Much appreciated :)
Best Regards,
Vipul
You might also like to look at the Cluster Variables utility under Analyze>Clustering. The X's that are correlated will be placed in the same cluster. Then, for each cluster, you can select a 'representative variable' to include in a regression. There is even a 'launch fit model' option from the red triangle.
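For the same idea outside JMP, here is a rough Python sketch (synthetic data; not the Cluster Variables algorithm itself) that groups correlated X's by hierarchical clustering on 1 - |correlation|, so you can then keep one representative per cluster:

    import numpy as np
    import pandas as pd
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    rng = np.random.default_rng(3)
    n = 200
    x1 = rng.normal(size=n)
    df = pd.DataFrame({
        "X1": x1,
        "X2": x1 + 0.1 * rng.normal(size=n),  # should cluster with X1
        "X3": rng.normal(size=n),
    })

    dist = 1 - df.corr().abs()  # correlated variables end up close together
    Z = linkage(squareform(dist.values, checks=False), method="average")
    labels = fcluster(Z, t=0.3, criterion="distance")
    print(pd.Series(labels, index=df.columns))  # pick one X per cluster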