Hi All,
I need some help with regards to multiple regression and would really appreciate any help! Thank you in advance :)
Here is a problem in a nutshell –
Looking forward to hearing from you.
Best Regards,
Vipul
Hello @vipul,
The "Stepwise" personality in the "Fit Model" platform may correspond to your need, since JMP will select variables that tend to create a model with minimum BIC/AICc, or that meet a p-value threshold. By clicking the red triangle of the platform, there is also the option to test "All Possible Models", where you specify the maximum number of terms (factors) in your models and the number of best models to display. JMP will then provide a table of all models tested, with the number of terms in each model, R², RMSE, AICc, and BIC, and you can run any of the tested models that look interesting enough.
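Outside JMP, the same forward-stepwise idea (adding the term that most lowers AICc until nothing improves) can be sketched in a few lines of Python; the data below is synthetic and the AICc formula is the usual small-sample correction of AIC, so treat it as an illustration rather than a replica of the platform:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(80, 4)), columns=["X1", "X2", "X3", "X4"])
    df["Y"] = 3 * df["X1"] - 2 * df["X3"] + rng.normal(size=80)

    def aicc(fit, n_params, n_obs):
        # small-sample correction added to statsmodels' AIC
        return fit.aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)

    selected, remaining, best = [], ["X1", "X2", "X3", "X4"], np.inf
    while remaining:
        trials = []
        for x in remaining:
            X = sm.add_constant(df[selected + [x]])
            fit = sm.OLS(df["Y"], X).fit()
            trials.append((aicc(fit, X.shape[1] + 1, len(df)), x))
        score, x = min(trials)
        if score >= best:
            break  # no remaining term lowers AICc, so stop
        best, selected = score, selected + [x]
        remaining.remove(x)
    print(selected)  # should recover X1 and X3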
However, I don't think this is a good idea given the multicollinearity in your dataset. If you have JMP and a dataset with highly correlated variables, you can try these options, depending on whether you are interested in explainability and/or predictability:
I think these options may be more robust/reliable than fitting all possible models and choosing the best one(s) based on R² or other metrics, since the approach you originally mentioned will drop some correlated X's that may still contain valuable information.
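(A side note, not a JMP step: you can first quantify how severe the collinearity is, for example with variance inflation factors. A minimal Python sketch on synthetic data:)

    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    rng = np.random.default_rng(0)
    n = 200
    x1 = rng.normal(size=n)
    df = pd.DataFrame({
        "X1": x1,
        "X2": x1 + 0.1 * rng.normal(size=n),  # nearly a copy of X1
        "X3": rng.normal(size=n),             # independent of the others
    })

    X = add_constant(df)
    vif = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )
    print(vif)  # rule of thumb: values above ~5-10 flag collinearity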
There are many other possible options, but they are available only in JMP Pro (penalized regression techniques, for example).
I think these first three options may already be helpful for you :)
Hi @Victor_G
Thank you for taking the time to answer. I'll be sure to use the predictor screening module right away, hoping it gives some leads.
However, do you think this will solve the multicollinearity issue? Won't (say) X1 and X2 both come up as the most likely predictors again?
Hi @vipul,
Different models take different approaches to dealing with multicollinearity.
Considering X1 and X2 to be a pair of highly correlated predictors among the others, the models will react differently:
The method you choose depends on the performance you can get from the models but, more importantly, on the domain knowledge you have on the topic. Different modeling choices may lead to different explanations and different hypotheses about the significance or importance of the predictors, so domain knowledge should guide you to the most appropriate/reasonable model.
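To make that concrete, here is a small Python sketch (synthetic data, not your dataset) of what a nearly collinear X1/X2 pair does to plain least squares, and how regressing on a principal component of the pair sidesteps it:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    n = 100
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)      # X2 nearly collinear with X1
    y = 2 * x1 + 2 * x2 + rng.normal(size=n)

    X = sm.add_constant(pd.DataFrame({"X1": x1, "X2": x2}))
    ols = sm.OLS(y, X).fit()
    print(ols.params)     # the X1/X2 estimates are individually unstable...
    print(ols.rsquared)   # ...while the overall fit stays high

    # One remedy: regress on the first principal component of the X block
    pc1 = PCA(n_components=1).fit_transform(np.column_stack([x1, x2]))
    pcr = sm.OLS(y, sm.add_constant(pc1)).fit()
    print(pcr.params)     # one stable coefficient for the shared direction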
Hope it helps you,
Thank you again, Victor. This was very helpful feedback. I tried the Stepwise Regression and PCA approach and that already helped us a lot.
Thank you once again.
Best Regards,
Vipul
Here are my thoughts: There are no statistical procedures to pick which factors to include in your study when multicollinearity is present. The decision must be one made by the SME (Subject Matter Expert). Things that may influence this include:
1. Ease or cost of managing the X's
2. Scientific basis (hypotheses) for inclusion or not (I suggest these be stated in advance rather than developed after the data is collected)
3. Impact on other Y's
Now, there are other things you can do:
1. Combine correlated X's into one X (see the sketch after this list)
2. Develop alternate response variables
3. Try different transformations (e.g., PCA), although this may cloud the scientific relationships
4. Run experiments on the factors
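On point 1, one simple way to fold two correlated X's into a single column is to average the standardized pair (a sketch with made-up data; taking their first principal component is another reasonable choice):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    n = 50
    x1 = rng.normal(10, 2, size=n)                # e.g. stretch length
    x2 = 0.5 * x1 + rng.normal(0, 0.3, size=n)    # a correlated second X
    df = pd.DataFrame({"X1": x1, "X2": x2})

    z = (df - df.mean()) / df.std()    # put both X's on a common scale
    df["X_combined"] = z.mean(axis=1)  # one column carrying their shared signal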
Hi @statman,
Thank you for taking the time to share your perspective. Could you please elaborate on how you combine multiple correlated X's?
Thank you again in advance.
Best Regards,
Vipul
David has given some platforms to try. The combining is not a statistical procedure, but an engineering/scientific procedure.
Let me give you a hypothetical example:
Imagine you are trying to determine what factors have an effect on the Distance a projectile can be launched using a rubber band (Y=distance). Two of the many variables of interest (mass of projectile, geometry of projectile, angle of launch, ambient air currents, release technique, etc.) are:
1. X1 = Length the rubber band is stretched and
2. X2 = "Spring Constant" which is varied by changing the width of the rubber band
These two variables appear correlated in the data set, and both have a direct effect on the amount of energy supplied to the system (hypothesis). Let's consider them collinear. Instead of treating them as independent variables in an experiment, one might combine them into one factor whose levels consist of:
- = short length of stretch and narrow width band
+ = long length of stretch and wide width band
This "combined factor" is then compared with many others in an experiment. Should it appear significant, subsequent experiments can be run to further understand how to most effectively manage the energy supplied to the system (this becomes a new response variable).
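For what it's worth, a minimal sketch of how such a combined two-level factor could be coded into a design table (the names and levels are the hypothetical ones above):

    import pandas as pd

    # hypothetical two-level coding of the combined factor
    levels = {
        -1: {"stretch": "short", "band_width": "narrow"},
         1: {"stretch": "long",  "band_width": "wide"},
    }
    runs = pd.DataFrame({"combined": [-1, 1, 1, -1]})   # part of a design
    settings = runs["combined"].map(levels).apply(pd.Series)
    print(pd.concat([runs, settings], axis=1))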
Thank you @statman for sharing your thoughts. This was very helpful. Much appreciated :)
Best Regards,
Vipul
You might also like to look at the Cluster Variables utility under Analyze>Clustering. The X's that are correlated will be placed in the same cluster. Then, for each cluster, you can select a 'representative variable' to include in a regression. There is even a 'launch fit model' option from the red triangle.
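For the same idea outside JMP, here is a rough Python sketch (synthetic data; not the Cluster Variables algorithm itself) that groups correlated X's by hierarchical clustering on 1 - |correlation|, so you can then keep one representative per cluster:

    import numpy as np
    import pandas as pd
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    rng = np.random.default_rng(3)
    n = 200
    x1 = rng.normal(size=n)
    df = pd.DataFrame({
        "X1": x1,
        "X2": x1 + 0.1 * rng.normal(size=n),  # should cluster with X1
        "X3": rng.normal(size=n),
    })

    dist = 1 - df.corr().abs()  # correlated variables end up close together
    Z = linkage(squareform(dist.values, checks=False), method="average")
    labels = fcluster(Z, t=0.3, criterion="distance")
    print(pd.Series(labels, index=df.columns))  # pick one X per cluster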