Re: Model: Are high VIF an issue in all cases?

Elofar · Dec 18, 2019 09:35 AM

Here's the situation: I am trying to build a model to express a particular Y using 25 parameters in X. All of my X data comes from a single technology, which analyzes one sample at 25 frequencies (my X), scanning across frequencies to gain maximum information. In other words, for each single Y value, I have 25 X values.

But the issue is that when I build models, I obtain huge VIFs because my X parameters are inherently collinear: they are just frequencies of the same technology, so they evolve the same way ...

And if I do as I learned, meaning removing parameters one by one based on the highest VIF, I end up with a single frequency left, which defeats the purpose of the scanning technology.

In this particular case, is the model really biased by these collinearities, or can we trust it?

@alexbeck maybe?

statman · Dec 18, 2019 10:03 AM

I don't think I understand the issue completely. It sounds like you have 25 levels of one X (frequency)....so you are creating a model with only one X in the model? You would have many degrees of freedom to estimate a rather complex non-linear model, but there is only one predictor in the equation.

High VIFs (>10 excessive, >5 should be evaluated) are only an indication that your model should be re-evaluated (the model's usefulness may be overstated). Yes, VIF is a measure of multicollinearity, but the high values don't necessarily indicate what is collinear. Working that out usually requires subject matter expertise to understand and appropriately compensate for.
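A minimal sketch of how a VIF is computed, written in Python with NumPy rather than JMP. The function name `vif` and the synthetic data are assumptions made for illustration (five "frequency" columns driven by one shared signal, compared with five independent columns); they are not the poster's data.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all the other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        out[j] = ss_tot / ss_res  # algebraically equal to 1 / (1 - R_j^2)
    return out

# Synthetic stand-in for a frequency scan (assumption): every column is the
# same underlying signal plus a little independent noise.
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=n)
X_scan = np.column_stack([latent + 0.05 * rng.normal(size=n) for _ in range(5)])
X_indep = rng.normal(size=(n, 5))  # unrelated columns for comparison

vif_scan = vif(X_scan)
vif_indep = vif(X_indep)
print("VIF, scan-like columns:  ", np.round(vif_scan, 1))
print("VIF, independent columns:", np.round(vif_indep, 2))
```

The scan-like columns should show VIFs far above the usual thresholds, while the independent columns stay close to 1. Note that a large VIF says a column is well predicted by the others, not which of the others are responsible.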

"All models are wrong, some are useful" G.E.P. Box

Elofar · Dec 19, 2019 02:23 AM

I'm sorry if it wasn't clear enough, let's try again:
It is not exactly 25 levels of X, but 25 different X coming from the same technology. The technology scans various frequencies, and for each one it gives one value, which is one of my X.
And when I mention high VIFs, I am talking about values in the hundreds and thousands ...

Dan_Obermiller · Dec 18, 2019 10:04 AM

This sounds like you might actually want to use the Functional Data Explorer. For each Y you have a range of frequencies that defines a profile. That is exactly what functional data explorer is meant to analyze.

However, if you want to use regression, high VIFs are not really much of an issue if you do NOT remove any terms from the full model (the model with all of the X's in the model) and you are interested in just using the model for prediction.

Since it sounds like you are more interested in which frequencies are most important (by removing non-significant terms with high VIFs), you will still be running the risk of the high VIFs causing issues. Your approach is probably the best if you stick with regression, but remember that the high VIFs mean that the variance on the parameter estimates is inflated. That also means that the parameter estimates are terribly unstable. Removing one X from the model could dramatically alter the parameter estimates and significance testing of the other parameters. The model you end up with may not be the best.
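A rough simulation of that distinction between stable predictions and unstable parameter estimates, in Python with NumPy. Everything here is an assumption for illustration: the synthetic collinear X, the "true" coefficients, and the noise level. The same X is refit against many fresh noise draws of Y, and the spread of the coefficients is compared with the spread of the fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5

# Collinear predictors (assumption): one shared signal plus small noise.
latent = rng.normal(size=n)
X = np.column_stack([latent + 0.05 * rng.normal(size=n) for _ in range(p)])
A = np.column_stack([np.ones(n), X])  # design matrix with intercept

# Hypothetical truth: only the first column drives Y.
true_beta = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])

coefs, fits = [], []
for _ in range(200):
    y = A @ true_beta + 0.5 * rng.normal(size=n)  # fresh response noise
    b, *_ = np.linalg.lstsq(A, y, rcond=None)     # full model, no terms removed
    coefs.append(b)
    fits.append(A @ b)

# Standard deviation of each slope across refits (intercept excluded).
coef_sd = np.std(coefs, axis=0)[1:]
# Average standard deviation of the fitted value at each observation.
fit_sd = np.std(fits, axis=0).mean()

print("SD of each slope across refits:", np.round(coef_sd, 2))
print("Average SD of fitted values:   ", round(float(fit_sd), 3))
```

In this setup the slopes should swing widely from one refit to the next even though the fitted values barely move, which is why the full model can still predict well while term-by-term significance tests are hard to trust.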

If you really need to identify the important features of these profiles, please look at Functional Data Explorer. If that is not an option, you should consider a modeling technique that is designed to handle multicollinearity. PLS and PCR are just two such techniques.

Dan Obermiller

Elofar · Dec 19, 2019 02:26 AM

I had no idea about that tool, Functional Data Explorer. I'll look into it, thanks a lot for this tip!

Indeed, among my 25 frequencies, only a few are really relevant to explain my Y, and therefore during modeling I used stepwise to remove the non-significant ones. If I don't, I obtain a model with something like 18 non-significant parameters and only 6 significant ... Does that really make sense?

I'll also check PLS, that's a good idea. Thank you for your help!

Mark_Bailey · Dec 18, 2019 10:05 AM

I have seen a common "rule of thumb" for VIF: below 10 is good, above 10 is bad. This rule is unfortunately useless and should not be followed.

In general, a tolerable VIF depends on the relative size of the parameter estimate and the response variance. If your response changes are huge and your variance is small, you can tolerate very large VIF. On the other hand, if your response changes are small and your variance is large, then a small VIF might be too much.
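One way to make this trade-off concrete is the standard least-squares identity for the variance of a coefficient. The symbols are assumptions introduced here, not notation from the thread: $R_j^2$ is the R-square from regressing $x_j$ on the other X, $s_j^2$ is the sample variance of $x_j$, $\sigma^2$ is the error variance, and $n$ is the number of observations.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\operatorname{Var}\big(\hat\beta_j\big) = \frac{\sigma^2}{(n-1)\,s_j^2}\,\mathrm{VIF}_j,
\qquad
t_j = \frac{\hat\beta_j}{\hat\sigma\,\sqrt{\mathrm{VIF}_j \,/\, \big((n-1)\,s_j^2\big)}}
```

The standard error grows with $\sqrt{\mathrm{VIF}_j}$, so a VIF of 400 multiplies it by 20. Whether that matters depends on how large $\hat\beta_j$ is relative to $\hat\sigma$: a strong effect with little noise can survive that inflation, while a weak effect with a lot of noise may be lost at a much smaller VIF.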

I wonder if you could use another regression method. For example, PLS is very successful where the X are the domain of some spectra, like wavelength or molecular weight. These X are highly correlated. The PLS regression model exploits this information instead of penalizing you with collinearity.
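A minimal sketch of single-response PLS (the PLS1 / NIPALS form) in Python with NumPy, as an illustration of the idea rather than of what JMP does internally. The function name `pls1_fit`, the two-factor synthetic "scan", and the choice of two components are all assumptions; in practice the number of components is usually chosen by cross-validation.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS; return (intercept, beta) on the original X scale."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean          # centered working copies
    p = X.shape[1]
    W = np.zeros((p, n_components))          # weight vectors
    P = np.zeros((p, n_components))          # X loadings
    q = np.zeros(n_components)               # y loadings
    for k in range(n_components):
        w = Xk.T @ yk                        # direction of max covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                           # scores for this component
        tt = t @ t
        P[:, k] = Xk.T @ t / tt
        q[k] = yk @ t / tt
        W[:, k] = w
        Xk = Xk - np.outer(t, P[:, k])       # deflate X
        yk = yk - q[k] * t                   # deflate y
    beta = W @ np.linalg.solve(P.T @ W, q)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Synthetic scan (assumption): 25 correlated "frequencies" driven by 2 factors.
rng = np.random.default_rng(2)
n, p = 150, 25
theta = np.linspace(0.0, np.pi / 2, p)
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = (np.outer(f1, np.cos(theta)) + np.outer(f2, np.sin(theta))
     + 0.05 * rng.normal(size=(n, p)))
y = 2.0 * f1 - 1.0 * f2 + 0.3 * rng.normal(size=n)

intercept, beta = pls1_fit(X, y, n_components=2)
y_hat = intercept + X @ beta
r2 = 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print("R-square with 2 PLS components:", round(float(r2), 3))
```

All 25 columns stay in the model. The collinearity is absorbed into a couple of components, so nothing has to be dropped on the basis of VIF.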

If you have JMP Pro, then you could also treat the X as functional data and save the functional principal components. These FPCs would be used in place of X in your regression model of Y.

What do you think?

statman · Dec 18, 2019 12:47 PM

As Mark indicates, "Rules of Thumb" should always be used with caution. In building an appropriate model, a number of techniques are used (e.g., assessing the assumptions (NID(0, σ²)), comparing R-square with adjusted R-square, testing for outliers, residual analysis, etc.), of which VIF is only one.

"All models are wrong, some are useful" G.E.P. Box

Elofar · Dec 19, 2019 02:30 AM

Absolutely. Everything else in the model is quite OK: outliers were checked with studentized residuals, the adjusted R2 is very good, etc. ... Just the VIFs are crazy.

Elofar · Dec 19, 2019 02:28 AM

I forgot to mention, but when I speak about high VIFs, I mean values in the hundreds and thousands ... If my variance is still not that high, can I tolerate such values?
PLS was mentioned previously indeed, I'll try that!
Unfortunately I don't have JMP Pro ...

cwillden · Dec 18, 2019 03:33 PM

There are a lot of techniques you could use in this situation. Regularized models like ridge regression or neural nets can overcome collinearity pretty well. You can also use dimensionality reduction techniques like principal components and do regression on the component scores.
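A small sketch of two of these options, ridge regression and principal components regression, in Python with NumPy. The helper names (`standardize`, `ridge_coefs`, `pcr_coefs`, `r_square`), the synthetic 25-column data, the penalty `alpha=10.0`, and the choice of two components are all assumptions for illustration; real use would tune the penalty and the number of components by cross-validation.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def ridge_coefs(Z, yc, alpha):
    """Closed-form ridge solution: (Z'Z + alpha * I)^-1 Z'y."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + alpha * np.eye(p), Z.T @ yc)

def pcr_coefs(Z, yc, n_components):
    """Regress y on the leading principal component scores of Z,
    then map the result back to coefficients on the columns of Z."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    k = n_components
    gamma = (U[:, :k].T @ yc) / s[:k]   # least squares on the scores U_k * s_k
    return Vt[:k].T @ gamma

def r_square(Z, yc, b):
    return 1.0 - ((yc - Z @ b) ** 2).sum() / (yc ** 2).sum()

# Synthetic scan (assumption): 25 correlated columns driven by 2 factors.
rng = np.random.default_rng(3)
n, p = 120, 25
theta = np.linspace(0.0, np.pi / 2, p)
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = (np.outer(f1, np.cos(theta)) + np.outer(f2, np.sin(theta))
     + 0.05 * rng.normal(size=(n, p)))
y = 2.0 * f1 - 1.0 * f2 + 0.3 * rng.normal(size=n)

Z, yc = standardize(X), y - y.mean()
b_ols_like = ridge_coefs(Z, yc, alpha=1e-8)  # essentially unpenalized
b_ridge = ridge_coefs(Z, yc, alpha=10.0)
b_pcr = pcr_coefs(Z, yc, n_components=2)

for name, b in [("near-OLS", b_ols_like), ("ridge", b_ridge), ("PCR", b_pcr)]:
    print(f"{name:9s} coef norm = {np.linalg.norm(b):8.3f}   "
          f"R-square = {r_square(Z, yc, b):.3f}")
```

The point of the comparison is that the penalized and reduced-dimension fits should give up very little R-square while producing much smaller, steadier coefficients than the nearly unpenalized fit.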

-- Cameron Willden