I have a maybe philosophical question. I am regressing about 400 data points with ~35 responses and a possibility of ~800 X's. Since I want to avoid overfitting, I only use all quadratics, then sprinkle in a few "logical" cubics. That's nowhere near 800 X's, and it gives a 0.99-ish R^2 on most Y's; however, the % error between prediction and data doesn't look so hot (stddev of % error ~10%). Then all I do is transform the Y with the natural log, using the same X's, and the R^2 inches up a bit to maybe 0.994, but the stddev of the % error looks WAY better, ~2%. That's happening for most of my Y's. They love natural log transforms, and I don't know why.
I was wondering whether the transform is just taking the place of selecting more or better combinations of cubics. Remember, I didn't add any more terms, just the transform, so I assume I'm no more "overfit" with the transform than without it.
If anyone has had this experience, knows theoretically why it's happening, or thinks I'm making a mistake, I would love to hear from you.
The natural log transform (or any nonlinear transform) compresses the distance between some of your responses and expands the distance between others, depending on where they sit on the curve. So, since the scale has changed, it isn't surprising that your error standard deviation comes out noticeably smaller (or larger). In particular, an additive error on the log scale corresponds to a multiplicative (i.e., percent) error on the original scale, so least squares on log(Y) is, to first order, minimizing percent error rather than absolute error. If your errors grow roughly in proportion to Y, that is exactly why the %-error stddev improves so much while R^2 barely moves.
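Here is a minimal simulation of that effect, under the assumption (mine, not the original poster's data) that the noise is multiplicative. The same polynomial basis is fit once to Y and once to log(Y), and only the %-error spread differs substantially:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with multiplicative (proportional) error:
# the true curve is exponential, the noise scales with Y.
x = np.linspace(0.0, 4.0, 400)
y = np.exp(1.0 + 0.5 * x) * rng.lognormal(mean=0.0, sigma=0.02, size=x.size)

def pct_error_sd(y_true, y_pred):
    """Standard deviation of the percent error between prediction and data."""
    return np.std(100.0 * (y_pred - y_true) / y_true)

# Same quadratic basis for both fits -- no extra terms added.
X = np.column_stack([np.ones_like(x), x, x**2])
beta_raw, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_log, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

sd_raw = pct_error_sd(y, X @ beta_raw)            # fit Y directly
sd_log = pct_error_sd(y, np.exp(X @ beta_log))    # fit log(Y), back-transform

print(f"raw-scale fit, %error stddev: {sd_raw:.2f}%")
print(f"log-scale fit, %error stddev: {sd_log:.2f}%")
```

The raw-scale fit spends its effort on the large-Y region where absolute residuals are biggest, leaving big percent errors at small Y; the log-scale fit spreads the percent error evenly, which mirrors the 10% vs. 2% gap described in the question.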
Transforming Y can also change whether or not your model is "overfit": because the shape of the fit has changed, you may need more or fewer terms with a transformed Y than with the original Y.
Rather than blindly selecting some (or all) of your Y variables to be transformed to a log scale, you should have a reason for the transform — for example, residual spread that grows with the fitted value, or a response that is physically constrained to be positive.
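One standard way to get such a reason is the Box-Cox procedure, which picks the power transform with the highest profile likelihood (λ near 0 says "use the log"). This is a numpy-only sketch on i.i.d. data rather than full regression residuals, with made-up lognormal data standing in for a Y that "wants" a log:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical response data; lognormal, so the log transform should win.
y = rng.lognormal(mean=2.0, sigma=0.5, size=500)

def boxcox_loglik(y, lam):
    """Profile log-likelihood of the Box-Cox parameter, assuming normal errors."""
    n = y.size
    if abs(lam) < 1e-9:
        yt = np.log(y)               # the lam -> 0 limit of the power transform
    else:
        yt = (y**lam - 1.0) / lam
    # -n/2 * log(sigma_hat^2) plus the Jacobian term (lam - 1) * sum(log y)
    return -0.5 * n * np.log(np.var(yt)) + (lam - 1.0) * np.log(y).sum()

# Grid search over candidate powers; the maximizer is the suggested transform.
lams = np.linspace(-2.0, 2.0, 401)
best = max(lams, key=lambda lam: boxcox_loglik(y, lam))
print(f"Box-Cox lambda with highest likelihood: {best:.2f}")
```

In practice you would run this on each Y (scipy.stats.boxcox does the same maximization); a λ close to 0 gives you a principled justification for the log transforms that are working so well, while a λ near 1 says to leave that Y alone.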