I am a new student using JMP and have a question.
I have a dataset with several variables that are heavily skewed as evidenced by their distributions. One is my dependent variable and the others are all independent variables.
I created new columns with a log transformation (Log_variable) of the dependent variable and of one of the primary independent variables.
Both of their distributions moved noticeably closer to normal.
As mentioned above, I have other independent variables that are also heavily skewed and am considering converting them to logs.
My question is: is it always advisable to log-transform the heavily skewed independent variables?
Or should this be used minimally?
FYI - I will run a multiple regression analysis on the variables once I have them "prepared" (i.e., addressed the missing values and potential outliers, and created the log variables).
Thanks,
JK
The requirement of normality applies to the conditional distribution of the response, not the overall or marginal distribution. It does not apply to the predictor variables at all. Some of the skew might be non-random and explained by the model! For that reason we assess the residuals: we expect them to be normally distributed, independent, and to have constant variance. There are diagnostic tools in the Fit Least Squares platform to help you assess these assumptions.
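Outside of JMP, the same residual checks can be sketched in a few lines of Python with statsmodels. Everything below (the simulated data and the column names) is hypothetical, just to show the mechanics:

```python
# A minimal sketch of residual diagnostics after a multiple regression.
# The data frame and column names here are made up for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.uniform(1, 100, 200),
                   "x2": rng.uniform(1, 100, 200)})
df["y"] = 2.0 + 0.5 * df["x1"] + 0.1 * df["x2"] + rng.normal(0, 5, 200)

X = sm.add_constant(df[["x1", "x2"]])        # design matrix with intercept
fit = sm.OLS(df["y"], X).fit()

resid = fit.resid
print(stats.shapiro(resid))                  # normality of the residuals
# Rough constant-variance check: correlation of |residuals| with fitted values
print(stats.pearsonr(fit.fittedvalues, np.abs(resid)))
```

Note that it is the residuals being tested here, not the marginal distribution of y or of the predictors.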
There is also a tool to assess the benefit of a transformation. You evaluated the log transformation, which is one member of the family of power transforms; the Box-Cox Y Transformation, available in the Fit Least Squares red triangle menu under Factor Profiling, evaluates the whole family.
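For intuition about what that search does, here is a minimal Python sketch of the model-based Box-Cox idea (simulated data, conceptual only, not JMP's implementation): for each candidate lambda, apply the geometric-mean-normalized power transform to y, refit the regression, and keep the lambda with the smallest SSE.

```python
# Conceptual sketch of a model-based Box-Cox search over lambda.
# Simulated data; in practice you would use your own y and X.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 50, 300)
y = np.exp(0.8 + 0.05 * x + rng.normal(0, 0.3, 300))   # positively skewed y
X = sm.add_constant(x)

gm = np.exp(np.mean(np.log(y)))                        # geometric mean of y

def normalized_boxcox(y, lam):
    # Normalizing by the geometric mean makes SSEs comparable across lambdas.
    if abs(lam) < 1e-8:
        return gm * np.log(y)
    return (y**lam - 1.0) / (lam * gm**(lam - 1.0))

lambdas = np.linspace(-2, 2, 81)
sse = [np.sum(sm.OLS(normalized_boxcox(y, lam), X).fit().resid**2)
       for lam in lambdas]
print(f"best lambda ~ {lambdas[int(np.argmin(sse))]:.2f}")  # near 0 => log
```

A lambda near 0 corresponds to the log transform and lambda = 1 to leaving y untransformed, which is one way to judge whether your log was a reasonable choice.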
Given the power and ease of use of JMP, my general recommendation regarding predictor or response transformations in a regression setting is: unless you have some domain expertise that suggests a transformation is a good idea ahead of time, why muddy the waters? There's a principle I try to follow when modeling: "The simplest model that answers the original practical problem is the 'best' model."
It's a simple enough task to start with untransformed variables, then use all the model diagnostic tests and visualizations in JMP to see how well the untransformed model is performing. Then use that knowledge to guide the transformation process, if it is needed at all with respect to solving the original problem... I think this is congruent with @Mark_Bailey 's thinking?
I am confused. How did you take the log of the response if it has non-positive values? Why can't you use the Box-Cox Y Transformation command if your response ranges from $1M to $40B?
It is difficult to compare regression statistics when the responses are different (Y versus Log Y).
You can try transforming X variables, too.
Data that vary over more than a couple of orders of magnitude generally benefit from such a transformation.
You decide whether the choices were helpful based on model bias and variance, parsimony, valid model assumptions, and empirical validation or cross-validation.
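To make that Y-versus-Log-Y comparison concrete, one option is to back-transform the log-scale predictions and score both models on the original scale with held-out data. A minimal Python sketch with simulated data:

```python
# Comparing an untransformed model and a log-Y model on the same scale:
# back-transform the log-scale predictions before computing holdout RMSE.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1, 50, 400)
y = np.exp(1.0 + 0.06 * x + rng.normal(0, 0.4, 400))   # spans orders of magnitude
X = sm.add_constant(x)
train, test = slice(0, 300), slice(300, 400)

fit_raw = sm.OLS(y[train], X[train]).fit()
fit_log = sm.OLS(np.log(y[train]), X[train]).fit()

pred_raw = fit_raw.predict(X[test])
# Plain exp() back-transformation is biased low for the conditional mean;
# corrections such as Duan's smearing estimator exist, omitted here.
pred_log = np.exp(fit_log.predict(X[test]))

rmse = lambda p: np.sqrt(np.mean((y[test] - p) ** 2))
print(f"RMSE, raw-Y model: {rmse(pred_raw):.1f}")
print(f"RMSE, log-Y model: {rmse(pred_log):.1f}")
```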
Is this model for predictive or explanatory purposes?