Level I

## Using log of variables in linear regression?

I am a new student using JMP and have a question.

I have a dataset with several variables that are heavily skewed as evidenced by their distributions.  One is my dependent variable and the others are all independent variables.

I created new columns holding the log of the dependent variable and of one of the primary independent variables.

Both of their distributions improved significantly towards a normal distribution.

As mentioned above, I have other independent variables that are also heavily skewed and am considering converting them to logs.

My question is, is it always advisable to convert the heavily skewed independent variables to log_variable?

Or should this be used minimally?

FYI - I will run a multiple regression analysis on the variables once I have them "prepared" (i.e. missing values and potential outliers addressed, log_variables created).

Thanks,

JK

Staff

## Re: Using log of variables in linear regression?

The requirement of normality applies to the conditional distribution of the response, not the overall or marginal distribution. It does not apply to the predictor variables at all. Some of the skew might be non-random and explained by the model! For that reason we assess the residuals. We expect them to be normally distributed, independent, and have constant variance. There are diagnostic tools in the Fit Least Squares platform to help you assess the issue.
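To make the point about marginal versus conditional distributions concrete, here is a minimal sketch in plain Python rather than JMP. The data are simulated and every name in it (`skewness`, `fit_line`, the coefficients 2 and 3) is made up for illustration. The predictor is heavily skewed and the errors are normal, so the response looks skewed on its own while the residuals do not.

```python
import random


def skewness(values):
    """Sample skewness: third central moment divided by variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(1)  # fixed seed so the run is repeatable
x = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]  # skewed predictor
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]   # normal errors

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"skewness of x:         {skewness(x):.2f}")
print(f"skewness of y:         {skewness(y):.2f}")
print(f"skewness of residuals: {skewness(residuals):.2f}")
```

The skew in the response here is inherited entirely from the predictor, which is why the residuals are the thing to inspect.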

There is also a tool to assess the benefit of a transformation. You evaluated the log transformation, which is one member of the family of power transformations. The Box-Cox Y Transformation, available in the Fit Least Squares red triangle menu under Factor Profiling, evaluates that whole family.
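For readers who want to see the idea behind a Box-Cox search, the sketch below is a plain Python illustration, not what JMP runs internally. It assumes a single predictor, a grid of candidate lambda values, and simulated data whose true relationship is linear on the log scale. The helper names (`box_cox`, `sse_of_line`, `profile_loglik`) are hypothetical.

```python
import math
import random


def box_cox(y, lam):
    """Box-Cox power transform; requires strictly positive y."""
    if min(y) <= 0:
        raise ValueError("Box-Cox needs strictly positive responses")
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]  # the limit of the power form as lam -> 0
    return [(v ** lam - 1.0) / lam for v in y]


def sse_of_line(x, y):
    """Residual sum of squares from a least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def profile_loglik(x, y, lam):
    """Profile log-likelihood of lambda, up to an additive constant."""
    n = len(y)
    sse = sse_of_line(x, box_cox(y, lam))
    jacobian = (lam - 1.0) * sum(math.log(v) for v in y)
    return -0.5 * n * math.log(sse / n) + jacobian


rng = random.Random(7)
x = [rng.uniform(0.0, 10.0) for _ in range(500)]
# Simulated response: linear in x on the log scale, with normal errors.
y = [math.exp(1.0 + 0.5 * xi + rng.gauss(0.0, 0.5)) for xi in x]

grid = [i / 10 for i in range(-20, 21)]  # lambda from -2.0 to 2.0
best_lambda = max(grid, key=lambda lam: profile_loglik(x, y, lam))
print("lambda with the highest profile log-likelihood:", best_lambda)
```

A best lambda near 0 points to the log, near 0.5 to the square root, and near 1 to no transformation. The `ValueError` branch reflects the standard restriction that the response must be strictly positive.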

Learn it once, use it forever!
Level I

## Re: Using log of variables in linear regression?

I reviewed the Box-Cox transformation capability, but I have zero and
negative values so it will not run.

Yes, I am aware of the requirement for normality of the residuals and that
they be random with a mean of zero.
I have reviewed the residual distribution to verify that.

I've also reviewed the Summary of Fit and the Parameter Estimates to revise
the model to get down to a 0.05 significance level and low VIF values.

Since the range of my input variables is extreme (from the low \$ millions to
\$40 B on one variable), I converted that variable to a log. This cut my
Adj R square value in half, but the RMSE went from over 300 down to 1.5.

Just wondering if it is reasonable to apply the same approach of converting
to a log variable to the other input variables before running the
regressions.

I do realize that the model may account for some of the skewness in the
data without converting to log_variables, so what are the decision factors
on when to convert the variable or retain it as received?

Thanks.
Level VI

## Re: Using log of variables in linear regression?

With the power and ease of use of JMP, my general recommendation regarding predictor or response transformations in a regression setting is this: unless you have some domain expertise that suggests a transformation is a good idea ahead of time, why muddy the waters? There's a principle I tried to follow when modeling: "The simplest model that answers the original practical problem is the 'best' model."

It's a simple enough task to start with untransformed variables, then use all the model diagnostic tests and visualizations in JMP to see how well the untransformed model is performing. Then use that knowledge to guide the transformation process, if it is needed at all with respect to solving the original problem. I think this is congruent with @markbailey 's thinking?

Staff

## Re: Using log of variables in linear regression?

I am confused. How did you take the log of the response if it has non-positive values? Why can't you use the Box-Cox Y Transformation command if your response ranges from \$1M to \$40B?

It is difficult to compare regression statistics when the responses are different (Y versus Log Y).
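A small sketch of why the two RMSE values are not comparable, and one possible way to put them on the same footing: back-transform the predictions from the log model and score them in the original units. This is plain Python on simulated data, and the names and coefficients are invented for illustration. Note that exponentiating a log-scale prediction estimates the median of Y rather than its mean, so this is a rough comparison, not a definitive one.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(actual, predicted):
    """Root mean squared error, in the units of `actual`."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


rng = random.Random(3)
x = [rng.uniform(1.0, 10.0) for _ in range(300)]
y = [math.exp(2.0 + 0.6 * xi + rng.gauss(0.0, 0.4)) for xi in x]  # simulated

# Model A: Y on X, scored in the units of Y.
a0, a1 = fit_line(x, y)
rmse_raw = rmse(y, [a0 + a1 * xi for xi in x])

# Model B: log(Y) on X, scored in log units.
log_y = [math.log(v) for v in y]
b0, b1 = fit_line(x, log_y)
log_pred = [b0 + b1 * xi for xi in x]
rmse_log_units = rmse(log_y, log_pred)

# Model B again, scored in the units of Y after back-transforming.
rmse_log_back = rmse(y, [math.exp(p) for p in log_pred])

print(f"Model A RMSE (units of Y):          {rmse_raw:.1f}")
print(f"Model B RMSE (log units):           {rmse_log_units:.2f}")
print(f"Model B RMSE (back to units of Y):  {rmse_log_back:.1f}")
```

The first and third numbers share units and can be compared; the second is small mostly because it is measured on the log scale.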

You can try transforming X variables, too.

Data that vary over more than a couple of orders of magnitude generally benefit from such a transformation.

You decide whether the choices were helpful based on model bias and variance, parsimony, valid model assumptions, and empirical validation or cross-validation.
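As one illustration of the cross-validation criterion, the sketch below compares a raw predictor against its log using 5-fold cross-validated RMSE. The response is the same in both fits, so the two scores are directly comparable. It is plain Python on simulated data spanning several orders of magnitude; `cv_rmse`, the fold count, and the coefficients are assumptions made for the example.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cv_rmse(x, y, k=5, seed=0):
    """K-fold cross-validated RMSE for a least-squares line of y on x."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        squared_errors += [(y[i] - b0 - b1 * x[i]) ** 2 for i in held_out]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


rng = random.Random(11)
# Simulated predictor spanning roughly 1e6 to 4e10.
x = [10 ** rng.uniform(6.0, 10.6) for _ in range(400)]
# Simulated response that is linear in log10(x), with normal errors.
y = [5.0 + 2.0 * math.log10(xi) + rng.gauss(0.0, 1.0) for xi in x]

rmse_raw_x = cv_rmse(x, y)
rmse_log_x = cv_rmse([math.log10(xi) for xi in x], y)

print(f"CV RMSE with raw X:   {rmse_raw_x:.2f}")
print(f"CV RMSE with log10 X: {rmse_log_x:.2f}")
```

Because the data were generated to be linear in the log of X, the log model scores better here. With real data the comparison could go either way, which is the reason to check.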

Is this model for predictive or explanatory purposes?
