Using log of variables in linear regression?

Oct 4, 2019 11:49 AM

I am a new student using JMP and have a question.

I have a dataset with several variables that are heavily skewed as evidenced by their distributions. One is my dependent variable and the others are all independent variables.

I created new columns containing the log of the dependent variable and of one of the primary independent variables.

Both of their distributions improved significantly towards a normal distribution.

As mentioned above, I have other independent variables that are also heavily skewed and am considering converting them to logs.

My question is: is it always advisable to convert heavily skewed independent variables to their logs?

Or should this be used minimally?

FYI - I will run a multiple regression analysis on the variables once I have them "prepared" (i.e., addressed the missing values and potential outliers, and created the log variables).

Thanks,

JK

4 REPLIES

Re: Using log of variables in linear regression?

The requirement of normality applies to the conditional distribution of the response, not the overall or marginal distribution. It does not apply to the predictor variables at all. Some of the skew might be non-random and explained by the model! For that reason we assess the residuals. We expect them to be normally distributed, independent, and have constant variance. There are diagnostic tools in the Fit Least Squares platform to help you assess the issue.
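As a rough illustration of this point (not part of the original reply), the Python sketch below simulates a heavily skewed predictor and a response that depends on it linearly with normal noise. The data, seed, and coefficients are all made up for the example. The marginal distribution of the response is strongly skewed, yet the residuals from the fitted line are not, because the model explains the skew.

```python
import random

def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000
# Hypothetical data: a right-skewed (lognormal) predictor ...
x = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
# ... and a response that is linear in x plus normal noise.
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"skewness of x:         {skewness(x):.2f}")
print(f"skewness of y:         {skewness(y):.2f}")
print(f"skewness of residuals: {skewness(residuals):.2f}")
```

In this simulated case no transformation is needed even though both histograms look badly skewed, which is why the diagnostics are run on the residuals rather than on the raw columns.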

There is also a tool to assess the benefit of a transformation. You evaluated the log transformation. A generalization of the power transforms is the Box-Cox Y Transformation available in the Fit Least Squares red triangle menu under profiling.
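For readers who have not met it, here is a minimal Python sketch of the textbook Box-Cox power transform, showing how the log appears as the special case λ = 0. The function name `box_cox` is made up for this example, and this is the unscaled textbook form; software implementations often rescale by the geometric mean of the response so that fits at different λ values are comparable, which this sketch does not do.

```python
import math

def box_cox(y, lam):
    """Textbook (unscaled) Box-Cox transform of a single positive value."""
    if y <= 0:
        # The power family is only defined for strictly positive responses.
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)           # limiting case as lam approaches 0
    return (y ** lam - 1.0) / lam    # lam = 1 is a shift of y, lam = 0.5 a square-root scale

# Compare a few members of the family on values spanning two orders of magnitude.
for lam in (1.0, 0.5, 0.0, -1.0):
    print(lam, [round(box_cox(v, lam), 3) for v in (1.0, 10.0, 100.0)])
```

Because the transform is undefined at zero and below, a response containing zero or negative values has to be handled some other way before this family can be applied.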

Learn it once, use it forever!

Re: Using log of variables in linear regression?

I reviewed the Box-Cox transformation capability, but I have zero and negative values, so it will not run.

Yes, I am aware of the requirement for normality of the residuals, and that they be random with a mean of zero. I have reviewed the residual distribution to verify that.

I've also reviewed the Summary of Fit and the Parameter Estimates to revise the model so that the remaining terms are significant at the 0.05 level and have low VIF values.

The range of my input variables is extreme: one variable runs from the low $ millions to $40 B. I converted it to a log, which cut my adjusted R-square value in half, but the RMSE went from over 300 down to 1.5.

I'm just wondering whether it is reasonable to apply the same approach, converting to a log variable, to the other input variables before running the regressions.

I do realize that the model may account for some of the skewness in the data without converting to log variables, so what are the decision factors for when to convert a variable or retain it as received?

Thanks.

Re: Using log of variables in linear regression?

Given the power and ease of use of JMP, my general recommendation regarding predictor or response transformations in a regression setting is this: unless you have domain expertise that suggests ahead of time that a transformation is a good idea, why muddy the waters? There's a principle I tried to follow when modeling: "The simplest model that answers the original practical problem is the 'best' model."

It's a simple enough task to start with untransformed variables, then use all the model diagnostic tests and visualizations in JMP to see how well the untransformed model is performing. Then use that knowledge to guide the transformation process, if it is needed at all for solving the original problem. I think this is congruent with @markbailey's thinking?

Re: Using log of variables in linear regression?

I am confused. How did you take the log of the response if it has non-positive values? Why can't you use the Box-Cox Y Transformation command if your response ranges from $1M to $40B?

It is difficult to compare regression statistics when the responses are different (Y versus Log Y).
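To make the comparison problem concrete, here is a small Python sketch on made-up data (not the poster's). It fits one line to Y and another to log Y, then reports RMSE in each model's own units and, for the log model, again after a naive back-transform to the original units. The helper names are hypothetical, the natural log is an assumption, and the RMSE here divides by n rather than by the residual degrees of freedom that regression software typically uses.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def rmse(actual, predicted):
    """Root mean squared error, in the units of `actual` (divides by n)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

rng = random.Random(1)  # fixed seed so the run is repeatable
x = [rng.uniform(0.0, 5.0) for _ in range(500)]
# Hypothetical response spanning several orders of magnitude, with multiplicative noise.
y = [math.exp(1.0 + 1.5 * xi + rng.gauss(0.0, 0.5)) for xi in x]
log_y = [math.log(v) for v in y]

# Model A: Y on x. RMSE is in the original units of Y.
a0, a1 = fit_line(x, y)
rmse_raw = rmse(y, [a0 + a1 * xi for xi in x])

# Model B: log Y on x. RMSE is in log units, so it is not comparable to Model A.
b0, b1 = fit_line(x, log_y)
log_pred = [b0 + b1 * xi for xi in x]
rmse_log_scale = rmse(log_y, log_pred)

# Naive back-transform (ignores retransformation bias) to put Model B in Y units.
rmse_back = rmse(y, [math.exp(p) for p in log_pred])

print(f"Model A RMSE (Y units):           {rmse_raw:.1f}")
print(f"Model B RMSE (log units):         {rmse_log_scale:.2f}")
print(f"Model B RMSE back in Y units:     {rmse_back:.1f}")
```

A drop in RMSE from hundreds to around one after logging the response mostly reflects the change of units, so any comparison between the two fits should be made on a common scale.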

You can try transforming X variables, too.

Data that varies over more than a couple of orders of magnitude generally benefits from such a transformation.

You decide whether the choices were helpful based on model bias and variance, parsimony, valid model assumptions, and empirical validation or cross-validation.

Is this model for predictive or explanatory purposes?

Learn it once, use it forever!