
When to use data transformations

blip555555
Level I

Good afternoon, 

I am constructing an LMM and have created a QQ plot (attached as image 1). The conditional residuals seem to deviate from normality a fair bit, so I transformed the response variable using log base 10. This improved my R^2 by about 2%, and the QQ plot seems a bit better (image 2). However, I'm not sure whether these slight improvements are worth the transformation, as I would then have to back-transform the results to report them in my thesis, and the effect would be a fold-change instead of an actual arithmetic difference.

Thank you!

2 REPLIES
statman
Super User


Re: When to use data transformations

I remember one of my instructors saying, "The only reason to do data transformation is to simplify the model". BTW, that was G.E.P. Box. I suggest you read his papers on the subject.

 

Box, G.E.P. and Tidwell, Paul (1962), "Transformation of the Independent Variables", Technometrics, Vol. 4, No. 4, November

Also see this paper and its attached discussions:

Draper, Norman and Hunter, William (1969), "Transformations: Some Examples Revisited", Technometrics, Vol. 11, No. 1, February

"All models are wrong, some are useful" G.E.P. Box
Victor_G
Super User


Re: When to use data transformations

Hi @blip555555,

 

It's very difficult (if not impossible) to help you without an (anonymized) dataset that reproduces the situation you're facing. Please read the post Getting correct answers to correct questions quickly.

 

To assess whether a transformation is needed, it's important to look at the residual plots and check whether a pattern remains in the residuals that the assumed model does not handle. Are you seeing heteroscedasticity, or strange patterns in your residuals? You can look at Regression Model Assumptions | Introduction to Statistics | JMP for more information.
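As a rough illustration of the heteroscedasticity check, here is a minimal Python sketch (not JMP, and not a mixed model). It assumes simulated data with multiplicative noise, fits a plain straight line, and compares the residual spread in the upper and lower halves of the fitted values. The data, the helper names `fit_line` and `spread_ratio`, and the half-split comparison are all assumptions made for the example; a ratio far from 1 only hints at non-constant variance (or a misspecified mean) and is no substitute for looking at the residual plot.

```python
import math
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def spread_ratio(x, y):
    """Residual SD in the upper half of fitted values divided by the SD in the lower half."""
    a, b = fit_line(x, y)
    # Pair each fitted value with its residual, then sort by fitted value.
    pairs = sorted((a + b * xi, yi - (a + b * xi)) for xi, yi in zip(x, y))
    half = len(pairs) // 2
    low = [resid for _, resid in pairs[:half]]
    high = [resid for _, resid in pairs[half:]]
    return statistics.pstdev(high) / statistics.pstdev(low)

# Hypothetical data: the noise is multiplicative, so it is additive on the log10 scale.
random.seed(0)
x = [random.uniform(0, 10) for _ in range(400)]
y = [10 ** (0.5 + 0.15 * xi + random.gauss(0, 0.15)) for xi in x]
log_y = [math.log10(yi) for yi in y]

ratio_raw = spread_ratio(x, y)      # residual spread grows with the fitted value
ratio_log = spread_ratio(x, log_y)  # spread is roughly constant after the log
print(f"residual spread ratio, raw scale:   {ratio_raw:.2f}")
print(f"residual spread ratio, log10 scale: {ratio_log:.2f}")
```

In this simulated case the raw-scale ratio is well above 1 and the log-scale ratio is close to 1, which is the kind of pattern that would justify a transformation or a different response distribution.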

 

I would also distinguish data transformation from Generalized Linear Mixed Models (GLMM) in JMP Pro, where the response distribution can be specified, which lets you fit models with different response distributions (normal, exponential, gamma, ...).

  • Applying a non-linear transformation (e.g., log, inverse) to the dependent variable may normalize the residuals, but it also distorts the ratio-scale properties of the measured variable. The transformation affects both the mean response and its variance, so the error term can be greatly inflated.
  • Fitting a GLM with a link function lets you stay on the original scale of the data: the link function transforms the mean into a linear function of the predictor variables, and a variance function accounts for variance heterogeneity in the analysis rather than trying to transform it away (for example, through a log transform). The link function therefore affects the mean response but not the response variance, so the error stays on the original scale.
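To make the scale issue concrete, here is a small Python sketch under assumed conditions: two hypothetical groups (`control`, `treated`) with lognormal noise. It shows that a difference of means on the log10 scale back-transforms to a fold-change (a ratio of geometric means), and that the back-transformed log-scale mean is the geometric mean, which is smaller than the arithmetic mean. A log-link model, by contrast, targets the log of the arithmetic mean. The group names and parameter values are invented for illustration.

```python
import math
import random
import statistics

# Hypothetical two-group data with multiplicative (lognormal) noise.
random.seed(1)
control = [10 ** random.gauss(1.0, 0.3) for _ in range(500)]
treated = [10 ** random.gauss(1.3, 0.3) for _ in range(500)]

def mean_log10(values):
    """Mean of the log10-transformed values (what a log-scale model estimates)."""
    return statistics.fmean([math.log10(v) for v in values])

# A difference on the log10 scale back-transforms to a ratio, not a difference.
diff_log = mean_log10(treated) - mean_log10(control)
fold_change = 10 ** diff_log

# Back-transforming a log-scale mean gives the geometric mean.
geo_control = 10 ** mean_log10(control)
geo_treated = 10 ** mean_log10(treated)

# Arithmetic means on the original scale, for comparison.
arith_control = statistics.fmean(control)
arith_treated = statistics.fmean(treated)

print(f"fold-change from log-scale difference: {fold_change:.3f}")
print(f"ratio of geometric means:              {geo_treated / geo_control:.3f}")
print(f"control geometric mean:  {geo_control:.2f}")
print(f"control arithmetic mean: {arith_control:.2f}")
print(f"arithmetic difference:   {arith_treated - arith_control:.2f}")
```

If the thesis needs an arithmetic difference on the original scale, the log-transformed analysis does not give it directly, which is one practical reason to consider modelling the response distribution instead of transforming the response.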

You can read Difference between "least square" and "generelized linear method" in the fit model for more information.

 

I hope this helps as a conversation starter,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)