Hi @Janneman,
There is a fairly long list of questions to answer first, to better understand the purpose of your model and its applicability/coverage/generalization:
- What is your objective behind the modeling: explanatory, predictive, both ... ?
- How were the data collected? Was any design involved in the data collection?
- More questions to help think about the expected "performance" of the model can be found in the "best model" discussion.
Besides the evaluation metrics you could use to compare models on explanatory performance (R² / adjusted R², unitless), predictive performance (RMSE, MAE, ...), statistical significance (whole-model p-value, individual term p-values, ...), and model complexity/fit (AICc, BIC, ...), I think it is important to first evaluate and "debug" the model through residual visualization and analysis. See Model reduction for more ideas about model comparison and selection.
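Since your question is about JMP, this is only an illustration, but here is a minimal Python/statsmodels sketch of those checks, assuming a data table `df` with a response `y` and predictors `x1`, `x2` (all placeholder names, not from your post):

```python
# Minimal sketch: fit a least-squares model, print comparison metrics,
# and look at the residuals before trusting any of those metrics.
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

model = smf.ols("y ~ x1 + x2", data=df).fit()

# Explanatory performance and complexity
print("R2:", model.rsquared)
print("Adjusted R2:", model.rsquared_adj)
print("AIC:", model.aic, "BIC:", model.bic)   # plain AIC here; JMP reports AICc
print("Whole-model p-value:", model.f_pvalue)
print(model.pvalues)                           # individual term p-values

# Predictive performance (on the training data; use a hold-out set in practice)
resid = model.resid
print("RMSE:", np.sqrt(np.mean(resid ** 2)), "MAE:", np.mean(np.abs(resid)))

# Residual "debugging": look for curvature, funnel shapes, non-normality
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(model.fittedvalues, resid)
axes[0].axhline(0, color="grey")
axes[0].set(xlabel="Fitted values", ylabel="Residuals", title="Residuals vs fitted")
sm.qqplot(resid, line="45", fit=True, ax=axes[1])
plt.tight_layout()
plt.show()
```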
Transforming a response means that you transform both the mean of the response and its variance at the same time. You can then check whether the residuals of the transformed-response model respect the regression assumptions better than those of the original model. If the situation doesn't improve, you could instead use a Generalized Regression Model, which handles the mean and the variance independently: this type of model uses a link function to relate the mean to a linear function of the predictors, and a variance function to allow for variance heterogeneity, rather than trying to transform everything at once (for example through a log transform). Whether you transform the response completely (through a log transform or a Box-Cox Y Transformation) or use a Generalized Regression Model, you should end up with a model that is as simple as or simpler than your original one (fewer terms, so lower AICc/BIC).
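To make the distinction concrete, here is a small sketch contrasting the two approaches, again in Python/statsmodels with the same placeholder `df`, `y`, `x1`, `x2`; the Gamma family with a log link is only an example of a variance/link choice, not a recommendation for your data:

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Option 1: transform the response itself, then fit ordinary least squares.
# This changes the modelled mean and the assumed error structure together.
ols_log = smf.ols("np.log(y) ~ x1 + x2", data=df).fit()

# Option 2: generalized model: the log link transforms only the mean,
# while the Gamma variance function lets the variance grow with the mean.
glm_gamma = smf.glm(
    "y ~ x1 + x2",
    data=df,
    family=sm.families.Gamma(link=sm.families.links.Log()),
).fit()

print(ols_log.summary())
print(glm_gamma.summary())
```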
I wouldn't transform the response unless I had a strong indication that it is needed (a simpler model with the transformed response and better residual patterns).
See: Difference between "least square" and "generelized linear method" in the fit model for more info about the differences between these two types of models.
Once this statistical evaluation is done, and if both models can be kept, you could then use normalized and/or unitless predictive metrics to compare them, so the comparison isn't biased by the transformation (an RMSE computed on log(y) is not directly comparable to an RMSE computed on y).
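As an illustration of such a comparison, here is a short sketch that scores both models from the previous snippet on the original response scale with a normalized RMSE; the simple exp back-transform used here is an assumption of mine and ignores retransformation bias:

```python
import numpy as np

def nrmse(y_true, y_pred):
    """Root mean squared error normalized by the mean of the observed response (unitless)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.mean(y_true)

y = df["y"].to_numpy()

# Put both sets of predictions on the same (original) scale before comparing.
pred_ols_log = np.exp(ols_log.fittedvalues)   # back-transform the log-scale fit
pred_glm = glm_gamma.fittedvalues             # GLM predictions are already on the y scale

print("NRMSE, log-transformed OLS:", nrmse(y, pred_ols_log))
print("NRMSE, Gamma GLM:", nrmse(y, pred_glm))
```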
Hope this answer helps,
Victor GUILLER
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)