Solved: Interpretation of Dummy Variables in Stepwise Regression with {0-1} Next to Vari...

AlphaStarfish74 · Nov 21, 2023 01:53 PM

Just a quick question on interpreting my dummy variables in a stepwise regression.

I have two categorical variables with two categories in each (Lower vs Upper cluster, and Male vs Female) coded as: Male = 1, and Lower Cluster = 1.

I am confused by the { 0 - 1 } next to the categorical variable name under "parameter" in the screenshot below. Take the variable gender: does this mean a value of 1 (male) has a coeff of 0.947, or a value of 0 (female) has a coeff of 0.947?

Victor_G · Nov 22, 2023 03:29 AM

Hello @AlphaStarfish74,

Welcome to the Community!

The coding of nominal factors can differ depending on which modeling platform you use. You don't need to code the factors yourself: you could have left the levels as "Male/Female" in the column "Gender", or "Cluster1/Cluster2" in the column "Cluster Lower".

In the Stepwise platform with the "Combine" rule, categorical variables are coded hierarchically. The values you see between braces show how the levels are grouped in the term, using the split that best separates the mean of the response. Since each of your categorical factors has only two levels, you only see {L1-L2}, where L1 and L2 are the level names of your factor.

As for the parameter estimate, it represents half the difference in mean response between the two levels: with the notation {L1-L2}, going from level L2 to level L1 changes the mean response by twice the estimate, with the other terms in the model held fixed. So in your case, for "Gender" shown as {0-1}, changing the level from 1 (male) to 0 (female) increases the mean response by 2 × 0.947, so approximately 1.894.
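To make the "half difference" reading concrete, here is a minimal sketch in Python with NumPy rather than JMP. The data are made up for illustration, and the +1/-1 coding of the {0-1} term (level 0 coded +1, level 1 coded -1) is an assumption about how the term is coded, based on the explanation above. With that coding, the least-squares estimate equals half the difference between the two group means.

```python
import numpy as np

# Hypothetical data: a two-level factor coded 0/1 and a response (made-up values).
level = np.array([0, 0, 0, 1, 1, 1, 1])
y = np.array([5.0, 6.0, 7.0, 3.0, 4.0, 4.5, 4.5])

# Assumed coding for a term displayed as {0-1}: level 0 -> +1, level 1 -> -1.
x = np.where(level == 0, 1.0, -1.0)

# Ordinary least squares fit of y = intercept + estimate * x.
X = np.column_stack([np.ones_like(x), x])
intercept, estimate = np.linalg.lstsq(X, y, rcond=None)[0]

# Compare the estimate with half the difference between the group means.
mean_0 = y[level == 0].mean()
mean_1 = y[level == 1].mean()
half_difference = (mean_0 - mean_1) / 2

print("estimate:", estimate)
print("half difference of means:", half_difference)
print("change in mean when going from level 1 to level 0:", 2 * estimate)
```

With this coding, the intercept is the midpoint between the two group means, so a positive estimate means level 0 sits above that midpoint and level 1 sits below it.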

You can find more information about the nominal coding of factors in the different platforms here: Models with Nominal and Ordinal Effects

And an example of a nominal factor in a model: Example of a Model with a Nominal Term

Hope this answer will help you,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)

AlphaStarfish74 · Nov 24, 2023 02:02 PM

Hey thanks so much for this response, I really appreciate it! Links provided are great.

I think what had me second-guessing was that when I ran the nominal variable alone, it looked like females were more negative, but when I added it into the final model, it looked like they were more positive. I'm not sure why the sign changed from "-" to "+".

vs

Any thoughts on why this could be?

Victor_G · Nov 24, 2023 03:02 PM

Hi @AlphaStarfish74,

Glad the answer was helpful!

On your second question, there are two possible reasons why the parameter estimate for "Sex" differs between the two reports:

1. You're comparing two models that do not include the same effects (and that have very different explanatory power, as their very different R² values show), so the estimates will be different. A model is not a fixed equation: it changes based on which effects it includes (and on the type of modeling/analysis). In your first case, every response is modeled through a very simple model, Comp Score = Intercept + a1 × [Sex]. In the second, you still have that coefficient to estimate, but also the ones for "Education", "Village Cluster" and the other factors. To account for the influence of these additional factors on the same response values, the estimate for "Sex" will be different, as will the Intercept.
2. Depending on the modeling platform, the coding of nominal factors is different, which can also explain the difference in parameter estimate values. You can find more details about the coding of nominal factors in the "Stepwise" and "Fit Model" platforms here:
- Nominal coding in Fit Model platform
- Nominal coding in Stepwise platform
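The first reason can be illustrated with a small sketch in Python with NumPy rather than JMP. Everything here is hypothetical: the numbers are made up, and "edu" simply stands in for a second factor (such as "Education") that is correlated with sex. The response is built so that sex has a positive effect once education is accounted for, yet the sex-only model shows a negative coefficient because the group coded 1 has lower education values in this toy data.

```python
import numpy as np

# Hypothetical data: sex coded 0/1, and an education score that differs by group.
sex = np.array([1, 1, 1, 0, 0, 0], dtype=float)
edu = np.array([0, 1, 2, 2, 3, 4], dtype=float)

# Response constructed (without noise) as: y = 1 * sex + 2 * edu.
y = 1.0 * sex + 2.0 * edu


def fit(*columns):
    """Least-squares fit of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Coefficient of sex when it is the only effect in the model.
sex_only = fit(sex)[1]

# Coefficient of sex when education is also in the model.
sex_adjusted = fit(sex, edu)[1]

print("sex coefficient, sex-only model:", sex_only)
print("sex coefficient, model with education:", sex_adjusted)
```

In this toy example the sex-only coefficient is negative (about -3) while the adjusted coefficient is positive (about +1), even though the response values never changed. Whether this is what happened with the real data depends on how the other factors relate to sex in that data set.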

I hope this complementary answer will help you understand the differences between your models,

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)
