My domain is Consumer Research. I have 7 explanatory variables and 1 response variable. All are categorical.
My objective is to replace the response variable with one or a combination of the explanatory variables.
My approach:
- Perform contingency analyses of each of the explanatory variables with the response.
- From the contingency analysis, select the explanatory variable with the highest R-square (U). Because the number of levels differs from one explanatory variable to the next, I don’t think the Likelihood Ratio or Pearson chi-square values would be helpful for model selection.
- From the Measures of Association table, use Lambda and Uncertainty values. Choose the explanatory variable with the highest values.
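For readers who want to reproduce the two association measures outside the stats package, here is a minimal sketch in plain Python. It assumes the table is laid out with the explanatory variable on the rows and the response on the columns, and it computes only the "column given row" direction. The function names and the toy counts are made up for illustration. I am also assuming that R-square (U) is the same quantity as the uncertainty coefficient for the response given the explanatory variable.

```python
import math


def entropy(counts):
    """Shannon entropy (in nats) of a list of counts; zero cells are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def uncertainty_coefficient(table):
    """U(column | row): share of the column variable's entropy removed by knowing the row."""
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(col_totals)
    h_col = entropy(col_totals)
    h_col_given_row = sum(
        (sum(row) / n) * entropy(row) for row in table if sum(row) > 0
    )
    return (h_col - h_col_given_row) / h_col


def goodman_kruskal_lambda(table):
    """Lambda(column | row): proportional reduction in modal-guess errors."""
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(col_totals)
    hits_without_row = max(col_totals)             # always guess the overall mode
    hits_with_row = sum(max(row) for row in table)  # guess the mode within each row
    return (hits_with_row - hits_without_row) / (n - hits_without_row)


# Hypothetical 2 x 2 table of counts (rows: explanatory levels, columns: response levels).
toy = [[30, 10],
       [10, 30]]
u_toy = uncertainty_coefficient(toy)
lambda_toy = goodman_kruskal_lambda(toy)
print(round(u_toy, 3), lambda_toy)
```

Both measures run from 0 (no predictive value) to 1 (perfect prediction), but neither one penalizes an explanatory variable for having many levels.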
Below is a summary describing the variables and the results of the contingency analyses.
Variable | # levels | Rsq(U) | LR chi-sq | Pearson chi-sq | Lambda asym. (C given R, R given C) | Lambda sym. | Uncertainty coef. (C given R, R given C) | Uncertainty coef. (sym.) |
A | 4 | .08 | 248 | 268 | .08, .13 | .1 | .08, .08 | .08 |
B | 3 | .02 | 76 | 75 | .06, .02 | .04 | .02, .03 | .03 |
C | 10 | .04 | 136 | 153 | .05, .04 | .045 | .04, .03 | .034 |
D | 10 | .31 | 961 | 1207 | .33, .13 | .22 | .3, .2 | .245 |
E | 18 | .34 | 1056 | 1498 | .34, .1 | .21 | .34, .19 | .24 |
F | 40 | .345 | 1084 | 1590 | .34, .05 | .17 | .35, .15 | .21 |
G | 6 | .32 | 1000 | 1286 | .35, .28 | .31 | .32, .26 | .29 |
The response variable has 6 levels.
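One possible way to put the LR chi-square values on a comparable footing despite the different numbers of levels is to subtract a penalty for the degrees of freedom each table uses. The sketch below treats each LR chi-square as the deviance improvement over the independence model and applies an AIC-style penalty of 2 per degree of freedom, with df = (levels - 1) x (response levels - 1). This assumes every level actually occurs in the data and that there are no structural zeros. The name `aic_gain` is made up for illustration.

```python
RESPONSE_LEVELS = 6

# (number of levels, LR chi-square) copied from the summary table.
results = {
    "A": (4, 248),
    "B": (3, 76),
    "C": (10, 136),
    "D": (10, 961),
    "E": (18, 1056),
    "F": (40, 1084),
    "G": (6, 1000),
}


def aic_gain(levels, lr_chisq, response_levels=RESPONSE_LEVELS):
    """LR chi-square minus 2 * df: the AIC improvement over the independence model."""
    df = (levels - 1) * (response_levels - 1)
    return lr_chisq - 2 * df


# Rank variables from largest to smallest penalized improvement.
ranking = sorted(results, key=lambda name: aic_gain(*results[name]), reverse=True)
for name in ranking:
    print(name, aic_gain(*results[name]))
```

Under these assumptions, G comes out ahead of E, D, and F even though F has the largest raw R-square (U), because F spends 195 degrees of freedom to get there.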
The levels of variables D, E, F, and G are ordered by ascending intensity. It is assumed the low and high boundaries are similar.
Questions:
- Is the approach as outlined valid?
- Can I combine one of the variables from D, E, F, G (I am inclined to select G) with one or more from A, B, C to get a better model (i.e., a better replacement for the response)? If so, how might one do this, and what metrics might be used to select the best model?
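To make the second question concrete, here is a sketch of one way a combination could be evaluated: cross two explanatory variables into a single composite variable and compare it with the single variable on both U and a penalized LR chi-square. This is only an illustration. The records are invented toy data, not my survey, and `fit_summary` is a made-up helper. Crossing can never lower U, so an unpenalized comparison would always favor the combination; the penalty of 2 per degree of freedom is an assumed, AIC-style choice.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in nats) of a collection of counts; zero cells are skipped."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def fit_summary(predictor, response):
    """Return (U, LR chi-square, AIC-style gain) for one categorical predictor."""
    n = len(response)
    response_levels = sorted(set(response))
    h_y = entropy(Counter(response).values())
    cells = Counter(zip(predictor, response))
    groups = Counter(predictor)
    h_y_given_x = sum(
        (size / n) * entropy([cells[(x, y)] for y in response_levels])
        for x, size in groups.items()
    )
    lr_chisq = 2 * n * (h_y - h_y_given_x)
    df = (len(groups) - 1) * (len(response_levels) - 1)
    return (h_y - h_y_given_x) / h_y, lr_chisq, lr_chisq - 2 * df


# Invented toy records: (level of G, level of A, response).
records = (
    [("low", "x", "no")] * 20
    + [("low", "y", "no")] * 15
    + [("low", "y", "yes")] * 5
    + [("high", "x", "no")] * 5
    + [("high", "x", "yes")] * 15
    + [("high", "y", "yes")] * 20
)
g = [r[0] for r in records]
a = [r[1] for r in records]
y = [r[2] for r in records]
g_by_a = list(zip(g, a))  # composite variable: one level per observed (G, A) pair

u_g, lr_g, gain_g = fit_summary(g, y)
u_ga, lr_ga, gain_ga = fit_summary(g_by_a, y)
print("G alone:", round(u_g, 3), round(gain_g, 1))
print("G x A  :", round(u_ga, 3), round(gain_ga, 1))
```

With many levels the crossed table gets sparse quickly (G crossed with C would have up to 60 cells per response level), so a held-out misclassification rate might be a safer selection metric than any in-sample measure.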