Solved: Categorical variables/New values for categorical variable?

Annapurna20 · Sep 28, 2017 9:28 AM

Hi,

I'm super new to predictive modeling, and I'm hoping someone can help.

I have a data set that has 7 variables, 4 of which are categorical (2 nominal, 2 ordinal). I have partitioned my data into training and validation. I also have a "new" data set on which I want to run my model.

In my training/validation partitions I have ordinal variable x1, which has the following values: b, c, d, e, f. I would like to be able to account for variable x1 having the following values in the new data set: a, b, c, d, e, f, g, h, knowing that a is better than b, and g and h are worse than f. What is the best approach for doing something like this? I thought perhaps I could create extra dummy variable columns to account for the new values that are in the new data set, but it doesn't work very well.

Also, I have nominal variable x2. The only information I have is that value "J" commands a higher price than "none". Is there a way to build this into a model? Is a conditional formula the way to go?

Thanks in advance for any help!

jiancao · Oct 2, 2017 7:02 AM

If I understand your #1 correctly, x1 in your training and validation data sets doesn't have levels a, g and h, but your "new" data does. If so, you wouldn't be able to make predictions with x1 from the new data, simply because you don't have estimates for X1a, X1g and X1h. You could mix the two data sets and then randomly redraw your training and validation partitions.
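
A quick way to check which levels are affected is to compare the levels in the two data sets. This is only a minimal sketch in plain Python: the level lists are copied from the question, and the helper name `unseen_levels` is made up for illustration.

```python
# Level lists taken from the question; in practice you would read them
# from the training/validation data and from the "new" data set.
train_levels = ["b", "c", "d", "e", "f"]
new_levels = ["a", "b", "c", "d", "e", "f", "g", "h"]


def unseen_levels(train, new):
    """Return levels that occur in `new` but not in `train`,
    in order of first appearance and without repeats."""
    seen = set(train)
    result = []
    for level in new:
        if level not in seen:
            seen.add(level)  # avoid reporting the same level twice
            result.append(level)
    return result


missing = unseen_levels(train_levels, new_levels)
print(missing)  # ['a', 'g', 'h']
```

Any level that shows up in `missing` has no estimate in the fitted model, so rows with those levels cannot be scored using x1 as a categorical effect.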

Regarding #2, you could enter x2 into your model as an ordinal variable to account for the ordering, J vs. none. (Note: the difference between nominal coding and ordinal coding is just the interpretation of the estimates of that variable; it doesn't affect the parameter estimates of other variables, except the intercept, or the goodness of fit.)
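
To illustrate the note for a two-level variable, here is a rough sketch in plain Python. The prices are made-up toy numbers, and the two codings (0/1 and -1/+1) are assumptions chosen for illustration, not necessarily what any particular package uses internally. The intercept and slope differ between codings, but the fitted values are the same.

```python
# Toy data, invented for illustration only.
x2 = ["none", "J", "none", "J", "J", "none"]
price = [10.0, 14.0, 11.0, 15.0, 16.0, 12.0]


def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Two hypothetical codings of the same two-level variable.
coding_a = {"none": 0.0, "J": 1.0}   # slope = J minus none
coding_b = {"none": -1.0, "J": 1.0}  # slope = half of J minus none

fits = {}
for name, coding in (("a", coding_a), ("b", coding_b)):
    x = [coding[v] for v in x2]
    intercept, slope = fit_line(x, price)
    fitted = [intercept + slope * xi for xi in x]
    fits[name] = (intercept, slope, fitted)
    print(name, intercept, slope, fitted)
```

Note that neither coding forces J to come out higher than none; the sign of the estimate still comes from the data.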

