fmjames
Level I

How to deal with many levels for independent variable while doing modelling?

Hi,

I have a variable called 'Brand' in my dataset, which is the brand name of the product. It has about 2300 levels, and I'm looking to use it as one of the independent variables in a predictive model where price is the target variable. Can someone please let me know how to deal with so many levels in an independent variable?

5 REPLIES
phil_kay
Staff

Re: How to deal with many levels for independent variable while doing modelling?

Here is an idea. I am not sure it is the most useful solution but I would try it.

You could create indicator variable columns: 1 column per brand (2300 indicator columns in total), with 1 or 0 indicating whether the row is that brand (1) or not (0). Use Cols > Utilities > Make Indicator Columns. You could then use a variable selection technique like Bootstrap Forest to determine which brands have the most effect on Price. Then you could simplify the brand column so that it has one level for each of the most important predictor brands and groups the rest into an "other" level. That would simplify the modelling challenge.
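
For anyone who wants to see the logic outside JMP, here is a minimal Python sketch of the steps above. The data are made up, the function names are hypothetical, and a simple difference in mean price stands in for the Bootstrap Forest importance ranking, so treat it as an illustration of the idea rather than a substitute for the platform.

```python
from statistics import mean

# Hypothetical toy data: (brand, price) rows standing in for the real table.
rows = [
    ("Acme", 10.0), ("Acme", 12.0),
    ("Luxo", 95.0), ("Luxo", 105.0),
    ("Midi", 48.0), ("Midi", 52.0),
    ("Bargain", 9.0), ("Bargain", 11.0),
]


def indicator_columns(rows):
    """Build one 0/1 column per brand (1 = the row is that brand)."""
    brands = sorted({brand for brand, _ in rows})
    return {b: [1 if brand == b else 0 for brand, _ in rows] for b in brands}


def rank_brands(rows):
    """Rank brands by |mean price inside the brand - mean price outside it|.

    This crude score is only a stand-in for a real importance measure.
    """
    prices = [price for _, price in rows]
    scores = {}
    for brand, column in indicator_columns(rows).items():
        inside = [p for p, flag in zip(prices, column) if flag]
        outside = [p for p, flag in zip(prices, column) if not flag]
        scores[brand] = abs(mean(inside) - mean(outside)) if outside else 0.0
    return sorted(scores, key=scores.get, reverse=True)


def collapse(rows, keep):
    """Keep the selected brands and relabel every other brand as 'Other'."""
    keep = set(keep)
    return [(brand if brand in keep else "Other", price) for brand, price in rows]


top = rank_brands(rows)[:2]      # keep the two highest-scoring brands
collapsed = collapse(rows, top)  # every other brand becomes "Other"
```

With 2300 real brands you would keep more than two levels; how many is a judgment call that depends on how much data each brand has.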
dale_lehman
Level VI

Re: How to deal with many levels for independent variable while doing modelling?

I think Phil's suggestion is a good one - if you have a lot of observations to work with. I've seen data like this, where a company recorded sales of a large number of different lines of items - and there were almost as many named products as observations (they chose to name every item of furniture somewhat differently). In that case, I think you have to reduce the dimensions to make any sense of the data. You might try using Text Explorer to do this. If the names are well defined, for example if all chairs have the word "chair" in their title, then you might be able to create a series of new columns that flag products containing key words, and so group the 2300 items into a much smaller set.
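
A rough Python sketch of the keyword idea, in case it helps to see it spelled out. The keyword list and the product names are hypothetical; in practice the keywords would come from looking at the most frequent terms in the names.

```python
import re

# Hypothetical keyword list chosen by inspecting the product names.
KEYWORDS = ["chair", "table", "sofa"]


def keyword_group(name, keywords=KEYWORDS):
    """Return the first keyword found in the name (singular or plural), else 'other'."""
    words = set(re.findall(r"[a-z]+", name.lower()))
    for keyword in keywords:
        if keyword in words or keyword + "s" in words:
            return keyword
    return "other"


names = ["Oslo Dining Chair", "Oak Side Table", "Bar Chairs (set of 2)", "Floor Lamp"]
groups = [keyword_group(name) for name in names]
```

The new grouping column can then replace the original names in the model, or be used alongside them.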

Re: How to deal with many levels for independent variable while doing modelling?

Yet another thought, akin to @dale_lehman's idea: you could try Multiple Correspondence Analysis as a dimensionality reduction method.

ron_horne
Super User

Re: How to deal with many levels for independent variable while doing modelling?

I would like to suggest two very different methods I have used in the past to deal with many categories. In my case they would be chronological geographical categories, but they should work in your case of brands.

  • If you have any exogenous knowledge about the brands, perhaps you can cluster them into a much more meaningful classification. For example, by country or by premium vs. main street vs. generic.
  • Using some method (e.g. ANOVA), rank the brands by price and look for any clusters and discontinuity points. Perhaps they cluster by price ranges.

Once you have these clusters by any of the methods mentioned, you can add this new variable to the model. This could be done instead of using the original detailed brand names, or in addition to them in a nested manner.
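
To illustrate the second method, here is a small Python sketch that ranks brands by mean price and starts a new tier wherever the jump between neighbouring means exceeds a threshold. The data and the `min_gap` threshold are made up, and the fixed gap is only a stand-in for a formal comparison of means such as ANOVA.

```python
from collections import defaultdict
from statistics import mean


def price_tiers(rows, min_gap):
    """Assign each brand a tier number based on gaps in the ranked mean prices."""
    prices_by_brand = defaultdict(list)
    for brand, price in rows:
        prices_by_brand[brand].append(price)

    # Rank brands from cheapest to most expensive by mean price.
    ranked = sorted((mean(prices), brand) for brand, prices in prices_by_brand.items())

    tiers = {}
    tier = 0
    for i, (brand_mean, brand) in enumerate(ranked):
        # A jump larger than min_gap is treated as a discontinuity point.
        if i > 0 and brand_mean - ranked[i - 1][0] > min_gap:
            tier += 1
        tiers[brand] = tier
    return tiers


# Hypothetical toy data: (brand, price) rows.
rows = [
    ("Bargain", 9.0), ("Bargain", 11.0),
    ("Acme", 10.0), ("Acme", 12.0),
    ("Midi", 48.0), ("Midi", 52.0),
    ("Luxo", 95.0), ("Luxo", 105.0),
]
tiers = price_tiers(rows, min_gap=20.0)
```

The tier number becomes the new clustered variable; plotting the ranked means first is a sensible way to choose the threshold.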

Let us know what worked for you,

Ron

 

phil_kay
Staff

Re: How to deal with many levels for independent variable while doing modelling?

Another simple way to reduce the number of levels would be with a partition model, with Price as the Y and Brand as the X. You can try different numbers of splits and save the leaf numbers and/or labels to create new variables.
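
For illustration only, here is a minimal Python sketch of what such a partition does with a single categorical X. It is not JMP's Partition algorithm: it orders the brands by mean price, then repeatedly makes the split that most reduces the within-leaf sum of squared errors, and returns a leaf number per brand. The function names and data are hypothetical.

```python
from collections import defaultdict


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    centre = sum(values) / len(values)
    return sum((v - centre) ** 2 for v in values)


def partition_brands(rows, n_splits):
    """Split brands into up to n_splits + 1 leaves by greedy SSE reduction."""
    prices_by_brand = defaultdict(list)
    for brand, price in rows:
        prices_by_brand[brand].append(price)

    def pooled(brands):
        return [p for b in brands for p in prices_by_brand[b]]

    # For a squared-error criterion, the best binary split of a categorical
    # predictor lies along the ordering of its levels by mean response.
    ordered = sorted(
        prices_by_brand,
        key=lambda b: sum(prices_by_brand[b]) / len(prices_by_brand[b]),
    )
    leaves = [ordered]

    for _ in range(n_splits):
        best = None  # (gain, leaf index, cut position)
        for i, leaf in enumerate(leaves):
            total = sse(pooled(leaf))
            for cut in range(1, len(leaf)):
                gain = total - sse(pooled(leaf[:cut])) - sse(pooled(leaf[cut:]))
                if best is None or gain > best[0]:
                    best = (gain, i, cut)
        if best is None:  # every leaf holds a single brand; nothing left to split
            break
        _, i, cut = best
        leaves[i:i + 1] = [leaves[i][:cut], leaves[i][cut:]]

    return {brand: leaf_no for leaf_no, leaf in enumerate(leaves) for brand in leaf}


# Hypothetical toy data: (brand, price) rows.
rows = [
    ("Bargain", 9.0), ("Bargain", 11.0),
    ("Acme", 10.0), ("Acme", 12.0),
    ("Midi", 48.0), ("Midi", 52.0),
    ("Luxo", 95.0), ("Luxo", 105.0),
]
leaf_of = partition_brands(rows, n_splits=2)
```

The returned leaf numbers play the role of the saved leaf column: a new variable with far fewer levels than the original Brand.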