I have a variable called 'Brand' in my dataset, which is essentially the brand name of the product. There are about 2300 levels for this variable, and I'm looking to use it as one of the independent variables for predictive modelling, where price is the target variable. Can someone please let me know how to deal with so many levels in an independent variable?
I think Phil's suggestion is a good one, provided you have a lot of observations to work with. I've seen data like this, where a company recorded sales of a large number of different lines of items and there were almost as many named products as observations (they chose to name every item of furniture somewhat differently). In that case, I think you have to reduce the dimensions to make any sense of the data. You might try using text explorer to do this. If the names are well defined, for example if all chairs have the word "chair" in their title, then you might be able to add a series of new columns that flag products containing key words, and use those to group the 2300 items into a much smaller set.
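For anyone working outside a point-and-click tool, here is a rough sketch of the keyword idea in plain Python. The keyword list, the example names, and the function names are all made up for illustration; a real list would come from looking at the actual names. It uses simple substring matching, so "chairs" is caught by "chair", but so would "comfortable" be caught by "table", and that may need tightening.

```python
# Hypothetical keyword list; in practice it would come from inspecting the names.
KEYWORDS = ["chair", "table", "sofa"]


def keyword_flags(name, keywords=KEYWORDS):
    """Return one 0/1 indicator per keyword (substring match, case-insensitive)."""
    lowered = name.lower()
    return {f"has_{kw}": int(kw in lowered) for kw in keywords}


def keyword_group(name, keywords=KEYWORDS, default="other"):
    """Assign the first matching keyword as the group, else a catch-all level."""
    lowered = name.lower()
    for kw in keywords:
        if kw in lowered:
            return kw
    return default


# Made-up product names, only to show the shape of the result.
names = ["Oak Dining Chair", "Walnut Coffee Table", "Linen Sofa, 3-seat", "Brass Lamp"]
groups = [keyword_group(n) for n in names]
flags = [keyword_flags(n) for n in names]
```

The `flags` dictionaries correspond to the "series of new columns" described above, and `groups` collapses them into a single variable with far fewer levels. Names matching no keyword fall into "other", which is worth inspecting before modelling.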
I would like to suggest two very different methods I have used in the past to deal with many categories. In my case the categories were chronological and geographical, but the methods should work for brands as well.
Once you have these clusters by any of the methods mentioned, you can add this new variable to the model. This could be done instead of using the original detailed brand names, or in addition to them in a nested manner.
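As a small sketch of what that looks like in data terms (plain Python, with an invented brand-to-cluster lookup and invented prices), the cluster becomes a new column, and a second column labels each brand within its cluster for the nested case. The per-cluster mean price is included only because it is what a model with the cluster as its single categorical predictor would fit; it is not meant as the final model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical brand -> cluster lookup, produced by whichever grouping method was used.
brand_to_cluster = {"Acme": "budget", "Bolt": "budget", "Crest": "premium", "Dune": "premium"}

# Made-up observations, only to show the shape of the data.
rows = [
    {"brand": "Acme", "price": 10.0},
    {"brand": "Bolt", "price": 14.0},
    {"brand": "Crest", "price": 40.0},
    {"brand": "Dune", "price": 50.0},
    {"brand": "Acme", "price": 12.0},
]

for row in rows:
    # Option 1: use this column instead of the original brand variable.
    row["brand_cluster"] = brand_to_cluster[row["brand"]]
    # Option 2: keep the brand as well, labelled as nested within its cluster.
    row["brand_in_cluster"] = f'{row["brand_cluster"]}/{row["brand"]}'

# Mean price per cluster: the fitted values of a cluster-only categorical model.
prices_by_cluster = defaultdict(list)
for row in rows:
    prices_by_cluster[row["brand_cluster"]].append(row["price"])
cluster_mean = {cluster: mean(prices) for cluster, prices in prices_by_cluster.items()}
```

With 2300 brands, the first option is usually the easier place to start, since the nested column still carries every brand level.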
Let us know what worked for you.