Level I

What variables to include in a real estate model vs not

I am working through an assignment that asks me to decide which variables to include in a real estate model. The goal of the model is to predict price. The variables are location, zip, address, beds, baths, sq ft, price, and lot size. The data set has missing values and outliers. I have two questions:


Is there a best practice for deciding which variables to include in a model? For example, if a variable has 30% missing values, do you exclude or include it?

Is there a best practice for deciding whether to include or exclude a variable based on whether it is continuous or nominal? For example, with zip vs. location vs. city, do we need all of those variables in the model, or will one carry more weight?

Appreciate the help.  Thanks!

Level VII

Re: What variables to include in a real estate model vs not

My thoughts:

1. The selection of variables for a study should first and foremost be a function of your hypotheses, that is, your explanation of WHY each variable should or should not have some relationship with the response variable.

2. A second issue is multicollinearity. Be careful about including multiple factors that are linked to the same hypothesis. For example, location and zip seem likely to be collinear.

3. Both missing data and outliers can have a dramatic effect on modeling. How much missing data is too much? It likely depends on the size of the entire data set. Why are you missing data?

4. I'm not sure of a best practice. Some start with "all of the variables" and remove those with little effect (backward elimination), and others build the model by adding variables one at a time (forward stepwise). In any case, the process is iterative and, unfortunately, the most efficient/effective methodology can depend on the situation.

5. You have more information and flexibility with continuous variables.
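To make points 2 and 3 a bit more concrete, here is a rough sketch of the kind of screening you might do before fitting anything. It is written in plain Python rather than JMP, and everything in it is an assumption for illustration: the toy listings, the column names, the helper functions, and the 1.5 x IQR outlier rule are all made up and are not taken from your data set or from any JMP platform.

```python
from statistics import quantiles

# Hypothetical toy listings (not real data); None marks a missing value.
listings = [
    {"zip": "27513", "city": "Cary", "sqft": 1800, "lot_size": None, "price": 350000},
    {"zip": "27513", "city": "Cary", "sqft": 2100, "lot_size": 0.25, "price": 410000},
    {"zip": "27519", "city": "Cary", "sqft": 2400, "lot_size": None, "price": 465000},
    {"zip": "27601", "city": "Raleigh", "sqft": 1500, "lot_size": 0.10, "price": 330000},
    {"zip": "27601", "city": "Raleigh", "sqft": None, "lot_size": 0.12, "price": 2900000},
    {"zip": "27603", "city": "Raleigh", "sqft": 1700, "lot_size": None, "price": 310000},
]


def missing_fraction(rows, column):
    """Share of rows where `column` is missing (None)."""
    return sum(r[column] is None for r in rows) / len(rows)


def iqr_outliers(rows, column, k=1.5):
    """Non-missing values lying more than k * IQR outside the quartiles.

    The 1.5 * IQR fence is a common rule of thumb, assumed here for illustration.
    """
    values = [r[column] for r in rows if r[column] is not None]
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


def is_nested(rows, fine, coarse):
    """True if every level of `fine` maps to exactly one level of `coarse`."""
    mapping = {}
    for r in rows:
        mapping.setdefault(r[fine], set()).add(r[coarse])
    return all(len(levels) == 1 for levels in mapping.values())


for col in ("sqft", "lot_size"):
    print(col, "missing fraction:", missing_fraction(listings, col))
print("price outliers:", iqr_outliers(listings, "price"))
print("zip nested in city:", is_nested(listings, "zip", "city"))
```

If every zip falls inside a single city, as in this toy data, zip already carries the city information, so putting both in the model adds redundancy rather than new signal. Real zip codes can cross city lines, so check this on your own data rather than assuming it. Likewise, a high missing fraction or a flagged outlier is a prompt to ask why the value is missing or extreme, not an automatic reason to drop the variable or the row.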

I suggest you start with the JMP tutorials like this one: