Re: Missing Nominal Data

jmessina944 · Nov 28, 2023 12:38 PM

Hi. I am doing a class project where I need to predict price from a number of predictor variables in a housing dataset. I have cleaned up all my numerical variables, getting rid of outliers. My question now involves two columns: parking_options and laundry_options. About 218,000 rows have missing values in these two columns. My question is what to do with the missing nominal data. It is a good portion of the dataset (out of 384,000 rows) and I don't want to get rid of these rows entirely. I know that sometimes data is left blank on purpose, but I don't think that is the case here. There are some large homes that have null values in laundry_options and parking_options, so I do not think I should just assume these null values mean no parking and no laundry.

P_Bartell · Nov 28, 2023 02:17 PM

I think you have two basic options. The first is to use the 'informative missing' capability in JMP. Here is a link to the explanation in the JMP documentation: Informative Missing in JMP. But this means you think, or are at least willing to assume, that there is some 'information' in the fact that the observations are missing. Say, people are unwilling to provide a response because of privacy concerns or some other reason that is common among the missing observations. Essentially, JMP treats the 'missingness' as an additional level of the nominal predictor variable. The second option is to just bite the bullet and only go with the observations that have values for all of the predictor variables. You've still got about 166,000 observations to work with. Lastly, I'd try both pathways and compare the practical conclusions reached by each. If few of the practical conclusions differ, then the 'missingness' didn't impact the final decisions.
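For anyone who wants to see the two pathways outside JMP, here is a minimal Python sketch. It is not JMP's Informative Missing implementation; it only mimics the idea described above (missingness as an extra level) next to the complete-case alternative. The column names come from the original question, while the rows, the "Missing" label, and the helper names are made up for illustration.

```python
# Toy rows: column names follow the thread, the values are invented.
rows = [
    {"price": 1200, "laundry_options": "w/d in unit", "parking_options": "attached garage"},
    {"price": 950, "laundry_options": None, "parking_options": None},
    {"price": 1800, "laundry_options": None, "parking_options": "carport"},
    {"price": 700, "laundry_options": "laundry on site", "parking_options": None},
]
NOMINAL = ("laundry_options", "parking_options")


def with_missing_level(rows, columns=NOMINAL, label="Missing"):
    """Pathway 1: keep every row and recode None as its own category."""
    return [
        {**row, **{c: (row[c] if row[c] is not None else label) for c in columns}}
        for row in rows
    ]


def complete_cases(rows, columns=NOMINAL):
    """Pathway 2: keep only rows where every nominal predictor is present."""
    return [row for row in rows if all(row[c] is not None for c in columns)]


recoded = with_missing_level(rows)
complete = complete_cases(rows)
print(len(recoded), "rows kept with a 'Missing' level")
print(len(complete), "rows kept as complete cases")
```

Fitting the same model to `recoded` and to `complete` and comparing the conclusions is the with-and-without check suggested above.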

hogi · Nov 28, 2023 04:06 PM

When using informative missing, does that mean that all rows with missing data belong to the same "group" from the point of view of information, i.e. that it's always the same "reason" why the information is missing?

i.e. NOT:
- missing because half of them have neither laundry nor parking and don't want to say it, and half of them are ultra rich, have 5 parking lots and 3 laundries, and don't want to say it.

P_Bartell · Nov 29, 2023 07:53 AM

Kind of. Even in your second scenario, there is a common thread, an unwillingness to respond for whatever reason. That's why using informative missing is kind of a leap of faith. And why I suggested fitting the model with and without informative missing to see if the practical conclusions are different between each analysis.

dale_lehman · Nov 29, 2023 08:40 AM

As someone who teaches this stuff, I have a couple of comments, in addition to the sound advice from P_Bartell. You should try both with and without informative missing. Along the same lines, you should investigate graphically how the houses with missing data compare with the houses with complete data regarding the dependent variable you are trying to predict (presumably price?). In other words, if the prices of houses with missing laundry options behave similarly to those of houses with those options (where "behave" means the relationship with things like square footage), then perhaps those options are not that important and you can "bite the bullet," as P_Bartell puts it. However, biting the bullet can mean ignoring that variable rather than ignoring the observations with missing data. Laundry options just may not be that important, and the reason this variable is missing may be random (this is what "informative missing" will investigate, but I would supplement it with additional graphical investigation).
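A numeric companion to that graphical check could look like the Python sketch below. It is only an illustration under stated assumptions: the `homes` tuples are invented toy data, price per square foot is just one possible way to summarize "the relationship with square footage," and the function name is hypothetical. A plot of price against square footage colored by missingness would show the same thing in more detail.

```python
from statistics import median

# Toy data: (price, square_feet, laundry_options); None marks a missing value.
homes = [
    (1200, 1000, "w/d in unit"),
    (900, 750, None),
    (2000, 1600, None),
    (1500, 1000, "laundry on site"),
    (800, 800, None),
    (1100, 1000, "w/d hookups"),
]


def price_per_sqft_by_missingness(homes):
    """Summarize price per square foot separately for missing and present rows."""
    groups = {"missing": [], "present": []}
    for price, sqft, laundry in homes:
        key = "missing" if laundry is None else "present"
        groups[key].append(price / sqft)
    return {
        name: {"n": len(vals), "median_price_per_sqft": median(vals)}
        for name, vals in groups.items()
        if vals
    }


summary = price_per_sqft_by_missingness(homes)
for name, stats in summary.items():
    print(name, stats)
```

If the two groups look alike on summaries like this, that supports treating the missingness as uninformative; if they differ clearly, the missing rows are probably a distinct kind of listing.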

The other thing I noticed is your comment that you got "rid of outliers." I realize some people teach that and some texts even instruct people to do it. I do not. I think it is a bad habit to get rid of data. Unless the outliers appear to be mistakes, they are part of your data, and I don't think you should ever get rid of them without further justification. It is far better to use a measure that is robust to outliers (e.g. the median rather than the mean, which you can investigate using quantile regression) than to eliminate the outliers.
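To illustrate the median-versus-mean point, here is a small Python sketch with invented prices. It is not a quantile regression; it only shows the intercept-only case, where the median is the value that minimizes the absolute-error ("pinball") loss at quantile 0.5. The function name and numbers are assumptions for illustration.

```python
from statistics import mean, median


def pinball_loss(values, q, tau=0.5):
    """Quantile (pinball) loss of predicting q for every value at quantile tau."""
    return sum(max(tau * (v - q), (tau - 1) * (v - q)) for v in values)


# Invented prices; the last one is an extreme but possibly genuine value.
prices = [100, 110, 120, 130, 5000]
without_extreme = prices[:-1]

print("without extreme value:", mean(without_extreme), median(without_extreme))
print("with extreme value:   ", mean(prices), median(prices))

# The median gives a lower pinball loss than the mean at tau = 0.5.
print(pinball_loss(prices, median(prices)), pinball_loss(prices, mean(prices)))
```

The mean moves a long way when the extreme price is included, while the median barely changes, so the extreme row can stay in the data without dominating the summary.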

Mark_Bailey · Nov 29, 2023 10:10 AM

Historically, all the great discoveries have started with outliers. The results were not expected. Throwing them out means you can't discover something new. The model might explain these outlying results.

Victor_G · Nov 29, 2023 10:20 AM

Excellent points added by @dale_lehman about outliers.
And to add to the comment of @Mark_Bailey, in the presence of outliers you can choose models that are less impacted by outliers, like tree-based methods (Boosted Tree, Decision Tree, Random Forest, ...) or add regularization to your linear model (LASSO regression for example).

You may also play with different parameters or alternative solutions, like weighting, to reduce the influence of outliers on your model.
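As one possible way to picture the weighting idea, here is a minimal Python sketch of a straight-line fit with Huber-style weights, refit iteratively. This is my own choice of weighting scheme for illustration, not a description of any JMP option; the data, the tuning constant `k=1.345`, and the function names are assumptions.

```python
from statistics import median


def weighted_line(x, y, w):
    """Weighted least-squares fit of y = a + b * x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return my - b * mx, b


def huber_fit(x, y, k=1.345, iterations=20):
    """Iteratively reweighted fit: large residuals get weights below 1."""
    w = [1.0] * len(x)
    for _ in range(iterations):
        a, b = weighted_line(x, y, w)
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale from the median absolute residual (fallback avoids zero).
        scale = median(abs(r) for r in resid) / 0.6745 or 1.0
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
    return a, b, w


# Invented data: roughly y = 100 * x, with one extreme point at x = 4.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [105, 195, 310, 2000, 490, 610, 695, 805]

a_ols, b_ols = weighted_line(x, y, [1.0] * len(x))
a_rob, b_rob, weights = huber_fit(x, y)
print("ordinary slope:", round(b_ols, 2))
print("weighted slope:", round(b_rob, 2))
print("weight of the extreme point:", round(weights[3], 3))
```

The extreme point stays in the data but ends up with the smallest weight, so the fitted slope sits closer to the trend of the other points than the ordinary least-squares slope does.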

Victor GUILLER

"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)