I'm attempting to do some logistic regression modeling on a huge dataset consisting of customer behavior (Good/Bad) and demographic predictors (income, gender, etc.). I have almost 45K records but there is a lot of missing data (only about 23K complete records).
I would like to use the Validate feature in Fit Model but can't seem to get it to work. I have a column that contains only 3 unique values ("Training", "Validation", "Test") which I enter as the Validation column but it seems that it just continues to fit all the data.
I was also wondering why I always get significant lack of fit in my models. Is this simply a matter of the huge sample size?
Thanks for the response! I'm using JMP 9 and am using the Fit Model platform. I've tried several different models (main effects only and main effects plus two-factor interactions for different subsets of the predictors) and the lack of fit is always significant.
By the sound of it, the significant lack of fit is probably due to the sample size. If the model consistently over- or underestimates the observed response rate for any particular combination of demographic factors, that contributes to the lack-of-fit statistic, and an exceptionally large number of data points then pushes its significance sky-high. All that's really saying is that you've got an awful lot of evidence that the model isn't exactly right (which you probably already knew anyway).
Probably of greater interest than the lack-of-fit would be the percentage difference between the observed and expected response: if it's really very small, the significance of the lack of fit probably wouldn't matter too much. I'm afraid I'm not familiar with the "Validate" feature yet, but I'll try it out myself with some data of my own and see what happens.
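To put rough numbers on the sample-size effect, here is a minimal Python sketch (not JMP, and not the lack-of-fit test JMP actually runs). It assumes a single demographic group in which the model expects a 20% response rate but 21% is observed. Both rates and the function names are made up for illustration. It applies a 1-degree-of-freedom Pearson chi-square test to that one group at several sample sizes.

```python
import math

def pearson_chi2_one_cell(n, observed_rate, expected_rate):
    """Pearson chi-square statistic (1 df) for one group of n binary outcomes."""
    obs_hits = n * observed_rate
    exp_hits = n * expected_rate
    obs_misses = n - obs_hits
    exp_misses = n - exp_hits
    return ((obs_hits - exp_hits) ** 2 / exp_hits
            + (obs_misses - exp_misses) ** 2 / exp_misses)

def chi2_pvalue_1df(stat):
    """Upper-tail p-value for a chi-square with 1 df: P(Z^2 > stat)."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical rates: the model is off by one percentage point in every case.
for n in (500, 5000, 45000):
    stat = pearson_chi2_one_cell(n, observed_rate=0.21, expected_rate=0.20)
    print(n, round(stat, 3), chi2_pvalue_1df(stat))
```

The one-point gap between observed and expected is identical in all three runs. Only the p-value changes, from clearly non-significant at 500 rows to far below any usual threshold at 45,000.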
David is correct - with huge sample sizes, most statistical tests of assumptions become nearly useless, because they have enough power to flag even trivially small departures. In grad school I ran an experiment on processing efficiency in the R language and gathered so much data that I couldn't use any tests for normality, homoskedasticity, etc., and had to rely on visual assessments.
As for the Fit Model platform - I do not see any place to specify a training/validation identifier column. Searching through the JMP help files, it appears that there should be a validation area for Logistic Regression in the Fit Model platform, but I cannot identify it. The "JMP Pro Features" site says that the validation column role in many modeling platforms is exclusive to JMP Pro, Version 9.
If you can tell me how you're specifying the validation column, I may be able to help further.
The Validation column in Fit Model Logistic is only available in JMP 9 Pro. But you can still exclude some rows from the fitting process, and use those rows later to assess model fit.
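Once the model has been fit on the non-excluded rows and its predicted probabilities saved for every row, the held-out rows can be scored separately. A minimal Python sketch of that bookkeeping follows (outside JMP). It assumes each row has been exported as a (role, actual, predicted probability) triple, with actual coded 1 for "Bad" and 0 for "Good". The function name, the 0.5 cutoff, and the sample rows are all hypothetical.

```python
from collections import defaultdict

def summarize_by_role(rows, cutoff=0.5):
    """Summarize fit separately for each role ("Training", "Validation", ...).

    rows: iterable of (role, actual, predicted_prob), where actual is 0 or 1.
    """
    totals = defaultdict(lambda: {"n": 0, "errors": 0, "actual": 0, "pred": 0.0})
    for role, actual, prob in rows:
        t = totals[role]
        t["n"] += 1
        # A row is misclassified when the thresholded prediction disagrees
        # with the actual outcome.
        t["errors"] += int((prob >= cutoff) != bool(actual))
        t["actual"] += actual
        t["pred"] += prob
    return {
        role: {
            "n": t["n"],
            "misclassification_rate": t["errors"] / t["n"],
            "observed_rate": t["actual"] / t["n"],
            "mean_predicted": t["pred"] / t["n"],
        }
        for role, t in totals.items()
    }

# Made-up rows purely for illustration.
example_rows = [
    ("Training", 1, 0.9),
    ("Training", 0, 0.2),
    ("Validation", 1, 0.4),
    ("Validation", 0, 0.1),
]
for role, stats in summarize_by_role(example_rows).items():
    print(role, stats)
```

If the validation rows show a much higher misclassification rate than the training rows, or a larger gap between observed rate and mean predicted probability, that suggests overfitting.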
If you only have access to Standard JMP and want some sort of automated validation, the Partition platform will work. You can specify a Validation Portion, or use K-Fold CrossValidation. The Partition platform doesn't result in a nice prediction equation (rather, it gives a set of rules), but it can result in good predictions, and it will allow you to use Validation. It's also easy to assess the importance of the X variables by using the Column Contributions option.
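For anyone unfamiliar with the idea behind K-fold cross-validation, here is a generic Python sketch of how rows get divided. It is not a description of how Partition implements it internally, and the function name and seed are illustrative assumptions. Each row serves as holdout exactly once and as training data in the other k-1 rounds.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (training_indices, holdout_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # random row order, reproducible via seed
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i, holdout in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, holdout

# Example: 10 rows split into 5 folds of 2 rows each.
for training, holdout in k_fold_indices(10, 5):
    print(sorted(holdout), len(training))
```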
The Neural platform in Standard JMP also provides validation. If being able to easily interpret the model coefficients is important, then Neural may not be what you want. But, like Partition, it can result in good predictions as long as there is structure in the data to be modeled.
The Neural and Partition platforms are intended for large amounts of data, which you have. If you use those two platforms, I strongly recommend using the Validation features, since those two platforms can easily overfit without validation. In fact, with Neural, you have to use validation.