I performed a logistic regression analysis on a data set that contains 6 binary factors and 1 continuous factor. I then repeated the analysis after converting the continuous factor to binary by thresholding (0 if <= threshold, 1 otherwise).
With the first analysis I get the following Lack of Fit table in the report:
Lack Of Fit
Source | DF | -LogLikelihood | ChiSquare |
Lack Of Fit | 1.85e+7 | 4433.7627 | 8867.525 |
Saturated | 1.85e+7 | 387.7085 | Prob>ChiSq |
Fitted | 7 | 4821.4712 | 1.0000 |
When I repeat the analysis after converting the continuous variable to binary I get the following Lack of Fit table in which Prob>ChiSq is now 0.0009 in place of the earlier value of 1.0000.
Lack Of Fit
Source | DF | -LogLikelihood | ChiSquare |
Lack Of Fit | 120 | 87.1476 | 174.2951 |
Saturated | 127 | 4778.6278 | Prob>ChiSq |
Fitted | 7 | 4865.7753 | 0.0009* |
I cannot find the description of Lack of Fit in the documentation for Nominal Logistic Fit Report. In the context of "Lack of Fit", do I want the value to be close to 1 for the model to be fitting well to the data? What does the asterisk next to 0.0009 mean?
I do not recommend replacing a continuous predictor with a binary predictor. A binary variable is less informative because thresholding discards all of the variation on either side of the threshold.
The difference you observe is due to the change in the degrees of freedom (DF). In the first analysis the chi-square statistic of 8867.525 is tested against 1.85E+7 DF, while in the second analysis the corresponding statistic of 174.2951 is tested against only 120 DF.
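The lack of fit test is based on a comparison between the selected model and the saturated model (which is unbiased). The null hypothesis is that the two models fit equally well. Under the null hypothesis, the expected value of the chi-square statistic is equal to the DF, so a chi-square well above the DF is evidence of lack of fit. The associated p-value is the probability, under the null hypothesis, of obtaining a chi-square at least as large as the sample statistic from the analysis. A p-value close to 1 therefore gives no evidence of lack of fit, while a small p-value such as 0.0009 indicates significant lack of fit. In your first analysis the statistic is far below its DF, so the p-value is 1.0000. In the second analysis it is well above its DF, so the p-value is small.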
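To make the arithmetic concrete, here is a minimal Python sketch that recomputes both tables from the numbers posted above. It assumes that the ChiSquare column is twice the difference between the Fitted and Saturated -LogLikelihood values, which is consistent with both tables. The function names are hypothetical, and the tail probability uses the Wilson-Hilferty normal approximation so that only the standard library is needed. If SciPy is available, scipy.stats.chi2.sf(x, df) gives the exact value.

```python
import math


def chi2_sf_wilson_hilferty(x, df):
    """Approximate P(chi-square >= x) for df degrees of freedom.

    Uses the Wilson-Hilferty cube-root normal approximation, which is
    an approximation and not the exact chi-square tail probability.
    """
    mean = 1.0 - 2.0 / (9.0 * df)
    sd = math.sqrt(2.0 / (9.0 * df))
    z = ((x / df) ** (1.0 / 3.0) - mean) / sd
    # Upper tail of the standard normal distribution at z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def lack_of_fit(nll_fitted, nll_saturated, df_lack_of_fit):
    """Return (chi_square, p_value) from the two -LogLikelihood values."""
    # Assumption: ChiSquare = 2 * (Fitted - Saturated), as the tables suggest.
    chi_square = 2.0 * (nll_fitted - nll_saturated)
    p_value = chi2_sf_wilson_hilferty(chi_square, df_lack_of_fit)
    return chi_square, p_value


# First analysis: continuous factor kept (values copied from the first table).
chi_cont, p_cont = lack_of_fit(4821.4712, 387.7085, 1.85e7)

# Second analysis: continuous factor thresholded (values from the second table).
chi_bin, p_bin = lack_of_fit(4865.7753, 4778.6278, 120)

for label, chi, df, p in [
    ("continuous", chi_cont, 1.85e7, p_cont),
    ("binary", chi_bin, 120, p_bin),
]:
    print(f"{label}: ChiSquare={chi:.4f}  DF={df:g}  "
          f"ChiSquare/DF={chi / df:.5f}  Prob>ChiSq={p:.4f}")
```

The ChiSquare/DF ratio in the output is a quick way to read the result: a ratio far below 1, as in the first analysis, goes with a p-value near 1, and a ratio well above 1, as in the second analysis, goes with a small p-value.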