I have simulated three data sets of 20M points each to test logistic regression (LR) with a binary response variable and very rare events:
1) Two continuous effects: Beta0 = 12, Beta1 = 1 and Beta2 = 1. X1 is Normal(-0.8, 0.25) and X2 is -ABS[Normal(0, 5)]. (Each of the two variables increases the probability of the event, depending on the value of the variable.) In this case there are about 711,000 positive events in the 20M samples. Running LR gives fairly good estimates of the three betas: b0 = 12.003, b1 = 1.005, b2 = 0.99996. RSquare(U) = 0.623, AICc = 2,314,760 and BIC = 2,314,804. The ROC AUC = 0.97835.
2) One continuous and one categorical effect. For this data set, I replace the second continuous variable of the first data set with a binary variable that occurs with probability 0.05 and has Beta2 = -2. This data set contains just 394 events in the 20M samples. Running LR gives the estimates b0 = 11.0988, b1 = 1.184, b2 = -1.01. RSquare(U) = 0.0283, AICc = 9,068 and BIC = 9,112. The ROC AUC = 0.68. So, while the Beta0 and Beta1 estimates are reasonable, the estimate of Beta2 for the nominal variable appears to be half the true value.
3) Two categorical effects, both generated as binomial binary variables. I replace one continuous variable of the first data set with a binary variable that occurs with probability 0.05 and has Beta1 = -2, and the second continuous variable with another binary variable that occurs with probability 0.1 and has Beta2 = -1. This data set contains just 165 events in the 20M samples. Running LR gives the estimates b0 = 10.582, b1 = -1.14, b2 = -0.512. RSquare(U) = 0.0384, AICc = 4,038 and BIC = 4,082. The ROC AUC = 0.6756. So, while the Beta0 estimate is reasonable, the estimates of Beta1 and Beta2 for both nominal variables appear to be half the true values.
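For anyone who wants to reproduce the first data set outside JMP, here is a minimal Python sketch. It rests on assumptions that are not stated above: the second argument of each Normal is treated as a standard deviation, and the linear predictor is treated as the log-odds of the non-event, which is what would make events rare with a positive Beta0. The function name `simulate_two_continuous` and the seed are hypothetical.

```python
import numpy as np

def simulate_two_continuous(n, beta0=12.0, beta1=1.0, beta2=1.0, seed=0):
    """Simulate data set 1: two continuous effects and a rare binary event."""
    rng = np.random.default_rng(seed)
    # ASSUMPTION: 0.25 and 5 are standard deviations, not variances.
    x1 = rng.normal(-0.8, 0.25, size=n)
    x2 = -np.abs(rng.normal(0.0, 5.0, size=n))
    eta = beta0 + beta1 * x1 + beta2 * x2
    # ASSUMPTION: eta is the log-odds of the NON-event, so
    # P(event) = 1 / (1 + exp(eta)); smaller x1 or x2 makes the event more likely.
    p_event = 1.0 / (1.0 + np.exp(eta))
    y = (rng.random(n) < p_event).astype(int)
    return x1, x2, y

# Smaller n than the 20M used above, just to inspect the event rate.
x1, x2, y = simulate_two_continuous(200_000)
print(f"event rate: {y.mean():.4f}")
```

Under these assumptions the event rate comes out at a few percent, the same order as the roughly 711,000 events in 20M samples reported for the first data set.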
What am I missing here? Why should the estimates of coefficients of categorical variables be half their true values?
My next tests will under-sample the non-events and then apply the under-sampling correction to Beta0. But first I would like to understand the estimates generated by LR in JMP. Any suggestions or clarification would be greatly appreciated.
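For reference, a minimal sketch of the kind of intercept correction meant by the under-sampling step, assuming the standard offset for case-control style sampling: if events are kept with rate r1 and non-events with rate r0, the sampled log-odds of the event are shifted by ln(r1/r0), so that amount is subtracted from the fitted intercept. The function name `correct_intercept` is hypothetical, and the sign is written for an intercept on the log-odds of the event; it flips if the software models the non-event level.

```python
import math

def correct_intercept(b0_sampled, keep_rate_events, keep_rate_nonevents):
    """Undo the intercept shift caused by sampling events and non-events
    at different rates (intercept on the log-odds of the event)."""
    return b0_sampled - math.log(keep_rate_events / keep_rate_nonevents)

# Illustration: keep every event and 1% of the non-events.
p_true = 1e-5                                   # hypothetical true event probability
b0_true = math.log(p_true / (1.0 - p_true))     # intercept-only log-odds
b0_sampled = b0_true + math.log(1.0 / 0.01)     # what the sub-sample would imply
print(correct_intercept(b0_sampled, 1.0, 0.01), b0_true)
```

Only the intercept changes under this correction; the slope coefficients from the under-sampled fit are used as they are.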