I have simulated three data sets of 20M points each for testing logistic regression (LR) with binary response variables with very rare events:
1) Two continuous effects: Beta0 = 12, Beta1 = 1 and Beta2 = 1. X1 is Normal(-0.8, 0.25) and X2 is -ABS[Normal(0, 5)]. (Each of the two variables increases the probability of the event, depending upon the value of the variable.) In this case there are about 711,000 positive events in the 20M samples. Upon running LR, I get fairly good estimates of the three betas: b0 = 12.003, b1 = 1.005, b2 = 0.99996. RSquare(U) = 0.623, AICc = 2,314,760 and BIC = 2,314,804. The ROC AUC = 0.97835.
2) The second data set has one continuous and one categorical effect. For this data, I replace the second continuous variable in the first data set with a binary variable occurring with probability 0.05 and a Beta2 = -2. This data contains just 394 events in the total of 20M samples. Upon running LR, the estimates are: b0 = 11.0988, b1 = 1.184, b2 = -1.01. RSquare(U) = 0.0283, AICc = 9,068 and BIC = 9,112. The ROC AUC = 0.68. So, while the Beta0 and Beta1 estimates are reasonable, the estimate of Beta2 for the nominal variable appears to be half the true value.
3) To generate the third data set I use binomially generated binary variables for both effects. I replace one continuous variable in the first data set with a binary variable occurring with probability 0.05 and a Beta1 = -2. I replace the second continuous variable with another binomially generated random variable with probability 0.1 and Beta2 = -1. This data contains just 165 events in the total of 20M samples. Upon running LR, the estimated parameter values are: b0 = 10.582, b1 = -1.14, b2 = -0.512. RSquare(U) = 0.0384, AICc = 4,038 and BIC = 4,082. The ROC AUC = 0.6756. So, while the Beta0 estimate is reasonable, the estimates of Beta1 and Beta2 for both nominal variables appear to be half the true values.
What am I missing here? Why should the estimates of coefficients of categorical variables be half their true values?
My next tests are going to be with under-sampling the non-events and then applying the under-sampling correction to the Beta0. But, first I would like to understand the estimates generated by LR in JMP. Any suggestions or clarification would be greatly appreciated.
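For the under-sampling step, here is a minimal sketch of the usual intercept correction, assuming all events are kept and non-events are kept at random with a known fraction. The function name `corrected_intercept` and its parameters are hypothetical and not part of JMP. It assumes the slopes are unaffected by this kind of sampling and only the intercept shifts by the log of the keep fraction. The sign of the shift depends on whether the linear predictor is the log-odds of the event, or of the non-event as in an expression of the form 1/(1+exp(eta)).

```python
import math

def corrected_intercept(beta0_sampled, keep_fraction, models_event=True):
    """Undo the intercept shift caused by under-sampling non-events.

    Assumes every event is kept and each non-event is kept with
    probability keep_fraction (an assumption of this sketch).
    models_event=True  : P(event) = 1 / (1 + exp(-eta))
    models_event=False : P(event) = 1 / (1 + exp(+eta))
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Under-sampling multiplies the odds of the event by 1/keep_fraction,
    # so the sampled log-odds of the event are too high by -log(keep_fraction).
    shift = math.log(keep_fraction)
    return beta0_sampled + shift if models_event else beta0_sampled - shift

# Hypothetical usage: 1% of non-events kept, intercept fitted on the sample.
print(corrected_intercept(-7.0, 0.01))
print(corrected_intercept(7.0, 0.01, models_event=False))
```

The two calls describe the same fitted model written in the two sign conventions, so the corrected intercepts are negatives of each other.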
I'm not as familiar with JMP, but here are some reasons why this happens in SAS:
1. Check how you coded your binary variable; 0/1 coding is different from 1/2 coding.
2. If you used some sort of automatic categories, check how the variable is parametrized, i.e., effect coding or reference coding.
I code the nominal variable as either 0 or the Beta value, and then use these values (whatever they happen to be at each entry) in the P(event) calculation with the logistic expression (1/(1+exp(beta0+Beta1+Beta2))). I assumed that the entry in the effects column would be used as just a 'label' when that column is designated as a Nominal Effect. Are you suggesting that the value entered in the column is used somehow, given that 0/1 versus 1/2 makes a difference?
Code it as 0/1 and don't declare it as a nominal effect, to see if you get the desired results.
See the link above to how JMP categorizes nominal variables.
I coded the nominal variables as 0/1 as you recommend and then did not use them, and applied the appropriate betas in the P calculation. Now the estimates of the three betas are: 10.52, 0.9848 and 0.6534. So, the estimates of Beta1 and Beta2, which correspond to the nominal variables, still appear to be about half the true values, but the sign is now opposite to what it should be. The true Beta1 of -2 gets estimated as 0.9848 and the true Beta2 of -1 gets estimated as 0.6534. I am attaching a short version of the file if it helps. (If you do have time to run the LR in JMP, the file will have to be extended quite a bit to get enough events. I use a file of 20M rows.)
The link you sent applies to Linear Model. I am working with Logistic Regression.
How it codes the categorical factor is the issue, not the type of model. Note that there are -1 values in the list.
Try the 1/0 coding.
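To make the coding point concrete, here is a small sketch of the arithmetic, assuming the software uses -1/+1 effect coding for a two-level nominal factor (which level receives +1 is an assumption here, and the function `effect_coded` is hypothetical, not a JMP function). A 0/1 predictor x and its -1/+1 recoding c are related by x = (1 + c)/2 or x = (1 - c)/2, so each slope is halved, half of each slope moves into the intercept, and the sign of the slope depends on which level is coded +1.

```python
def effect_coded(beta0, betas, plus_one_level=1):
    """Map coefficients for 0/1 predictors to the equivalent -1/+1 coding.

    plus_one_level is the raw level (0 or 1) that is coded +1;
    the other level is coded -1. This choice is an assumption.
    """
    sign = 1.0 if plus_one_level == 1 else -1.0
    # x = (1 + sign*c) / 2, so beta*x = beta/2 + (sign*beta/2)*c
    intercept = beta0 + sum(b / 2.0 for b in betas)
    slopes = [sign * b / 2.0 for b in betas]
    return intercept, slopes

# True values for the third data set in this thread: 12, -2, -1.
print(effect_coded(12.0, [-2.0, -1.0], plus_one_level=1))  # (10.5, [-1.0, -0.5])
print(effect_coded(12.0, [-2.0, -1.0], plus_one_level=0))  # (10.5, [1.0, 0.5])
```

Under these assumptions the first line is close to the reported estimates 10.582, -1.14 and -0.512, and the second is close to 10.52, 0.9848 and 0.6534 obtained after recoding, where the signs flipped. The same arithmetic for the second data set gives an intercept of 11 and a slope of -1 for the binary effect, against the reported 11.0988 and -1.01.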
email@example.com Apr 14, 2014 10:44 AM
I coded the nominal variables as 0/1 as you recommend and then did not use them, and applied the appropriate betas in the P calculation.
I don't know what that means...
You can always try contacting tech support, if you're not getting an answer on here. They have people more familiar with JMP.