Unequal sampling ratios in binary logistic regression
Nov 2, 2016 7:55 AM
I recently had a series of binary logistic regressions reviewed, and the reviewer told me that my results were erroneous because of unequal sampling ratios in my input data. I have two responses, presence and absence. Because presence is rare overall and absence is prevalent, I set the number of absence records to be twice the number of presence records.

What I didn't realize was that the intercept term of the equation includes the log of the ratio of the input sample sizes, so each model is not just a function of the environmental pattern but also a function of the relative prevalence of each class, as expressed by the sampling ratio. Because my presence-to-absence ratio was 1:2, the reviewer said I needed to correct the intercept of each model by adding the log of that ratio: ln(n2/n1) = ln(2/1) = ln(2) = 0.69315. Without this correction, my models are apparently biased toward predicting absence rather than presence.

Does JMP really produce erroneous results by not accounting for this class ratio? I am not a statistician and cannot independently confirm this, but I wanted to point it out to see if anyone has further input. Cheers.
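To make the arithmetic concrete, here is a minimal Python sketch of the correction as the reviewer described it. The function names and the example intercept of -1.0 are hypothetical and only for illustration; they are not JMP output. The second function is the usual prior correction for outcome-dependent sampling, which I am assuming is what the reviewer had in mind: it needs a target prevalence, and it reduces to adding ln(2) only when that target is taken to be 50%.

```python
import math

def sampling_offset(n_presence, n_absence):
    """Log of the absence-to-presence sampling ratio, ln(n2/n1)."""
    return math.log(n_absence / n_presence)

def prior_corrected_intercept(b0, sample_prev, target_prev):
    """Shift a fitted intercept from the sample prevalence to an assumed
    target prevalence. Only the intercept is adjusted here; the slope
    coefficients are left as fitted.
    """
    sample_odds = sample_prev / (1.0 - sample_prev)
    target_odds = target_prev / (1.0 - target_prev)
    return b0 - math.log(sample_odds / target_odds)

# The 1:2 presence-to-absence design from the post.
offset = sampling_offset(1, 2)
print(round(offset, 5))  # 0.69315

# Hypothetical fitted intercept, for illustration only.
b0 = -1.0
# Sample prevalence is 1/3; the target of 0.5 is an assumption (equal classes).
b0_corrected = prior_corrected_intercept(b0, sample_prev=1 / 3, target_prev=0.5)
print(abs(b0_corrected - (b0 + offset)) < 1e-9)  # True
```

If the real-world prevalence of presence is something other than 50%, the same function can be called with that value as `target_prev`, and the shift will no longer be exactly ln(2).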