I’m copying this to the blog so I can assemble answers here. I expect this could be interesting.
I was struck today by the way the Internet has accelerated research. At one time, it might have taken a month or two to track down the articles on this problem and conclude I need to ask for advice. Now, however, I realize the need within hours.
Recall that the question that started our debate a few days ago concerned a logistic regression in which the OP noticed a mismatch between the predicted probability of success and the observed fraction of successes.
(R-sig-mixed, 2013-04-03, OP: Zack Steel, “Low intercept estimate in a binomial glmm”)
We were debating that, and it had completely slipped my mind that there is a separate literature on exactly that kind of problem. Yesterday, somebody else asked me to estimate a logit model in which there were more than 40000 cases but only a few hundred “successes”. That’s what reminded me of the “rare events” problem and logistic regression parameter estimate bias.
And I think that’s the issue that we need to clear up with glmer. What do you think? Since a multilevel model can be seen as penalized ML estimation (à la Pinheiro and Bates, or as explained in Simon Wood, Generalized Additive Models), are we able to get a bias-corrected variant?
Furthermore, could lme4’s predict method be made to produce “good” confidence intervals? That leads down a separate path to a huge hassle about competing ways to estimate CIs in glm and the possible need to apply extra corrections in some special cases. I’ll write down that problem and ask you about it later, if you help me understand this one.
Here’s my brief novel on what I’ve been Googling about for the past 10 hours or so. If it helps you, let me know. If you think I’m wrong, especially urgently let me know.
To the political science audience, that’s a “rare events” logistic regression problem. Our most heavily cited methods paper on it is:
King, G., & Zeng, L. (2001). Logistic Regression in Rare Events Data. Political Analysis, 9(2), 137–163.
In rare events data, the logistic parameter estimates (mainly the intercept) are biased, and so are the estimated probabilities. King & Zeng provided Stata code for a function “relogit” and later adapted it for R (package: Zelig). Zelig tries to re-organize the whole regression experience for the R user, and I didn’t want that, so I started looking into the various corrections to see if I could write an adapter that takes glm or glmer output and “bias corrects” it. It appears, superficially at least, that I only need to adjust the intercept estimate by a weighting factor, which would be super easy to do.
Quite by chance, I found this blog post by Paul Allison, and it’s really interesting!
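To make the “adjust the intercept” idea concrete, here is a minimal sketch of what I take to be King and Zeng’s prior correction: subtract ln[((1 − τ)/τ)(ȳ/(1 − ȳ))] from the estimated intercept, where ȳ is the fraction of successes in the sample and τ is the fraction in the population. It assumes the sample was selected on the outcome and that τ is known from outside the data. The function names and the numbers are hypothetical, chosen only for illustration; this is not the relogit or Zelig code.

```python
import math

def prior_corrected_intercept(b0_hat, ybar, tau):
    """Prior correction of a logit intercept (sketch, after King & Zeng 2001).

    b0_hat: intercept estimated from the outcome-selected sample
    ybar:   fraction of successes in the sample
    tau:    fraction of successes in the population (assumed known)
    """
    if not (0.0 < ybar < 1.0 and 0.0 < tau < 1.0):
        raise ValueError("ybar and tau must lie strictly between 0 and 1")
    return b0_hat - math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))

def inv_logit(x):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical numbers: a balanced sample (ybar = 0.5) drawn from a
# population in which only 1% of cases are successes (tau = 0.01).
b0 = prior_corrected_intercept(b0_hat=0.0, ybar=0.5, tau=0.01)
print(round(b0, 3))             # -4.595
print(round(inv_logit(b0), 3))  # 0.01
```

Only the intercept moves; the slope estimates are left alone. When the sample fraction already equals the population fraction (ȳ = τ), the correction is zero, so this adjustment by itself does nothing for a random sample with few successes.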
Logistic Regression for Rare Events (2012-02-13)
And, wow, is it subtle. Read it over a few times and see if you agree with me. In a kind way, he says the “rare events” business is a red herring, and that what we need instead is bias-corrected logistic regression estimates. The “prior correction of the intercept” discussed in King and Zeng is not the best approach. Instead, we should see this as a symptom of the more general problem that ML estimates are biased, and that the bias is greatest when there are few “successes”. Allison suggests an estimator proposed by David Firth, which uses penalized ML.
Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80, 27–38.
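For reference, this is the form in which I understand Firth’s penalty for the logistic model. The log-likelihood is penalized by half the log determinant of the Fisher information, which amounts to adding a term based on the hat values h_i to the usual score equations. The symbols (X for the design matrix, π_i for the fitted probabilities, W, h_i) are my notation, not taken from the sources above.

```latex
\ell^{*}(\beta) = \ell(\beta) + \tfrac{1}{2}\,\log\bigl|I(\beta)\bigr|,
\qquad I(\beta) = X^{\top} W X,
\qquad W = \operatorname{diag}\bigl(\pi_i(1-\pi_i)\bigr)

U^{*}_{r}(\beta) = \sum_{i=1}^{n}
  \Bigl[\, y_i - \pi_i + h_i\bigl(\tfrac{1}{2} - \pi_i\bigr) \Bigr]\, x_{ir} = 0,
\qquad h_i = \bigl[\, W^{1/2} X (X^{\top} W X)^{-1} X^{\top} W^{1/2} \bigr]_{ii}
```

Setting every h_i to zero recovers the ordinary ML score equations, so the penalty acts through the leverage of each observation.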
I don’t think King and Zeng disagree. They also propose an option to bias-correct the whole vector of coefficients, and that bias correction ends up addressing the more general problem. The Stata module for relogit (the version I found was dated 1999-10-28) says “Relogit for Stata does not yet support the FIRTH option”, but it does have an alternative weighting correction.
While fiddling around to see if I could implement that, I learned it has been done in R:
logistf: Firth’s bias reduced logistic regression
That is often discussed as a solution to the problem of separation, as on the UCLA stats website (http://www.ats.ucla.edu/stat/mult_pkg/faq/general/complete_separation_logit_models.htm)
Heinze, G., & Schemper, M. (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine, 21, 2409–2419.
But it is a two-fer, so far as I can tell. We get bias correction and separation-proofness.
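A toy case shows both halves of the two-fer. This is a minimal sketch, assuming an intercept-only logit model. There every hat value equals 1/n, so Firth’s modified score reduces to k − np + (1/2 − p) = 0, which gives p = (k + 1/2)/(n + 1) for k successes in n trials. The function names and the counts are hypothetical, and this is not how logistf is implemented; with covariates the estimate has to be found iteratively.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def ml_intercept(k, n):
    """ML logit intercept for k successes in n trials; infinite if k is 0 or n."""
    if k == 0:
        return -math.inf
    if k == n:
        return math.inf
    return logit(k / n)

def firth_intercept(k, n):
    """Firth-penalized logit intercept for the intercept-only model.

    Solves k - n*p + (1/2 - p) = 0, i.e. p = (k + 1/2) / (n + 1).
    """
    return logit((k + 0.5) / (n + 1.0))

# Hypothetical counts: no successes at all, a handful, and a
# "few hundred in 40000" case like the one described earlier.
for k, n in [(0, 50), (3, 50), (300, 40000)]:
    print(k, n, ml_intercept(k, n), round(firth_intercept(k, n), 4))
```

With zero successes the ML intercept runs off to minus infinity while the penalized one stays finite, which is the separation-proof part. With a few successes the penalized estimate sits slightly closer to zero than the ML one, and with hundreds of successes the two are nearly the same.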
Heinze, G., & Puhr, R. (2010). Bias-reduced and separation-proof conditional logistic regression with small or sparse data sets. Statistics in Medicine, 29(7-8), 770–777. doi:10.1002/sim.3794
Heinze and Ladner offer an R package “logistiX”, for “exact” logistic regression.
The part I don’t understand (yet) is how the bias correction links to mixed models. And that’s why I’m asking you.