#LyX 1.4.3 created this file. For more info see http://www.lyx.org/ \lyxformat 245 \begin_document \begin_header \textclass literate-article \begin_preamble \usepackage{latexsym} \usepackage{graphicx} \usepackage{psfig} \usepackage{color} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands. \usepackage{ragged2e} \RaggedRight \setlength{\parindent}{1 em} \end_preamble \language english \inputencoding latin1 \fontscheme times \graphics default \paperfontsize 12 \spacing single \papersize letterpaper \use_geometry true \use_amsmath 1 \cite_engine basic \use_bibtopic false \paperorientation portrait \leftmargin 1in \topmargin 1in \rightmargin 1in \bottommargin 1in \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \defskip medskip \quotes_language english \papercolumns 1 \papersides 1 \paperpagestyle default \tracking_changes false \output_changes true \end_header \begin_body \begin_layout Title Logistic Regression Introduction (day 1) \end_layout \begin_layout Author Paul Johnson \end_layout \begin_layout Standard This note deals with the problems that arise with OLS when the dependent variable is binary (aka dichotomous, a dummy variable, etc.) and with some solutions. \end_layout \begin_layout Section OLS Doesn't Cut It! \end_layout \begin_layout Standard \begin_inset LatexCommand \label{sec:sec1} \end_inset \end_layout \begin_layout Subsection \begin_inset Formula $y_{i}$ \end_inset is dichotomous \end_layout \begin_layout Standard Suppose \begin_inset Formula $y_{i}$ \end_inset is coded 0 and 1, representing answers to a Yes or No question. Until the mid 1970s, linear models were used to predict the value of \begin_inset Formula $y_{i}$ \end_inset . \end_layout \begin_layout Standard Let's make a graph of an example dataset. Here's some phony data. We need an input that is more or less continuous (for obvious reasons, right?). 
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Standard <<>>= \end_layout \begin_layout Standard x <- 50 + 10 * rnorm(50) \end_layout \begin_layout Standard \end_layout \begin_layout Standard y <- rep(c(0,1),25) \end_layout \begin_layout Standard \end_layout \begin_layout Standard @ \end_layout \end_inset \end_layout \begin_layout Standard Consider Figure \begin_inset LatexCommand \ref{dichplot1} \end_inset , which is obtained from the plot command in R: \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Standard <>= \end_layout \begin_layout Standard plot(x,y,ylim=c(-.2,1.2),xlab="Normally distributed input",ylab="dichotomous output",type="n") \end_layout \begin_layout Standard \end_layout \begin_layout Standard points(x,y,pch=16,cex=0.3) \end_layout \begin_layout Standard \end_layout \begin_layout Standard @ \end_layout \begin_layout Standard \end_layout \begin_layout Standard \end_layout \end_inset \end_layout \begin_layout Standard \begin_inset Float figure wide false sideways false status open \begin_layout Caption Plot a dichotomous Y if you want to \end_layout \begin_layout Standard \begin_inset LatexCommand \label{dichplot1} \end_inset \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Standard <>= \end_layout \begin_layout Standard <> \end_layout \begin_layout Standard \end_layout \begin_layout Standard @ \end_layout \end_inset \end_layout \end_inset \end_layout \begin_layout Standard \begin_inset VSpace medskip \end_inset Go ahead, meditate on the prospect of drawing a straight line through Figure \begin_inset LatexCommand \ref{dichplot1} \end_inset . \begin_inset VSpace medskip \end_inset \end_layout \begin_layout Standard The more I teach regression classes, the more peculiar this next step seems to me. \end_layout \begin_layout Standard But all the famous books do it, so go ahead. 
Suppose you have the regular old regression model: \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} y_{i}=b_{0}+b_{1}X_{i}+e_{i}\label{ols1}\end{equation} \end_inset \end_layout \begin_layout Standard The predicted value might be interpreted as the probability of \begin_inset Formula $y_{i}=1$ \end_inset . Hence, equation \begin_inset LatexCommand \ref{ols1} \end_inset is often called the \series bold linear probability model \series default . \end_layout \begin_layout Standard In case you have not tried it yet, R can give output in LaTeX format. I am most familiar with the results from the xtable library (but the Design library has a latex method for its models as well). If you run this code interactively, it creates a 'dump' in LaTeX format that you can cut and paste into a LaTeX document. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Standard <>= \end_layout \begin_layout Standard modl1 <- lm(y~x) \end_layout \begin_layout Standard library(xtable) \end_layout \begin_layout Standard xtable(modl1,type="tex",caption="A Regression With Pretend Data",label="regTab1" ) \end_layout \begin_layout Standard @ \end_layout \end_inset \end_layout \begin_layout Standard The R output looks like this: \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Standard <>= \end_layout \begin_layout Standard <> \end_layout \begin_layout Standard \end_layout \begin_layout Standard @ \end_layout \end_inset \end_layout \begin_layout Standard In a LaTeX document, the table is a \begin_inset Quotes eld \end_inset floating \begin_inset Quotes erd \end_inset object, which appears as Table \begin_inset ERT status collapsed \begin_layout Standard \backslash ref{regTab1} \end_layout \end_inset . \end_layout \begin_layout Standard This does not produce a \begin_inset Quotes eld \end_inset final, publishable \begin_inset Quotes erd \end_inset table, but it does show you how a LaTeX table is constructed. 
A final table should include the diagnostic information, such as \begin_inset Formula $R^{2}$ \end_inset and the \begin_inset Quotes eld \end_inset residual standard error \begin_inset Quotes erd \end_inset (same as \begin_inset Quotes eld \end_inset root mean squared error \begin_inset Quotes erd \end_inset ). But we can make do, editing the output to suit our needs. \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Standard <>= \end_layout \begin_layout Standard \end_layout \begin_layout Standard <> \end_layout \begin_layout Standard \end_layout \begin_layout Standard @ \end_layout \end_inset \end_layout \begin_layout Standard In 2005, I wrote an R function \begin_inset Quotes eld \end_inset outreg \begin_inset Quotes erd \end_inset that produces nice output for linear models. It works for anything that falls within the generalized linear model family. Outreg can produce model tables in either a \begin_inset Quotes eld \end_inset wide \begin_inset Quotes erd \end_inset format, with one column for the coefficient estimates and one for the standard errors, or a \begin_inset Quotes eld \end_inset tight \begin_inset Quotes erd \end_inset format, with one column that interleaves the coefficients and standard errors. \end_layout \begin_layout Standard In Figure \begin_inset LatexCommand \ref{dichplot2} \end_inset , the regression line is added with this code: 
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Standard <>= \end_layout \begin_layout Standard \end_layout \begin_layout Standard plot(x,y,ylim=c(-.2,1.2),xlab="Normally distributed input",ylab="dichotomous output",type="n") \end_layout \begin_layout Standard \end_layout \begin_layout Standard points(x,y,pch=16,cex=0.3) \end_layout \begin_layout Standard \end_layout \begin_layout Standard abline(modl1) \end_layout \begin_layout Standard \end_layout \begin_layout Standard @ \end_layout \end_inset \end_layout \begin_layout Standard \begin_inset Float figure wide false sideways false status open \begin_layout Caption Plot an OLS fit to a dichotomous Y, if you want to \end_layout \begin_layout Standard \begin_inset LatexCommand \label{dichplot2} \end_inset \begin_inset ERT status collapsed \begin_layout Standard \backslash centering \end_layout \begin_layout Standard \end_layout \end_inset \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Standard <>= \end_layout \begin_layout Standard \end_layout \begin_layout Standard <> \end_layout \begin_layout Standard \end_layout \begin_layout Standard @ \end_layout \end_inset \end_layout \end_inset \end_layout \begin_layout Subsection What's wrong with that model? \end_layout \begin_layout Subsubsection OLS predicts out of range. \end_layout \begin_layout Standard As long as \begin_inset Formula $b_{1}\ne0$ \end_inset , the line representing predicted values will go above 1 and below 0 once \begin_inset Formula $X_{i}$ \end_inset is extreme enough. You might constrain the predictions so that they can't go below 0 or above 1. This is a constrained linear probability model. It has all kinds of unattractive features, including 1) sharp kinks in the predicted value curve, and 2) the method does not take the constraints into account in the estimation process. \end_layout \begin_layout Standard This is evident if we construct a more extreme example. 
\end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Standard <<>>= \end_layout \begin_layout Standard x <- seq(0,150,length=200) \end_layout \begin_layout Standard expbx <- exp((-1)*(-10+.115*x)) \end_layout \begin_layout Standard ProbY1 <- 1/(1+expbx) \end_layout \begin_layout Standard y <- rbinom(200, 1, ProbY1) \end_layout \begin_layout Standard @ \end_layout \end_inset \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Standard <>= \end_layout \begin_layout Standard modl2 <- lm(y~x) \end_layout \begin_layout Standard @ \end_layout \end_inset \end_layout \begin_layout Standard \begin_inset Float figure wide false sideways false status open \begin_layout Caption OLS fitted values out of range \end_layout \begin_layout Standard \begin_inset LatexCommand \label{dichplot3} \end_inset \begin_inset ERT status collapsed \begin_layout Standard \backslash centering \end_layout \begin_layout Standard \end_layout \end_inset \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Standard <>= \end_layout \begin_layout Standard plot(x, y, ylim=c(-0.3,1.5),type="n"); \end_layout \begin_layout Standard \end_layout \begin_layout Standard points(x,y,cex=0.3,pch=16) \end_layout \begin_layout Standard \end_layout \begin_layout Standard abline(modl2); \end_layout \begin_layout Standard \end_layout \begin_layout Standard lines(c(0,150),c(0,0),lty=c(2)); lines(c(0,150),c(1,1),lty=c(2)); \end_layout \begin_layout Standard \end_layout \begin_layout Standard @ \end_layout \end_inset \end_layout \end_inset \end_layout \begin_layout Standard Take a gander at Figure \begin_inset LatexCommand \ref{dichplot3} \end_inset . What do you say about the bottom left and top right? \end_layout \begin_layout Subsubsection Guaranteed heteroskedasticity. 
\end_layout \begin_layout Standard Recall that with OLS the assumption that \begin_inset Formula $E(e_{i})=0$ \end_inset is vital--we can't live without it. Requiring \begin_inset Formula $E(e_{i})=0$ \end_inset makes for some funny results, not the least of which is heteroskedasticity. \end_layout \begin_layout Standard For any given value of \begin_inset Formula $X_{i}$ \end_inset , note that the error term has to make up the difference between \begin_inset Formula $b_{0}+b_{1}\cdot X_{i}$ \end_inset and the true value, 1 or 0. (Make a picture here). Hence the error term must be either \begin_inset Formula $1-b_{0}-b_{1}\cdot X_{i}$ \end_inset or \begin_inset Formula $-b_{0}-b_{1}\cdot X_{i}$ \end_inset . \end_layout \begin_layout Standard Let \begin_inset Formula $P_{i}$ \end_inset be the probability that \begin_inset Formula $y_{i}$ \end_inset is 1. That is, \begin_inset Formula $P_{i}=b_{0}+b_{1}\cdot X_{i}$ \end_inset . Note this is not the predicted value from an estimated equation--it is the probability from the equation with the true values of \begin_inset Formula $b_{0}$ \end_inset and \begin_inset Formula $b_{1}$ \end_inset inserted. \end_layout \begin_layout Standard To repeat the story in the previous paragraph, if \begin_inset Formula $y_{i}=1$ \end_inset , which happens with probability \begin_inset Formula $P_{i}$ \end_inset , then the error term must be \begin_inset Formula $1-P_{i}$ \end_inset , because this amount goes from \begin_inset Formula $P_{i}$ \end_inset up to 1. Similarly, the probability that \begin_inset Formula $y_{i}=0$ \end_inset is \begin_inset Formula $(1-P_{i})$ \end_inset . And, if \begin_inset Formula $y_{i}=0$ \end_inset , that must mean the error term is \begin_inset Formula $-P_{i}$ \end_inset , because that is the amount you have to take away from \begin_inset Formula $P_{i}$ \end_inset to get down to zero. 
Hence, for a particular value of \begin_inset Formula $X_{i}$ \end_inset , the expected value of the error term must be: \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} E[e_{i}]=P_{i}(1-P_{i})+(1-P_{i})(-P_{i})=0\label{eqn}\end{equation} \end_inset \end_layout \begin_layout Standard By the same logic, the variance of the error term is \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} \begin{array}{c} E(e_{i}^{2})=(1-P_{i})P_{i}^{2}+P_{i}(1-P_{i})^{2}\\ =P_{i}(1-P_{i})\\ =(b_{0}+b_{1}X_{i})(1-b_{0}-b_{1}X_{i})\end{array}\end{equation} \end_inset \end_layout \begin_layout Standard As you can plainly see, the variance of the error term depends on \begin_inset Formula $X_{i}$ \end_inset . There is heteroskedasticity by definition. You might treat this with WLS, but in small samples that is not a very desirable proposition. \end_layout \begin_layout Subsubsection The question of functional form. \end_layout \begin_layout Standard Wouldn't it be more elegant to assume that the probability that \begin_inset Formula $y_{i}$ \end_inset is 1 is always between 0 and 1 and changes gradually as \begin_inset Formula $X_{i}$ \end_inset changes? The logit and probit models are pleasant alternatives. \end_layout \begin_layout Section The logit alternative (approach #1) \end_layout \begin_layout Standard \begin_inset LatexCommand \label{sec:logit1} \end_inset \end_layout \begin_layout Subsection Logistic curve is a particular "S-shaped" curve \end_layout \begin_layout Standard What is the logit model? An S-shaped curve. It is assumed the probability of \begin_inset Formula $y_{i}$ \end_inset being \begin_inset Formula $1$ \end_inset is given by this formula: \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} P(y_{i}=1|X_{i},b)=\frac{1}{1+e^{-(b_{0}+b_{1}X_{i})}}\label{logit1}\end{equation} \end_inset \end_layout \begin_layout Standard You can use R to make some illustrations. 
For example, the formula with \begin_inset Formula $b_{0}=-10$ \end_inset and \begin_inset Formula $b_{1}=0.115$ \end_inset is shown in Figure \begin_inset LatexCommand \ref{fig:An-S-shaped-Curve} \end_inset . \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Standard <>= \end_layout \begin_layout Standard X2 <- seq(0,150,length=200) \end_layout \begin_layout Standard expbx <- exp((-1)*(-10+.115*X2)) \end_layout \begin_layout Standard ProbY1 <- 1/(1+expbx) \end_layout \begin_layout Standard @ \end_layout \end_inset \end_layout \begin_layout Standard \begin_inset Float figure wide false sideways false status open \begin_layout Caption An S-shaped Curve \begin_inset LatexCommand \label{fig:An-S-shaped-Curve} \end_inset \end_layout \begin_layout Standard \end_layout \begin_layout Standard \begin_inset ERT status open \begin_layout Standard <>= \end_layout \begin_layout Standard plot(X2,ProbY1,type="l",xlab='X',ylab='y') \end_layout \begin_layout Standard @ \end_layout \end_inset \end_layout \end_inset \end_layout \begin_layout Subsection Notes about this: \end_layout \begin_layout Standard Here are some highlights. \end_layout \begin_layout Enumerate There does not seem to be an "error term." That's a really subtle problem. \end_layout \begin_layout Enumerate It has a pleasant, symmetric S shape. \end_layout \begin_deeper \begin_layout Standard There's an artistic generalization for you. Try to think of some ways in which it is beautiful! \end_layout \end_deeper \begin_layout Enumerate The slope--the rate of change in probability as \begin_inset Formula $X_{i}$ \end_inset increases-- is \begin_inset Formula $b_{1}\cdot P_{i}\cdot(1-P_{i})$ \end_inset . Hence, the effect of a unit change in \begin_inset Formula $X_{i}$ \end_inset depends on the probability. If \begin_inset Formula $y_{i}$ \end_inset is very likely to be a 1 or a 0, a change in \begin_inset Formula $X_{i}$ \end_inset doesn't make much difference. 
\end_layout \begin_layout Enumerate The "odds" that \begin_inset Formula $y_{i}=1$ \end_inset are \begin_inset Formula $\frac{P_{i}}{(1-P_{i})}$ \end_inset . It can be shown that the log of the odds is a linear function of \begin_inset Formula $X_{i}$ \end_inset : \end_layout \begin_deeper \begin_layout Standard \begin_inset Formula \begin{equation} \ln\left[\frac{P_{i}}{1-P_{i}}\right]=b_{0}+b_{1}\cdot X_{i}\end{equation} \end_inset \end_layout \end_deeper \begin_layout Subsection Estimation \end_layout \begin_layout Standard How are logit models estimated? This one is a straightforward exercise in maximum likelihood. \end_layout \begin_layout Standard What is the likelihood that we would observe a given sample of 0's and 1's? Put the observations with 0's first and then the 1's. The first critical assumption is that the observations are statistically independent, meaning the probability of the sample equals the individual probabilities multiplied together. Hence, \end_layout \begin_layout Standard \begin_inset Formula \begin{eqnarray} P(y_{1}=0,y_{2}=0,...,y_{m}=0,y_{m+1}=1,y_{m+2}=1,...,y_{N}=1)\\ =P(y_{1}=0)P(y_{2}=0)\cdots P(y_{m}=0)P(y_{m+1}=1)P(y_{m+2}=1)\cdots P(y_{N}=1)\end{eqnarray} \end_inset \end_layout \begin_layout Standard This expression is the likelihood function, L, and since the probabilities depend on parameters \begin_inset Formula $b_{0}$ \end_inset and \begin_inset Formula $b_{1}$ \end_inset , we might as well write \begin_inset Formula $L(b_{0},b_{1})$ \end_inset . \end_layout \begin_layout Standard Remembering that the probability that \begin_inset Formula $y_{i}=0$ \end_inset is \begin_inset Formula $1$ \end_inset minus the probability that \begin_inset Formula $y_{i}=1$ \end_inset , we can write \begin_inset Formula \begin{eqnarray} L(b_{0},b_{1})=(1-P(y_{1}=1))(1-P(y_{2}=1))\cdots(1-P(y_{m}=1))\nonumber \\ \times P(y_{m+1}=1)P(y_{m+2}=1)\cdots P(y_{N}=1)\end{eqnarray} \end_inset \end_layout \begin_layout Standard This notation can be made a little more compact. 
It is not necessary to keep writing down the \begin_inset Formula $P(y_{i}=1)$ \end_inset over and over again. Instead, save a little time and effort by writing \begin_inset Formula $P_{i}$ \end_inset for this. \end_layout \begin_layout Standard The likelihood function is an impossibly complicated formula because it is composed of numbers that are multiplied together. The multiplication means that none of the components are separable. In contrast, if we work with logarithms, then the product is converted to a sum. Because the logarithm is a strictly increasing function, the same values of \begin_inset Formula $b_{0}$ \end_inset and \begin_inset Formula $b_{1}$ \end_inset maximize both \begin_inset Formula $L$ \end_inset and the log of \begin_inset Formula $L$ \end_inset . \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} \ln L(b_{0},b_{1})=\ln(1-P_{1})+\ln(1-P_{2})+\cdots+\ln(1-P_{m})+\ln(P_{m+1})+\cdots+\ln(P_{N})\end{equation} \end_inset (substitute the logistic formula \begin_inset Formula $P_{i}=\frac{1}{1+e^{-(b_{0}+b_{1}X_{i})}}$ \end_inset into this expression to get an idea of where the \begin_inset Formula $b_{0}$ \end_inset and \begin_inset Formula $b_{1}$ \end_inset fit in.) \end_layout \begin_layout Standard MLE, short for Maximum Likelihood Estimate, is the choice of estimates \begin_inset Formula $b_{0}$ \end_inset , \begin_inset Formula $b_{1}$ \end_inset that maximizes the log of the likelihood function. This solution is also a maximizer of L. \end_layout \begin_layout Subsection Quick summary of MLE properties. \end_layout \begin_layout Enumerate NOT unbiased, in general. \end_layout \begin_layout Enumerate MLE's are consistent, asymptotically efficient, asymptotically Normal. The asymptotic normality implies that we can conduct approximate t-tests, as long as we can get estimates of the standard errors of \begin_inset Formula $b_{0}$ \end_inset and \begin_inset Formula $b_{1}$ \end_inset . \end_layout \begin_layout Enumerate There are ways to calculate asymptotic standard errors. They tell you, approximately, if you had an infinite sample, what the standard error of the estimates would be. 
They are sometimes called approximate standard errors because you never have an infinite sample. \end_layout \begin_layout Enumerate Maximum Likelihood Estimation allows an equivalent of the F test, the likelihood ratio test. \end_layout \begin_deeper \begin_layout Standard A. Let \begin_inset Formula $L_{0}$ \end_inset be the value of the likelihood function in which the "slope" coefficient \begin_inset Formula $b_{1}$ \end_inset (or other coefficients if they are in the model) is constrained to be 0. Hence, \begin_inset Formula $L_{0}$ \end_inset is the maximized likelihood when only a constant, \begin_inset Formula $b_{0}$ \end_inset , is estimated. \end_layout \begin_layout Standard B. Let \begin_inset Formula $L_{max}$ \end_inset be the value of the likelihood function at its maximum, when all coefficients, the slope and the intercept, are estimated to maximize the likelihood. \end_layout \begin_layout Standard C. Let \begin_inset Formula $\lambda$ \end_inset , (Greek \begin_inset Quotes eld \end_inset lambda \begin_inset Quotes erd \end_inset ), be the ratio of \begin_inset Formula $L_{0}$ \end_inset to \begin_inset Formula $L_{max}$ \end_inset : \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} \lambda\quad=\frac{L_{0}}{L_{max}}\end{equation} \end_inset \end_layout \begin_layout Standard D. It can be shown that, in large samples, \begin_inset Formula $-2\cdot\ln(\lambda)$ \end_inset has approximately a \begin_inset Formula $\chi^{2}$ \end_inset distribution with \begin_inset Formula $k$ \end_inset degrees of freedom, where \begin_inset Formula $k$ \end_inset is the number of \begin_inset Quotes eld \end_inset slope \begin_inset Quotes erd \end_inset coefficients you estimated (equivalently, the difference in the number of coefficients estimated in calculating \begin_inset Formula $L_{0}$ \end_inset versus \begin_inset Formula $L_{max}$ \end_inset ). 
\end_layout \end_deeper \begin_layout Section Cumulative Probability Interpretation (approach #2) \end_layout \begin_layout Standard In the dichotomous case, one simply has a predictive statement that says \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} y_{i}=\left\{ \begin{array}{ll} 1 & \textrm{if }Z_{i}=b_{0}+b_{1}X_{i}-e_{i}>0\\ 0 & \textrm{if }Z_{i}=b_{0}+b_{1}X_{i}-e_{i}\leq0\end{array}\right.\label{eq:ydefined}\end{equation} \end_inset \newline Think of a world in which there is an underlying variable, say \begin_inset Formula $Z_{i}$ \end_inset . If this underlying variable exceeds a threshold of \begin_inset Formula $0$ \end_inset , then \begin_inset Formula $y_{i}=1$ \end_inset . If \begin_inset Formula $Z_{i}$ \end_inset is less than or equal to 0, then \begin_inset Formula $y_{i}$ \end_inset takes on the value of 0. If you let \begin_inset Formula $Z_{i}$ \end_inset equal the linear function above, the story is complete. \end_layout \begin_layout Standard And so the probability that \begin_inset Formula $y_{i}=1$ \end_inset is the same as the probability that \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} e_{i}