#LyX 1.4.3 created this file. For more info see http://www.lyx.org/ \lyxformat 245 \begin_document \begin_header \textclass literate-article \begin_preamble \usepackage{latexsym} \usepackage{graphicx} \usepackage{psfig} \usepackage{color} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands. \usepackage{ragged2e} \RaggedRight \setlength{\parindent}{1 em} \end_preamble \language english \inputencoding latin1 \fontscheme times \graphics default \paperfontsize 12 \spacing single \papersize letterpaper \use_geometry true \use_amsmath 1 \cite_engine basic \use_bibtopic false \paperorientation portrait \leftmargin 1in \topmargin 1in \rightmargin 1in \bottommargin 1in \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \defskip medskip \quotes_language english \papercolumns 1 \papersides 1 \paperpagestyle default \tracking_changes false \output_changes true \end_header \begin_body \begin_layout Title Logistic Regression Day 2: Diagnostics and Multi-Category Dependent Variables \end_layout \begin_layout Author Paul E. Johnson \end_layout \begin_layout Section You will find many different treatments in the literature \end_layout \begin_layout Standard \begin_inset LatexCommand \label{sec:sec1} \end_inset \end_layout \begin_layout Standard A key problem for young researchers is that they want to know \begin_inset Quotes eld \end_inset what is the right model \begin_inset Quotes erd \end_inset to use with their data. If someone suggests a logistic regression model, the student can consult many different books and conclude that the whole thing is a mess. \end_layout \begin_layout Standard What should be clear? 
\end_layout \begin_layout Enumerate OLS is dubious \end_layout \begin_layout Enumerate We are modeling probabilities \end_layout \begin_layout Enumerate There are many different ways to formalize this agenda \end_layout \begin_layout Section Levels of Measurement \end_layout \begin_layout Standard The problems with OLS can be seen differently in the light of the so-called levels of measurement from elementary research design. \end_layout \begin_layout Subsection Nominal, Ordinal, Interval \end_layout \begin_layout Standard \series bold Nominal variable \series default : all measurement information is preserved if one reassigns scores to groups of observations. \end_layout \begin_layout Standard If all observations with an assigned score of \begin_inset Formula $x1$ \end_inset are re-labeled as \begin_inset Formula $x2$ \end_inset , and all observations that were originally assigned \begin_inset Formula $x2$ \end_inset are re-labeled as \begin_inset Formula $x1$ \end_inset , no information is lost. \end_layout \begin_layout Standard Aside from confusion in public restrooms, it would make no difference if all \begin_inset Quotes eld \end_inset men \begin_inset Quotes erd \end_inset were somehow magically re-labeled as \begin_inset Quotes eld \end_inset women \begin_inset Quotes erd \end_inset , and vice versa. The differentiation of the scores is preserved by rescoring. \end_layout \begin_layout Standard In all models that treat nominal variables, it is important that the results of research must not depend on the particular numerical score that is assigned. If one codes men as \begin_inset Formula $100$ \end_inset and women as \begin_inset Formula $200$ \end_inset , the analysis should find the same conclusion as one in which men are coded \begin_inset Formula $-111$ \end_inset and women are coded \begin_inset Formula $555$ \end_inset . 
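\end_layout \begin_layout Standard A small worked case may make this concrete. It is only a sketch, and it assumes something not stated above: that the coded variable is the single predictor in a least squares fit. The symbols \begin_inset Formula $c_{m}$ \end_inset and \begin_inset Formula $c_{w}$ \end_inset (the two distinct codes) and \begin_inset Formula $\bar{y}_{m}$ \end_inset and \begin_inset Formula $\bar{y}_{w}$ \end_inset (the two group averages of the outcome) are introduced here only for illustration. With only two distinct predictor values, the fitted line passes through both group averages, so the slope is \end_layout \begin_layout Standard \begin_inset Formula \[ \hat{\beta}_{1}=\frac{\bar{y}_{w}-\bar{y}_{m}}{c_{w}-c_{m}}\] \end_inset \end_layout \begin_layout Standard Coding men as \begin_inset Formula $100$ \end_inset and women as \begin_inset Formula $200$ \end_inset divides the difference in averages by \begin_inset Formula $100$ \end_inset ; coding them as \begin_inset Formula $-111$ \end_inset and \begin_inset Formula $555$ \end_inset divides it by \begin_inset Formula $666$ \end_inset . In both cases the predicted gap between the groups, \begin_inset Formula $\hat{\beta}_{1}(c_{w}-c_{m})=\bar{y}_{w}-\bar{y}_{m}$ \end_inset , is the same, so the substantive conclusion does not depend on the codes. 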
\end_layout \begin_layout Standard \series bold Ordinal variable \series default : if 2 cases are observed and assigned scores \begin_inset Formula $x1<x2$ \end_inset , all measurement information is preserved by any rescoring that keeps the first score below the second. \end_layout \begin_layout Standard \series bold Interval variable \series default : the ordering of the scores and the relative sizes of the gaps between them are preserved only by a linear rescoring, that is, for any \begin_inset Formula $\beta>0$ \end_inset and any \begin_inset Formula $\alpha$ \end_inset . \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} \alpha+\beta*x1<\alpha+\beta*x2\label{eq:interval2}\end{equation} \end_inset \end_layout \begin_layout Standard Note that the gap between the scores is exactly proportional to \begin_inset Formula $\beta$ \end_inset . Originally, the gap between them is \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} x2-x1\label{eq:interval3}\end{equation} \end_inset \end_layout \begin_layout Standard and the difference between the new scores is \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} \{\alpha+\beta*x2\}-\{\alpha+\beta*x1\}=\beta\cdot(x2-x1)\label{eq:interval4}\end{equation} \end_inset \end_layout \begin_layout Standard Interval level is, of course, a very restrictive measurement. Only the interval preserving transformation is allowed. \end_layout \begin_layout Subsection OLS \end_layout \begin_layout Standard Think about ordinary least squares for a moment. We had to start with the theory that \begin_inset Formula \begin{equation} y_{i}=\beta_{0}+\beta_{1}x_{i}+e_{i}\label{eq:ols}\end{equation} \end_inset \end_layout \begin_layout Standard Notice that we give substantive meaning to \begin_inset Formula $\beta_{0}$ \end_inset and \begin_inset Formula $\beta_{1}$ \end_inset and we do so by interpreting them in light of the units of the dependent variable. \end_layout \begin_layout Standard One is, of course, free to re-scale \begin_inset Formula $y_{i}$ \end_inset . Suppose we change a variable so that the new variable \begin_inset Formula $newy_{i}$ \end_inset is 1000 times greater than the old one. That is, we replace values like 444 with 444000. 
Then your theory has to be \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} newy_{i}=new\beta_{0}+new\beta_{1}x_{i}+newe_{i}\label{eq:ols2}\end{equation} \end_inset \end_layout \begin_layout Standard The theory, of course, is not substantively affected. The regression coefficients will be 1000 times greater, but their standard errors will also be 1000 times greater, the t-tests will be exactly the same, and the \begin_inset Formula $R^{2}$ \end_inset will be exactly the same. There is no substantive damage. Suppose flies are counted in the 1000s. The number of flies (in thousands) in a jar goes up 4.5 for each cubic millimeter of sugar. If the dependent variable is re-scaled so that it represents flies, then the number of flies goes up 4500 per cubic millimeter of sugar. Either way, we are talking about the same number of flies. \end_layout \begin_layout Standard In contrast, if you make other kinds of re-scalings of \begin_inset Formula $y_{i}$ \end_inset , then you don't necessarily preserve intervals, and thus you cause a substantive change in the interpretation of the coefficients. \end_layout \begin_layout Section \end_layout \begin_layout Standard \end_layout \begin_layout Section Extending the model to deal with Ordinal Dependent Variables \end_layout \begin_layout Subsection Review of the cumulative probability interpretation \end_layout \begin_layout Standard In order to treat ordinal dependent variables, one must follow the second approach to logit models spelled out on my first handout. I called that the \begin_inset Quotes eld \end_inset cumulative probability interpretation. \begin_inset Quotes erd \end_inset So please review that. \end_layout \begin_layout Standard Suppose \begin_inset Formula $y_{i}$ \end_inset can have \begin_inset Formula $3$ \end_inset values, \begin_inset Formula $0$ \end_inset , \begin_inset Formula $1$ \end_inset , and \begin_inset Formula $2$ \end_inset . 
Assume that there are \begin_inset Quotes eld \end_inset thresholds \begin_inset Quotes erd \end_inset or \begin_inset Quotes eld \end_inset cutoff points \begin_inset Quotes erd \end_inset , \begin_inset Formula $\Pi$ \end_inset , that separate the observations: \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} y_{i}=\left\{ \begin{array}{lll} 2 & if\, b_{0}+b_{1}X_{i}-e_{i}\geq\Pi_{1}\\ 1 & if\,\Pi_{0}\leq b_{0}+b_{1}X_{i}-e_{i}<\Pi_{1}\\ 0 & if\, b_{0}+b_{1}X_{i}-e_{i}<\Pi_{0}\end{array}\right.\label{eq:3category1}\end{equation} \end_inset \end_layout \begin_layout Standard You can use any distribution for \begin_inset Formula $e_{i}$ \end_inset that you like, but the computational challenges cause many people to prefer the logistic distribution. As in the dichotomous case, the probabilities of the various outcomes are calculated by use of cumulative probability. \end_layout \begin_layout Standard I always get confused about the signs of the error terms and the thresholds. It seems to me that no 2 books use the same terminology and style and if you estimate these things with 2 programs, you are just as likely to find the thresholds estimated as positives or negatives, or as intercepts for the particular values of \begin_inset Formula $y_{i}$ \end_inset . The following is equivalent to expression \begin_inset LatexCommand \ref{eq:3category1} \end_inset , where \begin_inset Formula $\Phi$ \end_inset is the cumulative distribution function of \begin_inset Formula $e_{i}$ \end_inset . \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} \begin{array}{lll} P(y_{i}=2) & =P(e_{i}\leq b_{0}+b_{1}X_{i}-\Pi_{1}) & =\Phi(b_{0}+b_{1}X_{i}-\Pi_{1})\\ P(y_{i}=1) & =P(b_{0}+b_{1}X_{i}-\Pi_{1}<e_{i}\leq b_{0}+b_{1}X_{i}-\Pi_{0}) & =\Phi(b_{0}+b_{1}X_{i}-\Pi_{0})-\Phi(b_{0}+b_{1}X_{i}-\Pi_{1})\\ P(y_{i}=0) & =P(e_{i}>b_{0}+b_{1}X_{i}-\Pi_{0}) & =1-\Phi(b_{0}+b_{1}X_{i}-\Pi_{0})\end{array}\label{eq:3category2}\end{equation} \end_inset \end_layout \begin_layout Standard An illustration of this is presented in Figure \begin_inset LatexCommand \ref{cap:Ordinal-Logit} \end_inset . 
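\end_layout \begin_layout Standard A quick numerical check may help keep the signs straight. The values are hypothetical, chosen only for illustration: suppose \begin_inset Formula $b_{0}+b_{1}X_{i}=0.5$ \end_inset , \begin_inset Formula $\Pi_{0}=-1$ \end_inset , \begin_inset Formula $\Pi_{1}=1$ \end_inset , and \begin_inset Formula $e_{i}$ \end_inset is logistic, so that its cumulative distribution function is \begin_inset Formula $\Phi(z)=1/(1+e^{-z})$ \end_inset . Rearranging the three conditions in expression \begin_inset LatexCommand \ref{eq:3category1} \end_inset to isolate \begin_inset Formula $e_{i}$ \end_inset gives \end_layout \begin_layout Standard \begin_inset Formula \[ \begin{array}{lll} P(y_{i}=2) & =\Phi(0.5-1)=\Phi(-0.5) & \approx0.378\\ P(y_{i}=1) & =\Phi(0.5+1)-\Phi(0.5-1)=\Phi(1.5)-\Phi(-0.5) & \approx0.440\\ P(y_{i}=0) & =1-\Phi(0.5+1)=1-\Phi(1.5) & \approx0.182\end{array}\] \end_inset \end_layout \begin_layout Standard The three probabilities sum to 1, as they must, and raising \begin_inset Formula $b_{0}+b_{1}X_{i}$ \end_inset moves probability toward the higher categories. 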
\end_layout \begin_layout Standard \begin_inset Float figure wide false sideways false status open \begin_layout Caption Ordinal Logit \begin_inset LatexCommand \label{cap:Ordinal-Logit} \end_inset \end_layout \begin_layout Standard \begin_inset VSpace 1in \end_inset \end_layout \begin_layout Standard \align center \begin_inset Include \input{cumulative2.pstex_t} preview true \end_inset \end_layout \end_inset \end_layout \begin_layout Standard Please note a potential source of confusion. I have used this notation as an extension of my notes on the dichotomous dependent variable. I have a minus sign in the expression for the \begin_inset Quotes eld \end_inset explanatory part \begin_inset Quotes erd \end_inset \begin_inset Formula $b_{0}+b_{1}X_{i}-e_{i}$ \end_inset because it seemed easier to me when referring to figures. Unfortunately, in the multicategory case, the figure is a little bit \begin_inset Quotes eld \end_inset backwards \begin_inset Quotes erd \end_inset because the sections for the categories count up from left to right, and the \begin_inset Quotes eld \end_inset threshold \begin_inset Quotes erd \end_inset coefficients have minus signs. Because of small wrinkles like this, the threshold coefficients and intercepts estimated in logistic regressions should be carefully scrutinized. No two programs seem to give the exact same results. Oh, well. \end_layout \begin_layout Standard Please note that the constant \begin_inset Formula $b_{0}$ \end_inset and the threshold \begin_inset Formula $\Pi_{0}$ \end_inset cannot be separately estimated. Some computer programs will eliminate the constant, and just estimate two threshold parameters, while some programs will eliminate the first threshold, and estimate one constant and the other threshold. And some programs get rid of the threshold idea altogether and just estimate two separate constants, one for each of the first 2 categorical outcomes. 
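\end_layout \begin_layout Standard The reason can be sketched in one line. Take any constant \begin_inset Formula $c$ \end_inset (a symbol introduced here only for illustration) and add it to the constant and to both thresholds. Then \end_layout \begin_layout Standard \begin_inset Formula \[ (b_{0}+c)+b_{1}X_{i}-(\Pi_{k}+c)=b_{0}+b_{1}X_{i}-\Pi_{k},\qquad k=0,1\] \end_inset \end_layout \begin_layout Standard Every outcome probability in the three-category model depends on the parameters only through these two expressions, so the data cannot distinguish \begin_inset Formula $(b_{0},\Pi_{0},\Pi_{1})$ \end_inset from \begin_inset Formula $(b_{0}+c,\Pi_{0}+c,\Pi_{1}+c)$ \end_inset . Only the differences \begin_inset Formula $b_{0}-\Pi_{0}$ \end_inset and \begin_inset Formula $b_{0}-\Pi_{1}$ \end_inset are pinned down. Choosing \begin_inset Formula $c=-b_{0}$ \end_inset gives the version with no constant and two thresholds; choosing \begin_inset Formula $c=-\Pi_{0}$ \end_inset gives the version with one constant and one remaining threshold. 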
\end_layout \begin_layout Standard \begin_inset ERT status collapsed \begin_layout Standard \backslash bigskip \end_layout \end_inset \end_layout \begin_layout Section Multinomial Logit model \end_layout \begin_layout Standard Suppose the dependent variable is truly nominal. Then one cannot use the continuous probability distribution as the \begin_inset Quotes eld \end_inset engine \begin_inset Quotes erd \end_inset to drive the transition to multiple categories. Instead, we need some model to predict among several unordered categories. \end_layout \begin_layout Standard Note the terminology problem: some political scientists refer to an ordinal logit model as a multinomial logit model, which is just wrong and confusing. \end_layout \begin_layout Standard The Multinomial model begins with a concession. Suppose, for each possible outcome j, we had a predictive model: \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} P_{j}=\frac{e^{b_{j}x}}{\sum_{k=1}^{m}e^{b_{k}x}}\label{eq:mnl1}\end{equation} \end_inset \end_layout \begin_layout Standard Here m is the number of different observed values. We hypothesize that the probability is given by this ratio, in which for each value there is a vector of coefficients, \begin_inset Formula $b_{j}$ \end_inset . \end_layout \begin_layout Standard For a variety of reasons, some of which I could explain if I had time, it is not possible to estimate such a big model. Part of the problem is that, with probabilities, one can specify only m-1 probabilities, and then the last category is logically required to equal \begin_inset Formula $1-P_{1}-P_{2}-\ldots-P_{m-1}$ \end_inset . So, although the theory says there are \begin_inset Formula $m$ \end_inset sets of coefficients, actually there are \begin_inset Formula $m-1$ \end_inset sets, and the last can be logically deduced. \end_layout \begin_layout Standard As a result of this limitation, people who use the MNL model are forced to make a simplification. 
Instead of estimating the full, hypothesized model, they estimate a model that sets one outcome as the \begin_inset Quotes eld \end_inset baseline \begin_inset Quotes erd \end_inset outcome and then estimate the factors that differentiate the other outcomes from that \begin_inset Quotes eld \end_inset baseline \begin_inset Quotes erd \end_inset . By custom, if the outcomes are numbered 1, 2, 3, ..., the baseline is category 1. In the MNL model, then, with 3 categories, we really only need to estimate 2 models. Let \begin_inset Formula $P_{j}$ \end_inset represent the probability that a score will fall into the \begin_inset Formula $j$ \end_inset 'th category. (I'm not writing in the subscript \begin_inset Formula $i$ \end_inset for individual cases, so don't get excited.) \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} \ln\left[\frac{P_{2}}{P_{1}}\right]=b_{0.12}+b_{1.12}X_{i}\label{eq:mnl3a}\end{equation} \end_inset and \begin_inset Formula \begin{equation} \ln\left[\frac{P_{3}}{P_{1}}\right]=b_{0.13}+b_{1.13}X_{i}\label{eq:mnl3b}\end{equation} \end_inset \end_layout \begin_layout Standard The subscripts are awful, aren't they? I'm not completely happy with my notation. From these, it is logically possible to deduce \begin_inset Formula $\frac{P_{3}}{P_{2}}$ \end_inset , because \begin_inset Formula $\ln\left[\frac{P_{3}}{P_{2}}\right]=\ln\left[\frac{P_{3}}{P_{1}}\right]-\ln\left[\frac{P_{2}}{P_{1}}\right]$ \end_inset . \end_layout \begin_layout Standard I realize as I write this that I've not explained the transition from the theory to this set of estimating equations, but will try later... \end_layout \end_body \end_document