#LyX 1.4.3 created this file. For more info see http://www.lyx.org/ \lyxformat 245 \begin_document \begin_header \textclass literate-article \begin_preamble \usepackage{latexsym} \usepackage{graphicx} \usepackage{psfig} \usepackage{color} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands. \usepackage{ragged2e} \RaggedRight \setlength{\parindent}{1 em} \end_preamble \language english \inputencoding auto \fontscheme times \graphics default \paperfontsize 12 \spacing single \papersize default \use_geometry true \use_amsmath 1 \cite_engine basic \use_bibtopic false \paperorientation portrait \leftmargin 1in \topmargin 1in \rightmargin 1in \bottommargin 1in \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \defskip medskip \quotes_language english \papercolumns 1 \papersides 1 \paperpagestyle default \tracking_changes false \output_changes true \end_header \begin_body \begin_layout Title Multi-Category Dependent Variables \end_layout \begin_layout Author Paul E. Johnson \end_layout \begin_layout Standard \begin_inset LatexCommand \tableofcontents{} \end_inset \end_layout \begin_layout Section Levels of Measurement \end_layout \begin_layout Standard The problems with OLS can be seen differently in the light of the so-called levels of measurement from elementary research design. \end_layout \begin_layout Subsection Nominal, Ordinal, Interval \end_layout \begin_layout Standard \series bold Nominal variable \series default : all measurement information is preserved if one reassigns scores to groups of observations. \end_layout \begin_layout Standard If all observations with an assigned score of \begin_inset Formula $x1$ \end_inset are re-labeled as \begin_inset Formula $x2$ \end_inset , and all observations that were originally assigned \begin_inset Formula $x2$ \end_inset are re-labeled as \begin_inset Formula $x1$ \end_inset , no information is lost. 
\end_layout \begin_layout Standard Aside from confusion in public restrooms, it would make no difference if all \begin_inset Quotes eld \end_inset men \begin_inset Quotes erd \end_inset were somehow magically re-labeled as \begin_inset Quotes eld \end_inset women \begin_inset Quotes erd \end_inset , and vice versa. The differentiation of the scores is preserved by rescoring. \end_layout \begin_layout Standard In all models that treat nominal variables, it is important that the results of research not depend on the particular numerical score that is assigned. If one codes men as \begin_inset Formula $100$ \end_inset and women as \begin_inset Formula $200$ \end_inset , the analysis should find the same conclusion as one in which men are coded \begin_inset Formula $-111$ \end_inset and women are coded \begin_inset Formula $555$ \end_inset . \end_layout \begin_layout Standard \series bold Ordinal variable \series default : if 2 cases are observed and assigned scores \begin_inset Formula $x1 < x2$ \end_inset , then any re-scoring must keep the first score below the second. Only order preserving transformations are allowed. \end_layout \begin_layout Standard \series bold Interval variable \series default : in addition to the ordering, the relative sizes of the gaps between scores carry information. The scores may be re-scaled only by a linear transformation, with \begin_inset Formula $\beta > 0$ \end_inset and any \begin_inset Formula $\alpha$ \end_inset . \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} \alpha+\beta*x1<\alpha+\beta*x2\label{eq:interval2}\end{equation} \end_inset \end_layout \begin_layout Standard Note that the gap between the scores is exactly proportional to \begin_inset Formula $\beta$ \end_inset . Originally, the gap between them is \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} x2-x1\label{eq:interval3}\end{equation} \end_inset \end_layout \begin_layout Standard and the difference between the new scores is \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} \{\alpha+\beta*x2\}-\{\alpha+\beta*x1\}=\beta\cdot(x2-x1)\label{eq:interval4}\end{equation} \end_inset \end_layout \begin_layout Standard Interval level is, of course, a very restrictive measurement. Only the interval preserving transformation is allowed. 
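\end_layout \begin_layout Standard As a worked illustration with made-up scores (an example added for clarity, not drawn from any data set in these notes), consider temperature. Converting Celsius to Fahrenheit is an interval preserving transformation with \begin_inset Formula $\alpha=32$ \end_inset and \begin_inset Formula $\beta=1.8$ \end_inset . Scores of 10, 20 and 40 become 50, 68 and 104, and the ratio of the two gaps is unchanged: \end_layout \begin_layout Standard \begin_inset Formula \[ \frac{40-20}{20-10}=2\qquad\textrm{and}\qquad\frac{104-68}{68-50}=\frac{36}{18}=2\] \end_inset \end_layout \begin_layout Standard Squaring the same scores keeps their order but not the ratio of the gaps, since \begin_inset Formula $(40^{2}-20^{2})/(20^{2}-10^{2})=1200/300=4$ \end_inset . Squaring is therefore harmless for an ordinal variable with positive scores, but it is not allowed for an interval variable. 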
\end_layout \begin_layout Subsection OLS \end_layout \begin_layout Standard Think about ordinary least squares for a moment. We had to start with the theory that \begin_inset Formula \begin{equation} y_{i}=\beta_{0}+\beta_{1}x_{i}+e_{i}\label{eq:ols}\end{equation} \end_inset \end_layout \begin_layout Standard Notice that we give substantive meaning to \begin_inset Formula $\beta_{0}$ \end_inset and \begin_inset Formula $\beta_{1}$ \end_inset and we do so by interpreting them in light of the units of the dependent variable. \end_layout \begin_layout Standard One is, of course, free to re-scale \begin_inset Formula $y_{i}$ \end_inset . Suppose we change a variable so that the new variable \begin_inset Formula $newy_{i}$ \end_inset is 1000 times greater than the old one. That is, we replace values like 444 with 444000. Then your theory has to be \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} newy_{i}=new\beta_{0}+new\beta_{1}x_{i}+newe_{i}\label{eq:ols2}\end{equation} \end_inset \end_layout \begin_layout Standard The theory, of course, is not substantively affected. The regression coefficients will be 1000 times greater, but their standard errors will also be 1000 times greater, the t-tests will be exactly the same, and the \begin_inset Formula $R^{2}$ \end_inset will be exactly the same. There is no substantive damage. Suppose flies are counted in the 1000s. The number of flies (in thousands) in a jar goes up 4.5 for each cubic millimeter of sugar. If the dependent variable is re-scaled so that it represents flies, then the number of flies goes up 4500 per cubic millimeter of sugar. Either way, we are talking about the same number of flies. \end_layout \begin_layout Standard In contrast, if you make other kinds of re-scalings of \begin_inset Formula $y_{i}$ \end_inset , then you don't necessarily preserve intervals, and thus you cause a substantive change in the interpretation of the coefficients. 
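\end_layout \begin_layout Standard A small simulation can illustrate the claim about multiplying the dependent variable by 1000. This is only a sketch with made-up data: the variable names (sugar, flies1000, flies), the sample size, the seed, and the true coefficients are all assumptions chosen for illustration, not values from any data set used in these notes. \end_layout \begin_layout Scrap <<>>= \newline # hypothetical data: sugar is the predictor, flies1000 counts flies in thousands \newline set.seed(1234) \newline sugar <- runif(50, min = 0, max = 10) \newline flies1000 <- 2 + 4.5 * sugar + rnorm(50, sd = 3) \newline # the same outcome, re-scaled so that it counts individual flies \newline flies <- 1000 * flies1000 \newline mod1 <- lm(flies1000 ~ sugar) \newline mod2 <- lm(flies ~ sugar) \newline # coefficient tables: estimate, standard error, t value, p value \newline coef(summary(mod1)) \newline coef(summary(mod2)) \newline # R-squared from each fit \newline c(summary(mod1)$r.squared, summary(mod2)$r.squared) \newline @ \end_layout \begin_layout Standard Comparing the two coefficient tables row by row, the estimates and standard errors in the second fit should be 1000 times those in the first, while the t values and the two \begin_inset Formula $R^{2}$ \end_inset values should agree. 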
\end_layout \begin_layout Scrap \end_layout \begin_layout Section Extending the model to deal with Ordinal Dependent Variables \end_layout \begin_layout Subsection Review of the cumulative probability interpretation \end_layout \begin_layout Standard In order to treat ordinal dependent variables, one must follow the second approach to logit models spelled out on my first handout. I called that the \begin_inset Quotes eld \end_inset cumulative probability interpretation. \begin_inset Quotes erd \end_inset So please review that. \end_layout \begin_layout Standard Suppose \begin_inset Formula $y_{i}$ \end_inset can have \begin_inset Formula $3$ \end_inset values, \begin_inset Formula $0$ \end_inset , \begin_inset Formula $1$ \end_inset , and \begin_inset Formula $2$ \end_inset . Assume that there are \begin_inset Quotes eld \end_inset thresholds \begin_inset Quotes erd \end_inset or \begin_inset Quotes eld \end_inset cutoff points \begin_inset Quotes erd \end_inset , \begin_inset Formula $\Pi$ \end_inset that separate the observations: \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} y_{i}=\left\{ \begin{array}{lll} 2 & if\, b_{0}+b_{1}X_{i}-e_{i}\geq\Pi_{1}\\ 1 & if\,\Pi_{0}\leq b_{0}+b_{1}X_{i}-e_{i}<\Pi_{1}\\ 0 & if\, b_{0}+b_{1}X_{i}-e_{i}<\Pi_{0}\end{array}\right.\label{eq:3category1}\end{equation} \end_inset \end_layout \begin_layout Standard You can use any distribution for \begin_inset Formula $e_{i}$ \end_inset that you like, but the computational challenges cause many people to prefer the logistic distribution. As in the dichotomous case, the probabilities of the various outcomes are calculated by use of cumulative probability. \end_layout \begin_layout Standard I always get confused about the signs of the error terms and the thresholds. 
It seems to me that no 2 books use the same terminology and style and if you estimate these things with 2 programs, you are just as likely to find the thresholds estimated as positives or negatives, or as intercepts for the particular values of \begin_inset Formula $y_{i}$ \end_inset . The following is equivalent to expression \begin_inset LatexCommand \ref{eq:3category1} \end_inset , with \begin_inset Formula $\Phi$ \end_inset denoting the cumulative distribution function of \begin_inset Formula $e_{i}$ \end_inset . \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} \begin{array}{lll} P(y_{i}=2) & =P(e_{i}\leq b_{0}+b_{1}X_{i}-\Pi_{1}) & =\Phi(b_{0}+b_{1}X_{i}-\Pi_{1})\\ P(y_{i}=1) & =P(b_{0}+b_{1}X_{i}-\Pi_{1} < e_{i}\leq b_{0}+b_{1}X_{i}-\Pi_{0}) & =\Phi(b_{0}+b_{1}X_{i}-\Pi_{0})-\Phi(b_{0}+b_{1}X_{i}-\Pi_{1})\\ P(y_{i}=0) & =P(e_{i} > b_{0}+b_{1}X_{i}-\Pi_{0}) & =1-\Phi(b_{0}+b_{1}X_{i}-\Pi_{0})\end{array}\label{eq:3category2}\end{equation} \end_inset \end_layout \begin_layout Standard An illustration of this is presented in Figure \begin_inset LatexCommand \ref{cap:Ordinal-Logit} \end_inset . \end_layout \begin_layout Standard \begin_inset Float figure wide false sideways false status open \begin_layout Caption Ordinal Logit \begin_inset LatexCommand \label{cap:Ordinal-Logit} \end_inset \end_layout \begin_layout Standard \begin_inset VSpace 1in \end_inset \end_layout \begin_layout Standard \align center \begin_inset Include \input{/home/pauljohn/ps/ps707/LogisticRegression/cumulative2.pstex_t} preview true \end_inset \end_layout \end_inset \end_layout \begin_layout Standard Please note a potential source of confusion. I have used this notation as an extension of my notes on the dichotomous dependent variable. I have a minus sign in the expression for the \begin_inset Quotes eld \end_inset explanatory part \begin_inset Quotes erd \end_inset \begin_inset Formula $b_{0}+b_{1}X_{i}-e_{i}$ \end_inset because it seemed easier to me when referring to figures. Unfortunately, in the multicategory case, the figure is a little bit \begin_inset Quotes eld \end_inset backwards \begin_inset Quotes erd \end_inset because the sections for the categories count up from left to right. 
And the \begin_inset Quotes eld \end_inset threshold \begin_inset Quotes erd \end_inset coefficients have minus signs. Because of small wrinkles like this, the threshold coefficients and intercepts estimated in logistic regressions should be carefully scrutinized. No two programs seem to give the exact same results. Oh, well. \end_layout \begin_layout Standard Please note that the constant \begin_inset Formula $b_{0}$ \end_inset and the coefficient \begin_inset Formula $\Pi_{0}$ \end_inset cannot be separately estimated. Some computer programs will eliminate the constant, and just estimate two threshold parameters, while some programs will eliminate the first threshold, and estimate one constant and the other threshold. And some programs get rid of the threshold idea altogether and just estimate two separate constants, one for each of the first 2 categorical outcomes. \end_layout \begin_layout Standard \begin_inset ERT status collapsed \begin_layout Standard \backslash bigskip \end_layout \end_inset \end_layout \begin_layout Section Bring in data from the ANES 2002 survey \end_layout \begin_layout Standard R's package \begin_inset Quotes eld \end_inset foreign \begin_inset Quotes erd \end_inset has a number of procedures to import data from other packages. My experience is that the importation of SPSS and Stata datasets is quite effective, but SAS datasets are a little unpredictable. Happily, this data imports correctly. \end_layout \begin_layout Scrap <<>>= \newline library(foreign) \newline # read the SAS transport file \newline nes2002 <- read.xport("/home/pauljohn/ps/ps707/LogisticRegression/PJTEST.sasxport") \newline # recode V023111: values above 3 and the value 0 become missing, 1 becomes 0, 3 becomes 1 \newline bushvote <- nes2002$V023111 \newline bushvote[bushvote>3] <- NA \newline bushvote[bushvote==0] <- NA \newline bushvote[bushvote==1] <- 0 \newline bushvote[bushvote==3] <- 1 \newline # indicator variables built from V023036 (1 gives democ, 2 gives repub) \newline democ <- ifelse(nes2002$V023036==1,1,0) \newline repub <- ifelse(nes2002$V023036==2,1,0) \newline @ \end_layout \begin_layout Section Ordinal Logistic Model \end_layout \begin_layout Standard V023027 H1. 
US Economy Better/Worse in Last Yr \end_layout \begin_layout Scrap <<>>= \newline table(nes2002$V023027,nes2002$V023022) \newline prop.table(table(nes2002$V023027,nes2002$V023022),margin=2) \newline @ \end_layout \begin_layout Subsection Ordinal logistic regression with polr \end_layout \begin_layout Standard Proportional odds ordinal logistic, dependent variable=state of econ, independent variable=ideology \end_layout \begin_layout Scrap <<>>= \newline # polr is provided by the MASS package \newline library(MASS) \newline polr1 <- polr(as.factor(V023027)~V023022,data=nes2002) \newline summary(polr1) \newline @ \end_layout \begin_layout Standard In Modern Applied Statistics with S+, Venables and Ripley (p. 204) discuss the formalization behind polr. They say the response variable has \begin_inset Formula $K$ \end_inset levels and the probability model is \begin_inset Formula \[ logit\, P(Y_{i}\le k|X_{i})=\zeta_{k}-(b_{0}+b_{1}X_{i})\] \end_inset \newline Notice the sign there, which is the opposite of my previous handouts. They are saying that the probability that \begin_inset Formula $Y_{i}$ \end_inset will be \begin_inset Formula $k$ \end_inset or lower is equal to the probability that \begin_inset Formula $\zeta_{k}$ \end_inset exceeds the linear predictor \begin_inset Formula $(b_{0}+b_{1}X_{i})$ \end_inset . \end_layout \begin_layout Standard The signs get all messed up because everybody who writes a book or program is free to write down the theoretical model in the way he/she likes. So the cautionary tale is, when you use somebody's computer program for one of these models, it is vital that you have access to their manual/book/article that explains how they are using the terminology. Without it, you are sunk! \end_layout \begin_layout Standard (Recall the problem that the signs of PROC LOGISTIC are the negatives of the signs you get for parameters in any other programs, because in PROC LOGISTIC they have the inequality facing the other way.) 
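\end_layout \begin_layout Standard To make the sign convention concrete, here is a sketch of how the category probabilities would follow from the Venables and Ripley formula above when there are 3 categories, numbered 1, 2 and 3, and two cutpoints \begin_inset Formula $\zeta_{1} < \zeta_{2}$ \end_inset . The symbol \begin_inset Formula $\eta_{i}$ \end_inset for the linear predictor is shorthand introduced here for illustration, not their notation. \end_layout \begin_layout Standard \begin_inset Formula \[ P(Y_{i}\le k|X_{i})=\frac{1}{1+e^{-(\zeta_{k}-\eta_{i})}},\qquad\eta_{i}=b_{0}+b_{1}X_{i}\] \end_inset \end_layout \begin_layout Standard \begin_inset Formula \[ \begin{array}{ll} P(Y_{i}=1|X_{i}) & =P(Y_{i}\le1|X_{i})\\ P(Y_{i}=2|X_{i}) & =P(Y_{i}\le2|X_{i})-P(Y_{i}\le1|X_{i})\\ P(Y_{i}=3|X_{i}) & =1-P(Y_{i}\le2|X_{i})\end{array}\] \end_inset \end_layout \begin_layout Standard Under this convention a positive \begin_inset Formula $b_{1}$ \end_inset lowers every cumulative probability as \begin_inset Formula $X_{i}$ \end_inset increases, so it shifts probability toward the higher numbered categories. 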
\end_layout \begin_layout Subsection Ordinal logistic with lrm \end_layout \begin_layout Standard Note lrm gives the same parameter estimates, \emph on except threshold signs are reversed \emph default . The sign is a matter of whether you view it as a \begin_inset Quotes eld \end_inset threshold \begin_inset Quotes erd \end_inset that must be exceeded or as an \begin_inset Quotes eld \end_inset intercept \begin_inset Quotes erd \end_inset which is added. \end_layout \begin_layout Standard Don't forget to load the Design library with \end_layout \begin_layout Standard >library(Design) \end_layout \begin_layout Scrap <<>>= \newline library(Design) \newline @ \end_layout \begin_layout Scrap <<>>= \newline ordlrm1 <- lrm(as.factor(V023027)~V023022,data=nes2002) \newline ordlrm1 \newline anova(ordlrm1) \newline @ \end_layout \begin_layout Scrap <<>>= \newline ordlrm2 <- lrm(as.factor(V023027)~V023022+repub+democ+V023027+V023131, data=nes2002) \newline ordlrm2 \newline anova(ordlrm2) \newline @ \end_layout \begin_layout Standard OUCH! \end_layout \begin_layout Scrap <<>>= \newline ordlrm2 <- lrm(as.factor(V023027)~V023022+repub+democ+V023027+V023131, data=nes2002,maxit=500,trace=T) \newline ordlrm2 \newline anova(ordlrm2) \newline @ \end_layout \begin_layout Section Multinomial Logit model \end_layout \begin_layout Standard Suppose the dependent variable is truly nominal. Then one cannot use the continuous probability distribution as the \begin_inset Quotes eld \end_inset engine \begin_inset Quotes erd \end_inset to drive the transition to multiple categories. Instead, we need some model to predict among several unordered categories. \end_layout \begin_layout Standard Note the terminology problem that some political scientists refer to an ordinal logit model as a multinomial logit model, which is just wrong and confusing. \end_layout \begin_layout Standard The Multinomial model begins with a concession. 
Suppose, for each possible outcome j, we had a predictive model: \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} P_{j}=\frac{e^{b_{j}x}}{\sum_{k=1}^{m}e^{b_{k}x}}\label{eq:mnl1}\end{equation} \end_inset \end_layout \begin_layout Standard Here m is the number of different observed values. And we hypothesize that the probability is given by this ratio, in which for each value there is a vector of coefficients, \begin_inset Formula $b_{j}$ \end_inset . \end_layout \begin_layout Standard For a variety of reasons, some of which I could explain if I had time, it is not possible to estimate such a big model. Part of the problem is that, with probabilities, one can specify only m-1 probabilities, and then the last category is logically required to equal \begin_inset Formula $1-P_{1}-P_{2}-\ldots-P_{m-1}$ \end_inset . So, although the theory says there are \begin_inset Formula $m$ \end_inset sets of coefficients, actually there are \begin_inset Formula $m-1$ \end_inset sets, and the last can be logically deduced. \end_layout \begin_layout Standard As a result of this limitation, people who use the MNL model are forced to make a simplification. Instead of estimating the full, hypothesized model, they estimate a model that sets one outcome as the \begin_inset Quotes eld \end_inset baseline \begin_inset Quotes erd \end_inset outcome and then estimate the factors that differentiate the other outcomes from that \begin_inset Quotes eld \end_inset baseline \begin_inset Quotes erd \end_inset . By custom, if the outcomes are numbered 1, 2, 3, ..., the baseline is category 1. In the MNL model, then, with 3 categories, we really only need to estimate 2 models. Let \begin_inset Formula $P_{j}$ \end_inset represent the probability that a score will fall into the \begin_inset Formula $j$ \end_inset 'th category. 
(I'm not writing in the subscript \begin_inset Formula $i$ \end_inset for individual cases; don't get excited.) \end_layout \begin_layout Standard \begin_inset Formula \begin{equation} ln\left[\frac{P_{2}}{P_{1}}\right]=b_{0.12}+b_{1.12}X_{i}\label{eq:mnl3a}\end{equation} \end_inset and \begin_inset Formula \begin{equation} ln\left[\frac{P_{3}}{P_{1}}\right]=b_{0.13}+b_{1.13}X_{i}\label{eq:mnl3b}\end{equation} \end_inset \end_layout \begin_layout Standard The subscripts are awful, aren't they? I'm not completely happy with my notation. From these, it is logically possible to deduce \begin_inset Formula $\frac{P_{3}}{P_{2}}$ \end_inset : subtracting the first equation from the second gives \begin_inset Formula $ln\left[\frac{P_{3}}{P_{2}}\right]=(b_{0.13}-b_{0.12})+(b_{1.13}-b_{1.12})X_{i}$ \end_inset . \end_layout \begin_layout Standard I realize as I write this that I've not explained the transition from the theory to this set of estimating equations, but will try later... \end_layout \begin_layout Section Multinomial model (unordered dependent variable) \end_layout \begin_layout Standard I don't know why you would think so, but suppose you wanted to drop the assumption that the economic evaluation is ordinal. Then a multinomial model is the right choice. \end_layout \begin_layout Standard In this example, the model runs 100 iterations and quits without converging. Not a great sign! 
\end_layout \begin_layout Scrap <<>>= \newline library(nnet) \newline econmult1 <- multinom(as.factor(V023027)~V023022+repub+democ+V023027+V023131,data=nes2002,Hess=T) \newline summary(econmult1) \newline @ \end_layout \begin_layout Scrap <<>>= \newline library(nnet) \newline econmult2 <- multinom(as.factor(V023027)~V023022+repub+democ+V023027+V023131,data=nes2002,Hess=T,maxit=500) \newline summary(econmult2) \newline @ \end_layout \begin_layout Subsection Vote is multinomial (if you have a big enough sample) \end_layout \begin_layout Scrap <<>>= \newline library(nnet) \newline vote <- nes2002$V023111 \newline table(vote) \newline vote[vote>5] <- NA \newline votemn1 <- multinom(vote~V023022+repub+V023027+V023131,data=nes2002) \newline summary(votemn1) \newline @ \end_layout \begin_layout Scrap <<>>= \newline library(nnet) \newline vote <- nes2002$V023111 \newline table(vote) \newline vote[vote>5] <- NA \newline votemn2 <- multinom(vote~V023022+repub+V023027+V023131,data=nes2002) \newline summary(votemn2) \newline @ \end_layout \begin_layout Standard #ideology \end_layout \begin_layout Standard #note there is trouble with this model because of the variable repub \end_layout \begin_layout Standard #compare against glm5 summary(bushideoglm6) \end_layout \begin_layout Standard #summary(bushideoglm5) \end_layout \end_body \end_document