#LyX 1.3 created this file. For more info see http://www.lyx.org/ \lyxformat 221 \textclass article \language english \inputencoding auto \fontscheme times \graphics default \paperfontsize 12 \spacing single \papersize Default \paperpackage a4 \use_geometry 1 \use_amsmath 0 \use_natbib 0 \use_numerical_citations 0 \paperorientation portrait \leftmargin 1in \topmargin 1in \rightmargin 1in \bottommargin 1in \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \defskip medskip \quotes_language english \quotes_times 2 \papercolumns 1 \papersides 1 \paperpagestyle default \layout Standard Paul Johnson \layout Standard POLS 707 Research Methods II \layout Standard \added_space_top medskip Top 13 Things worth knowing about R. \layout Enumerate S books are R books. \newline \newline S is a statistical language developed at Bell Labs (R.A. Becker, J.M. Chambers, A.R. Wilks, \emph on The New S Language. \emph default Pacific Grove, CA: Wadsworth & Brooks/Cole, 1988) and a computer program that speaks S, called \begin_inset Quotes eld \end_inset S \begin_inset Quotes erd \end_inset (or, more recently, \begin_inset Quotes eld \end_inset S+ \begin_inset Quotes erd \end_inset ) is for sale from Mathsoft, Inc. \newline \newline R is a statistical language very much like S, better in some ways, and it is available for free and in open source code. R is created/developed by a community of statisticians and programmers who truly desire to make good tools available for research and escape the binds imposed by commercialization of software. \newline \newline Books about S generally will also be applicable to R. Here are some of the really useful ones of which I have copies. \newline \newline W.N. Venables and B.D. Ripley, \emph on Modern Applied Statistics with S \emph default , 4th edition. Brian Ripley is an amazing guy. He must work 20 hours per day. He is one of the primary developers behind the R movement and he reads the r-help email list. 
If you ever touch on some problem deep down in the guts of R, he may be the only one who will really know the answer. \newline \newline Peter Dalgaard, \begin_inset Quotes eld \end_inset Introductory Statistics with R \begin_inset Quotes erd \end_inset . Peter Dalgaard is very active in r-help and also he is a very pleasant person and a good writer. This book is very well done. It is not as advanced as V&R, but it is much more understandable to your average user. It shows how to complete many \begin_inset Quotes eld \end_inset garden variety \begin_inset Quotes erd \end_inset tasks. \newline \newline Frank E. Harrell, Jr. \emph on Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. \emph default This book has great practical information about how to do projects. He has complete code and working examples. It is especially important if you want to use Harrell's advanced R packages, Hmisc and Design. Frank Harrell is (like me) a \begin_inset Quotes eld \end_inset refugee \begin_inset Quotes erd \end_inset from SAS and S+. He was the one who wrote the first \begin_inset Quotes eld \end_inset Proc Logistic \begin_inset Quotes erd \end_inset that was available in SAS in the early 1980s. In the 1990s he was working in S and S+. \newline \newline The S-Plus stat manuals are available for free, in pdf format! \newline \begin_inset LatexCommand \url[S-Plus site: ]{http://www.insightful.com/support/documentation.asp?DID=3} \end_inset \newline You especially want the Guide to Statistics, Parts I and II. These have simpler explanations of many models than you will find in other places. The syntax for usage is not exactly the same as R, but they are worth downloading and saving for reference. \layout Enumerate R has built-in help. \begin_deeper \layout Enumerate Procedure specific help. 
If you want to know more about the \begin_inset Quotes eld \end_inset hist \begin_inset Quotes erd \end_inset procedure, type \newline > help(hist) \newline That \begin_inset Quotes eld \end_inset help \begin_inset Quotes erd \end_inset method gets used so often that they created a short-cut for it: \newline > ?hist \newline and after you look that over, you notice in the bottom of the help page it has examples, and if you want to see the examples, do \newline > example(hist) \newline If the examples all whir by too quickly, use this command to cause R to ask you to hit the return key between pictures: \newline > par(ask=T) \newline and then type your example(hist) command again. \layout Enumerate Get help inside your web browser. Type \newline > help.start() \newline and watch what happens. Your web browser should show you a top level view, which includes links to the \begin_inset Quotes eld \end_inset Introduction to R \begin_inset Quotes erd \end_inset book as well as the manual on data import and export. \newline \newline You will see also there is a FAQ for R in the online docs. The R FAQ is not a list of precise details about how to do certain things, but rather a higher level explanation of the R language and the various platforms on which it is used. \layout Enumerate Type \newline > help.search( \begin_inset Quotes eld \end_inset hist \begin_inset Quotes erd \end_inset ) \newline and watch what happens. This returns a giant list of methods, most of which you don't really want. But some you do. Alternative versions of histogram methods appear. For example, Brian Ripley does not like the default \begin_inset Quotes eld \end_inset hist \begin_inset Quotes erd \end_inset method, so his package MASS includes \begin_inset Quotes eld \end_inset truehist \begin_inset Quotes erd \end_inset . The MASS package's truehist is more readily configurable. \end_deeper \layout Enumerate \added_space_bottom medskip Every R command needs parentheses! 
\newline Even when you want out of R, you can't just type quit. \newline You type \newline > quit() \newline or, for short, \newline > q() \newline If you type \newline > q \newline R thinks you want to print out the contents of something named \begin_inset Quotes eld \end_inset q \begin_inset Quotes erd \end_inset , and because q is a function, it prints the definition of that function instead of quitting. \layout Enumerate \begin_inset Quotes eld \end_inset Equal to \begin_inset Quotes erd \end_inset means a whole different thing than you expect. \newline If you create a new variable, do not use the equal sign ( \begin_inset Formula $=$ \end_inset ). Instead, you must use the symbol \begin_inset Formula $<-$ \end_inset to \begin_inset Quotes eld \end_inset assign \begin_inset Quotes erd \end_inset a value. So, for example, you can set a constant named \begin_inset Quotes eld \end_inset b \begin_inset Quotes erd \end_inset : \newline > b \begin_inset Formula $<-$ \end_inset 3.3 \newline or define a vector named \begin_inset Quotes eld \end_inset b \begin_inset Quotes erd \end_inset : \newline > b \begin_inset Formula $<-$ \end_inset c(3,2,1,4,1,5,5,1,1,2) \newline or, if you run some command, like read.table(), you can save its result as a data frame named \begin_inset Quotes eld \end_inset b \begin_inset Quotes erd \end_inset \newline > b \begin_inset Formula $<-$ \end_inset read.table( \begin_inset Quotes eld \end_inset myData.txt \begin_inset Quotes erd \end_inset ,header=T) \newline Note the equal sign is used inside the parentheses. The equal sign is OK there, because it names an argument of read.table() rather than making a permanent assignment to some variable. \newline \newline If you study other math-oriented computer packages, such as \emph on Mathematica \emph default , you will find the same kind of distinction between equals and assignment. It's not an R thing in particular. 
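\newline \newline Here is a minimal sketch of that distinction, using a made-up number. The equal sign only tells round() which argument the 2 belongs to, while the arrow stores the result under the name \begin_inset Quotes eld \end_inset b \begin_inset Quotes erd \end_inset : \newline > b <- round(3.14159, digits=2) \newline > b \newline [1] 3.14 \newline > # digits was only an argument name; this call did not create a variable called digits 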
\layout Enumerate Terminology: \newline I only bring these things up because you are likely to get confused if you go into some manual and find that they are using a lot of terms that are unfamiliar to a social scientist. \begin_deeper \layout Enumerate object: a \begin_inset Quotes eld \end_inset thing \begin_inset Quotes erd \end_inset that \begin_deeper \layout Enumerate has variables (one or more): that means it has a \begin_inset Quotes eld \end_inset name \begin_inset Quotes erd \end_inset and a \begin_inset Quotes eld \end_inset value \begin_inset Quotes erd \end_inset for each variable. The value can be a missing value. \layout Enumerate can do \begin_inset Quotes eld \end_inset stuff \begin_inset Quotes erd \end_inset that you ask it to. \end_deeper \layout Enumerate method: the instructions that are sent to objects. \newline \newline I frequently betray my SAS background by calling these things \begin_inset Quotes eld \end_inset procedures \begin_inset Quotes erd \end_inset . \newline \newline In R, the style is different from what I was used to in computer languages like Java and Objective-C. (It seems to me that, in R, one thinks of a message sent to an object in a way that is almost exactly the opposite of Java.) If you ask an object to carry out a method, in R you say: \newline \newline > \begin_inset Formula $method(object)$ \end_inset \newline \newline whereas in Java you would say: \newline \newline object.method() \newline \layout Enumerate In R, you can do things however you like, but there is a certain recommended style. Most importantly, always remember that the results of methods are always objects. \newline \newline You can run a regression model and watch the output on the screen with this sort of command: \newline \newline lm(y~x) \newline \newline but most people don't do that. 
Rather, they save the \begin_inset Quotes eld \end_inset regression object \begin_inset Quotes erd \end_inset and then use it: \newline \newline myRegObj <- lm(y~x) \newline summary(myRegObj) \newline \newline summary is the method, which the object myRegObj carries out. The R style looks like a function call, but it is not, really. \newline \newline You see my Objective-C background pop out here because I like to name things verbosely, starting with a small letter, then using capitals to start new elements of the name. \end_deeper \layout Enumerate In R, you can type in lots of stuff. But you should not. \newline \newline Write your commands into a text editor and then run them in R. This way, you have a perfect record of what you have done and you can always reproduce it perfectly. \newline \newline The editor Emacs (and Xemacs) can be customized to work together with R, but I don't think there is a big advantage for most users from doing that. \layout Enumerate R has what our students need. \begin_deeper \layout Enumerate The standard R distribution includes all statistical methods (and many more) that are used in an intermediate course like POLS 707. For linear models, use lm(). For scatterplots, use plot(). For histograms, start with hist() (or truehist() from the MASS package). For logit models, use glm(). For analysis of variance, use aov(). For factor analysis, use factanal(). \layout Enumerate R includes a general purpose programming language, featuring vectors, matrices, and many other excellent things. There is no limit to the complexity of your projects (except the limits imposed by your patience, intelligence, and access to hardware). \end_deeper \layout Enumerate Factors! \newline \newline R methods are (usually) sensitive to the kinds of variables you ask them to use. They will automatically treat categorical variables differently from continuous variables. \newline \newline Understand this term: \newline \series bold Factor. \series default S/R terminology for a variable that is not continuous. 
Factors are \begin_inset Quotes eld \end_inset grouping variables \begin_inset Quotes erd \end_inset and they can be ordered (some, lots, all) or unordered (male, female). \newline \newline In the \begin_inset Quotes eld \end_inset old days \begin_inset Quotes erd \end_inset , the statistical software would not try to separate variables that are meaningfully scaled (age, income, etc) from variables that were not. The user had the duty of creating a coding scheme for categorical variables. Remember \begin_inset Quotes eld \end_inset recoding \begin_inset Quotes erd \end_inset simply to force categorical variables into a numerical framework, so they could be put into models? \newline \newline Example 1: dem = 1 if the respondent is a Democrat, 0 otherwise, \newline Example 2: vote = 1 if the respondent voted, 0 otherwise. \newline \newline In R, if you input some variable with values like \begin_inset Quotes eld \end_inset D \begin_inset Quotes erd \end_inset and \begin_inset Quotes eld \end_inset R \begin_inset Quotes erd \end_inset and then you declare it as a factor, then R will \begin_inset Quotes eld \end_inset automagically \begin_inset Quotes erd \end_inset recode it for you (turn it into 0 and 1). There are a few different ways it can automagically recode categorical variables. Look for \begin_inset Quotes eld \end_inset contrasts \begin_inset Quotes erd \end_inset in the documentation. If a model does not make sense, then R will tell you so. \newline \newline Check this out: \newline > sex <- c("M","F","M","F","M","F","M","F") \newline > age <- c(2,3,1,4,2,4,2,1) \newline > #lm(age~sex) returns an error, but this does not \newline > sex <- as.factor(sex) \newline > lm(age~sex) \newline \newline The lm method automagically creates a dummy variable \begin_inset Quotes eld \end_inset sexM \begin_inset Quotes erd \end_inset that is used in the regression model \begin_inset Formula $y=\beta_{0}+\beta_{1}sexM$ \end_inset . 
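\newline \newline If you want to look at that coding yourself, here is a small sketch with the same made-up data. The functions contrasts() and model.matrix() are standard R; this is one way to inspect the coding, not the only one: \newline > sex <- factor(c("M","F","M","F","M","F","M","F")) \newline > contrasts(sex) \newline > model.matrix(~sex) \newline The contrasts() output shows the 0/1 value that R gives each level under the default coding, and model.matrix() shows the intercept column and the sexM column that lm works with. 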
\newline \newline Now change sex: \newline > sex <- c("M","F","M","F","M","F","I","I") \newline > sex <- as.factor(sex) \newline > lm(age~sex) \newline \newline That automagically creates 2 dummy variables, \begin_inset Quotes eld \end_inset sexM \begin_inset Quotes erd \end_inset and \begin_inset Quotes eld \end_inset sexI \begin_inset Quotes erd \end_inset and it estimates \begin_inset Formula $y=\beta_{0}+\beta_{1}sexM+\beta_{2}sexI$ \end_inset \newline \newline If you already have the numerical data, you tell R to use it as a factor with the \begin_inset Quotes eld \end_inset factor \begin_inset Quotes erd \end_inset method and then you can set the labels with the \begin_inset Quotes eld \end_inset levels \begin_inset Quotes erd \end_inset method. \newline > sex <- c(0,1,0,1,0,1,0,1) \newline > factorSex <- factor(sex,levels=0:1) \newline > levels(factorSex) <- c("M","F","I") \newline > factorSex \newline [1] M F M F M F M F \newline Levels: M F I \newline \newline I do not know if S was the first language to explicitly build in the cautious treatment of categorical variables, but I'm pretty sure it was before SAS in that regard. Now SAS has the \begin_inset Quotes eld \end_inset CLASS \begin_inset Quotes erd \end_inset option in some procedures. It works like factors in R. \layout Enumerate Graphics: The agony and the ecstasy. \begin_deeper \layout Enumerate Device. This terminology baffled me for a long time. I tend to think of a device as a machine. But that's not what R means by device. Device means an output format. The screen output device is x11(). If you want to write the output in a postscript file, use postscript(). There are also devices for jpeg and png. In MS Windows, there is a device to save windows meta file format. \layout Enumerate In Rtips I give some details about saving graphs. I think it is a pain, frankly, but here is the story. If you make a graph on the screen, you can't always save it into a file so it looks just the same. 
To be safe, it is necessary to fiddle with the graph on the screen, and then turn on an output device that writes a file, and then re-run the graph command. \end_deeper \layout Enumerate There is some stuff in R-base, but there is much much more in R addon packages. If you want to use those packages, you have to explicitly load them with the library() command. \newline \newline R is an \begin_inset Quotes eld \end_inset extensible \begin_inset Quotes erd \end_inset product, meaning that users can freely write additional capability, package it up, and distribute it. When I first heard of R, there was complete \emph on package anarchy \emph default . Now there is a recommended set of packages that the R makers put together with R when it is distributed, and on the R Internet nexus called \begin_inset Quotes eld \end_inset CRAN \begin_inset Quotes erd \end_inset one can find many more packages that work, to varying degrees. \newline \newline Example. \begin_inset Quotes eld \end_inset truehist \begin_inset Quotes erd \end_inset shows up in the output from help.search( \begin_inset Quotes eld \end_inset hist \begin_inset Quotes erd \end_inset ): \newline truehist(MASS) Plot a Histogram \newline This means that the package MASS (guess what that's short for) has another histogram method. In order to use that, you must load the package with this command \newline > library(MASS) \newline And then you can read the help page: \newline > ?truehist \newline In that writeup, you see that you get access to many more settings than you get with the ordinary hist method. \newline \newline Here are packages that I will install so that they are available in our stat lab: \begin_deeper \layout List \labelwidthstring 00.00.0000 car: a companion for applied regression by John Fox, who has published several textbooks on regression modeling, the most recent of which is from Sage, \emph on An R and S-Plus Companion to Applied Regression \emph default . 
\layout List \labelwidthstring 00.00.0000 Rcmdr: a GUI for R, also by John Fox. Type \newline > library(Rcmdr) \newline to start it up. It is being actively updated. I notice it gets better all the time. \layout List \labelwidthstring 00.00.0000 mgcv: a package for generalized additive regression and smoothing models \layout List \labelwidthstring 00.00.0000 Hmisc,Design: packages from Frank Harrell, a biomedical stats professor/researcher who previously had written procedures for SAS and S+. \layout Standard I will install other packages if you ask nicely and I agree they are needed, otherwise you can install packages in your private user space. R has documentation for how to do that. \end_deeper \layout Enumerate Model notation in R. The literature on the \begin_inset Quotes eld \end_inset generalized regression model \begin_inset Quotes erd \end_inset introduced a streamlined notation for regression models that is followed in R and some other languages. It is called \begin_inset Quotes eld \end_inset Wilkinson and Rogers \begin_inset Quotes erd \end_inset notation and R's notation is very close to W&R. It is described on pp. 75-79 in \emph on An Introduction to R. \emph default If you want to estimate a model: \begin_inset Formula \[ y_{i}=\alpha+\beta_{1}x1_{i}+\beta_{2}x2_{i}\] \end_inset Then the formula notation for that in R is \begin_inset Formula \[ y\sim x1+x2\] \end_inset Typically, the software will assume you want an estimate for the constant ( \begin_inset Formula $\alpha)$ \end_inset and that you want to estimate separate coefficients ( \begin_inset Formula $\beta_{1}$ \end_inset and \begin_inset Formula $\beta_{2})$ \end_inset for variables \begin_inset Formula $x1$ \end_inset and \begin_inset Formula $x2$ \end_inset . 
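\newline \newline Here is a small sketch with simulated data. The variable names, the sample size, and the coefficient values are all invented for illustration: \newline > set.seed(1234) \newline > x1 <- rnorm(100) \newline > x2 <- rnorm(100) \newline > y <- 2 + 0.5*x1 - 1.5*x2 + rnorm(100) \newline > myMod <- lm(y ~ x1 + x2) \newline > coef(myMod) \newline The three estimates are labeled (Intercept), x1, and x2. They should land near the invented values 2, 0.5, and -1.5, although the exact numbers depend on the random draws. 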
\newline \newline Suppose you tell R you want to estimate the formula: \begin_inset Formula \[ y\sim x1*x2\] \end_inset That will estimate a model that includes independent variables \begin_inset Formula $x1$ \end_inset and \begin_inset Formula $x2$ \end_inset as well as the \begin_inset Quotes eld \end_inset interaction \begin_inset Quotes erd \end_inset \begin_inset Formula $x1*x2$ \end_inset : \begin_inset Formula \[ y_{i}=\alpha+\beta_{1}x1_{i}+\beta_{2}x2_{i}+\beta_{3}x1_{i}*x2_{i}\] \end_inset If you happened to enter a formula like \begin_inset Formula $x1*x2*x3*x4*x5,$ \end_inset R would estimate a LOT of coefficients, because you'd get an estimate for each variable, each pair of variables multiplied together, each set of 3 variables multiplied, and so forth. \newline \newline There is another especially convenient element in R. You can use any of the mathematical functions that R has \emph on inside \emph default your formula. For example, a formula like: \newline \newline > logReg <- lm (y ~ x1+log(x2)) \newline \newline will provide estimates for a model: \newline \begin_inset Formula \[ y_{i}=\beta_{0}+\beta_{1}x1_{i}+\beta_{2}\log(x2_{i})\] \end_inset \newline I posted an example program \begin_inset Quotes eld \end_inset easyLogReg.R \begin_inset Quotes erd \end_inset to demonstrate that. You can find it in my R ExampleCode directory \newline http://www.ku.edu/~pauljohn/R/ExampleCode \layout Enumerate With R, you can do many things that other stat frameworks would not allow or facilitate. \begin_deeper \layout Enumerate In Stata and SPSS, you can only open one data set at a time. That's a crippling limitation, in my opinion. \layout Enumerate R is similar to SAS in the sense that one can run a model, save its results into various new datasets, and then do additional work with those new datasets. However, R is orders of magnitude more convenient. 
Compare the R code to illustrate the Central Limit Theorem with the mean of a Gamma variable \newline http://www.ku.edu/~pauljohn/R/ExampleCode/template_gamma4.R \newline against the SAS code for the same purpose: \newline http://lark.cc.ku.edu/~pauljohn/SASClass/ExampleCode/clt-gamma2.sas \layout Enumerate R has great facilities to handle repetitive tasks. I have run simulations that generate hundreds of output data sets. I have written R code that can systematically open each one, make plots, and calculate results. If I had to do that by hand, I'd be as insane as most Windows users. \end_deeper \layout Enumerate You can join the r-help email list and you can ask for help in there. Heed my warning. Many people will be helpful to you, but if your question betrays a total lack of effort to read the easily available documentation, then they will not be so eager to help. \newline \newline I find the r-help list is most helpful for relatively specific questions about how some specific R command can be used or about a specific malfunction that you have observed. Nobody is interested in a general email like \begin_inset Quotes eld \end_inset I can't make linear models work in R \begin_inset Quotes erd \end_inset , but they are eager to help if you ask a specific question about the syntax in the method lm(). They are also eager to point your attention to packages that provide certain functions, but they'll urge you to search on CRAN before asking. For example, if you search on CRAN and see 4 packages that can handle \begin_inset Quotes eld \end_inset multiple imputation for missing data, \begin_inset Quotes erd \end_inset it would be suitable to join r-help and ask if anybody will share their experiences with these packages. \newline \newline If you want to ask in the r-help email list, you should follow this progression. \begin_deeper \layout Enumerate Run help.search( \begin_inset Quotes eld \end_inset your topic \begin_inset Quotes erd \end_inset ) to see what pops up. 
\layout Enumerate Consult \begin_inset Quotes eld \end_inset Introduction to R, \begin_inset Quotes erd \end_inset \layout Enumerate Check the Venables & Ripley book. Check the Dalgaard book. \layout Enumerate Look through my Rtips collection of usage tidbits. That's what I do, and when I learn something new, I often put it in there. Or at least I did a while ago: \begin_inset LatexCommand \url[Rtips:]{http://www.ku.edu/~pauljohn/R/Rtips.html} \end_inset \layout Enumerate Go to the R email list online archive and search for some key words. Chances are, if you have a question, some other person already had it. You can find out about the R mailing lists and their archives on the main R web site: \begin_inset LatexCommand \url{http://www.r-project.org/} \end_inset . If you go to this page, which also keeps email list archives, \newline \begin_inset LatexCommand \url{http://maths.newcastle.edu.au/~rking/R/} \end_inset \newline at the bottom of the page you will find a SEARCH tool that is rather convenient. \the_end