Introduction to R

R is an open source software environment for statistical computing. One of the primary advantages to using R for statistical computing is that it is highly extensible—not only can you create your own functions and packages, but the R community as a whole builds and maintains thousands of such packages. For example, there are several packages designed for latent variable analysis including lavaan, OpenMx, ltm, and sem.

Another primary advantage of using R is that it is a programming language designed for statistical analysis. Working in a programming language allows very low-level customization (the analytical process tends to feel “hands on”), but it can also steepen the learning curve. In this guide we will examine the very basics of the R language, which you need in order to use the many R packages for data analysis.

First Steps

Install R

If you have not yet installed R, find it at this link: http://cran.r-project.org

We also have helpful step-by-step guides for installing R on Windows and Mac!

Using installation defaults should generally be sufficient for most users. To install packages (for instance, lavaan), open R and type this at the command prompt:

install.packages("lavaan")

Select a CRAN mirror, such as the University of Kansas, when prompted (any mirror on the same continent as your computer should be sufficient). Load the package by typing the following:

library(lavaan)
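
In a script that you run repeatedly, you may prefer to install a package only when it is missing. The following is a minimal sketch of that idea, not something prescribed by this guide. It uses the stats package as a stand-in (an assumption, chosen because stats ships with R and the example therefore needs no network connection); in practice you would put "lavaan" in its place.

```r
# Name of the package to check. "stats" is only a placeholder here;
# replace it with "lavaan" (or any other package) in your own script.
pkg <- "stats"

# requireNamespace() returns TRUE when the package is already installed
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() use the name stored in pkg
library(pkg, character.only = TRUE)
```

Because the installation step is skipped when the package is already present, the script does not contact a CRAN mirror each time it runs.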

The R Programming Language

Typically, it is best to learn programming languages by actually programming. So as you read through this guide, do the examples and play around by trying different things on your own. (Note: Type out the examples yourself, as copy/pasting from a PDF to the R terminal may not result in the expected plain-text format of certain characters, such as quotation marks.)

Have some fun! You will find cited at the end of the guide several resources I’ve found helpful (including a reference card). The following links are also basic introductions, but more thorough and extensive than what is provided here:

http://scc.stat.ucla.edu/page_attachments/0000/0141/10S-basicR.pdf
http://www.stat.auckland.ac.nz/~stat782/downloads/01-Basics.pdf

The R Environment

R is primarily an interpreted language, which means that your code is executed by an interpreter (in this case, through the R console window) rather than being compiled into machine code first. Because you work in R by simply typing in commands and having them evaluated, it is best to use a text editor to save your commands so you can run, review, and change your code as necessary.

The RGui interface contains a basic text editor, which you can open by selecting “New Script” under the “File” menu. From the text editor, you can evaluate lines of your code by pressing Ctrl-R.

The R Interpreter

In interpreted languages, what you type in at the prompt gets evaluated as a computation. When starting R, you are greeted with some text describing the version information, citation instructions, and instructions for help. Below that is the prompt: “>”

>

Expressions typed for computation at the prompt get evaluated immediately. To see this, type the simple mathematical expression 2 + 3 into the interpreter:

> 2 + 3
[1] 5

The result returned is 5.

Note!

In this guide, expressions typed by the user will be in the first (shaded) box, while the result returned by evaluating the expression will be displayed in the second (white) box, prefaced with [#].

Expressions typed by the user (in this guide) will not be prefaced with >. Anything following the pound symbol (or hash mark) “#”, typed by the user, is a comment and is not evaluated.

R Programming Fundamentals

Vectors

Programming languages are built around data and operations on that data. In many programming languages, data is stored in variables, which you can think of as boxes. Each box has a name, and you access the contents of the box by referencing its name. In R, the basic data storage device is a vector, which can contain any number of values. You can think of a vector as a named stack of boxes, each of which you can access by its position in the stack.

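A vector name can contain letters, digits, periods, and underscores, and it must begin with a letter (or with a period that is not followed by a digit). To assign to (store things in) a vector, use the assignment operator <-, which consists of two characters: < and - (less than and dash).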

The following are examples of creating vectors and storing things in them. Access the value you stored by typing the name of the vector you created into the interpreter.

A <- 3
A
[1] 3
B12 <- "a string"
B12
[1] "a string"

Vectors can also store several values, which you combine with the c() function.

x <- c(1, 2, 3)
x
[1] 1 2 3
y <- c("one", "two", "three")
y
[1] "one"   "two"   "three"

You can access individual elements of a vector using indexing. Note that R counts positions starting from 1.

x[1]
[1] 1
y[2]
[1] "two"
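
Indexing can do more than pick out one element. The short sketch below (an illustration added here, reusing the vector x from above) shows selecting several elements at once, dropping an element with a negative index, element-by-element arithmetic, and replacing a value in place.

```r
x <- c(1, 2, 3)

length(x)      # number of elements in the vector
x[c(1, 3)]     # the first and third elements
x[-1]          # everything except the first element
x * 2          # arithmetic is applied to each element in turn

x[2] <- 10     # replace the second element
x              # the vector now holds 1, 10, 3
```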

Data Frames

The second primary way you store data in R is the data frame, which is an object with rows and columns (a list of vectors). The rows are observations, the columns are variables. You can build a data frame by first creating several individual vectors and then combining them. In the following example, we create four vectors and then combine them into a data frame.

x <- 1:3 # Creates a vector of integers from n1:n2, count by 1
y <- seq(from = 4, to = 6, by = 1) # another vector from 4:6
z <- rep(7, times = 3) # a vector of 3 sevens
grp <- c(0, 1, NA) # NA indicates a missing value
dat <- data.frame(x, y, z, grp) # put each vector in a column
dat
  x y z grp
1 1 4 7   0
2 2 5 7   1
3 3 6 7  NA

If each row is an observation and each column a variable, then observation 1 has a score of 1 on x, 4 on y, 7 on z, and 0 on grp. Later we will create data frames by reading a file and describe some basic operations on them.
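
Once a data frame exists, a few base R functions let you check its size and contents. The sketch below rebuilds the same dat object so that it runs on its own; the inspection functions shown are one reasonable selection, not an exhaustive list.

```r
x <- 1:3
y <- seq(from = 4, to = 6, by = 1)
z <- rep(7, times = 3)
grp <- c(0, 1, NA)
dat <- data.frame(x, y, z, grp)

dim(dat)        # number of rows, then number of columns
nrow(dat)       # number of observations
ncol(dat)       # number of variables
str(dat)        # compact overview of each column's type and first values
dat$y           # a single column, returned as a vector
is.na(dat$grp)  # TRUE where an observation is missing on grp
```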

Functions

In R, functions are the way you get work done. Often, they take a value, do something to it, and return a new value. They have the general form function_name( argument(s) ). R has many built-in functions:

range(x) # range (min and max) of values in a vector
[1] 1 3
sum(x) # sum of all values in a vector
[1] 6
prod(x) # product of all values in one vector
[1] 6

As you saw, each function was applied to all the values in the vector. This also works for data frames:

sum(dat)
[1] NA

The sum() function returns a missing value (NA) because there are missing values in the data frame. To remove those missing values and calculate the sum of all observed values, include the na.rm (NA-remove) argument:

sum(dat, na.rm = TRUE)
[1] 43

Functions typically have a large number of arguments that allow you to specify in more detail how the function runs, and what output it produces. To see these options for a function, type a question mark in front of the function name:

?mean
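
The same na.rm argument is available in mean() and several other summary functions, and you can also write functions of your own. Here is a minimal sketch; count_missing is a hypothetical name made up for this illustration, not a function that comes with R.

```r
grp <- c(0, 1, NA)

mean(grp)                # NA, because one value is missing
mean(grp, na.rm = TRUE)  # the mean of the observed values 0 and 1

# A user-defined function (hypothetical name) that counts missing values
count_missing <- function(v) {
  sum(is.na(v))          # is.na() gives TRUE/FALSE; sum() counts the TRUEs
}

count_missing(grp)
args(mean)               # prints the arguments that mean() accepts
```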

Setting the Working Directory

Using R requires you to think in a fundamentally different way about working with your files. Instead of opening a script or reading in data by browsing to a file, you can tell R to work within the directory that contains your files. Essentially, you can think about it as moving R to your files instead of moving your files to R. To do this, first find your current working directory.

getwd() # notice that Windows backslashes are shown as forward slashes
[1] "/home/pauljohn/GIT/CRMDA/guides/20.IntroToR"

Then set your new working directory (say, where you have your data files & R scripts), e.g.:

setwd("C:/Users/Username/Documents/My_Project")
setwd("E:/SEM/Project_1") # on Windows, must change "\" to "/"

Now you can open files by name in this directory rather than by file path. To show a list of filenames in the current directory you can use the following function:

dir()
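
To see the effect of the working directory without touching your own folders, you can try the following sketch. It assumes nothing about your file system: tempdir() is a scratch directory that R creates for each session, and example.txt is a made-up file name.

```r
old_dir <- getwd()                  # remember where we started

setwd(tempdir())                    # move R into the scratch directory
writeLines("hello", "example.txt")  # create a small file using only its name

file.exists("example.txt")          # the name is resolved relative to the working directory
dir(pattern = "example")            # list files whose names contain "example"

setwd(old_dir)                      # move back to the original directory
```

Saving the original directory first and restoring it at the end keeps the rest of your session unaffected.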

Reading Data into R

To read in data, you will use one of the following functions. Each takes a filename as one of the arguments, and returns a data frame object.

dat <- read.table("file") # Reads any file in table format
PosNeg <- read.csv("file", header = TRUE) # Reads a *.csv file

If your file has no header (i.e., no variable names in the first row), set header = FALSE. The read.csv() function (CSV stands for “comma-separated values”) is essentially the read.table() function with different defaults: sep = "," instead of sep = "" (any whitespace), and header = TRUE instead of header = FALSE. If your data is separated by tabs, set sep = "\t" or use the read.delim() function:

myData <- read.delim("filename")

All these functions have important options, including sep, quote, row.names, na.strings, and skip. Particularly important is the option na.strings, which tells R which values in the file to treat as missing. For instance, if the value -999 marks a missing value, specify the argument na.strings = "-999". See how to use these options by typing

?read.table
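
The following sketch shows na.strings at work. So that it runs anywhere, it first writes a tiny CSV file to a temporary location; the file contents and the object name scores are made up for illustration.

```r
# Create a small CSV file in a temporary location (made-up data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score,group",
             "1,10,a",
             "2,-999,b",
             "3,30,a"),
           csv_path)

# Read it back, treating -999 as a missing value
scores <- read.csv(csv_path, header = TRUE, na.strings = "-999")
scores
summary(scores$score)   # the missing value is counted separately
```

Without the na.strings argument, -999 would be read as an ordinary number and would distort any mean or sum computed from the score column.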

For ease of use, stick to plain-text or *.csv files, and use Excel to convert other file formats when necessary. Once you have read in your data, you can inspect the data frame with some of the following functions. This is an important step that helps you verify that the data were read in correctly.

names(dat) # or colnames(dat) for names of variables
head(dat, 10) # view the first 10 lines of the data.frame
summary(dat) # a statistical summary of each variable

It may be necessary or desirable to change the default names of the variables in the data frame after it has been read in. To do this, first create a vector of variable names, with one name for each column:

varnames <- c("name1", "name2", "name3", "name4")

Then assign this vector to the column names of the data frame:

colnames(dat) <- varnames
head(dat)
  name1 name2 name3 name4
1     1     4     7     0
2     2     5     7     1
3     3     6     7    NA
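
If you only want to rename one variable, you do not need to retype every name. A minimal sketch, using a small made-up data frame so that it runs on its own:

```r
dat <- data.frame(x = 1:3, y = 4:6, grp = c(0, 1, NA))

# Replace only the name that currently equals "grp"
names(dat)[names(dat) == "grp"] <- "group"
names(dat)
```

The logical test inside the brackets picks out the position of the old name, so the other names are left untouched.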

Summarizing Data in Tables and Figures

To summarize univariate or bivariate categorical data, you can print a table of frequencies (counts). Just specify the one or two vectors you want to describe. Here is a univariate table of a variable u1 with 3 categories (coded 0, 1, and 2):

myData <- read.table("http://www.statmodel.com/usersguide/chap3/ex3.12.dat")
names(myData) <- c("u1", "u2", "u3", "x1", "x2", "x3")
table(myData$u1)

  0   1   2 
199 113 188 

Here is a bivariate table of variables u1 & u2 with 3 and 2 categories, respectively:

table(myData$u1, myData$u2)
   
      0   1
  0 161  38
  1  61  52
  2  42 146

To add marginal counts to the table, save the table as an object, then use the addmargins() function:

myTable <- table(myData$u1, myData$u2)
addmargins(myTable)
     
        0   1 Sum
  0   161  38 199
  1    61  52 113
  2    42 146 188
  Sum 264 236 500

You can also see proportions instead of counts by using the prop.table() function (multiply by 100 to obtain percentages):

prop.table(myTable)
   
        0     1
  0 0.322 0.076
  1 0.122 0.104
  2 0.084 0.292

To get row proportions or column proportions, specify the first or second dimension of the table as the second argument:

prop.table(myTable, 1)
   
            0         1
  0 0.8090452 0.1909548
  1 0.5398230 0.4601770
  2 0.2234043 0.7765957
prop.table(myTable, 2)
   
            0         1
  0 0.6098485 0.1610169
  1 0.2310606 0.2203390
  2 0.1590909 0.6186441

The aggregate() function can be used to make a table of summary statistics (e.g., the mean or standard deviation) of a continuous variable within each level of one or more categorical variables:

aggregate(x1 ~ u1, data = myData, mean)
  u1          x1
1  0 -0.70744944
2  1  0.03680895
3  2  0.84773632
aggregate(x1 ~ u1 + u2, data = myData, sd)
  u1 u2        x1
1  0  0 0.8881932
2  1  0 0.6171643
3  2  0 0.6392670
4  0  1 0.7372694
5  1  1 0.6219115
6  2  1 0.8917143

A corresponding boxplot can be made using the boxplot() function:

boxplot(x1 ~ u1 + u2, data = myData)

An appropriate univariate plot of a continuous variable is a histogram:

hist(myData$x1, breaks = 30) # the breaks argument is optional

A bivariate relationship between two continuous variables (e.g., a correlation) can be represented with a scatterplot:

plot(x2 ~ x1, data = myData, main = "My Scatterplot")

Note!

A title can be added to any graph with main, and labels can be added to the x and y axes using xlab and ylab.

There are many ways to customize graphs. See ?plot for more options.
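
As an illustration of main, xlab, and ylab, here is a small sketch that uses simulated data, so it does not depend on the example data set. The variable names, the seed, and the 0.5 slope are arbitrary choices made for this example.

```r
set.seed(1)                    # make the simulated data reproducible
x1 <- rnorm(100)               # simulated predictor
x2 <- 0.5 * x1 + rnorm(100)    # simulated outcome related to x1

plot(x2 ~ x1,
     main = "Simulated Scatterplot",
     xlab = "Predictor (x1)",
     ylab = "Outcome (x2)")

# hist() draws the plot and also returns the bin counts it used
h <- hist(x1, breaks = 10, main = "Histogram of x1", xlab = "x1")
```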

Distinguishing Data Frames and Matrices: Choosing Subsets

Matrices are like vectors, but they are 2-dimensional. (It is also possible to have arrays with 3 or more dimensions in R.) Like a vector, every element in a matrix must be of the same type (e.g., all numeric or all character). A data frame, on the other hand, is merely a list of vectors, so each column/vector of a data frame can be a different type (one numeric, one character, etc.). Because all vectors in a data frame must have the same length, it resembles a matrix in that it is rectangular (N rows by P columns, where each column in a data frame is really a separate vector).
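
The difference in how the two objects handle types can be seen in a short sketch. The object names m and d are made up for this illustration, and stringsAsFactors = FALSE is set explicitly so that the result does not depend on the R version.

```r
# A matrix holds one type: mixing numbers and text converts everything to text
m <- cbind(1:3, c("a", "b", "c"))
typeof(m)

# A data frame keeps each column's own type
d <- data.frame(num = 1:3, chr = c("a", "b", "c"), stringsAsFactors = FALSE)
sapply(d, class)

dim(m)   # 3 rows and 2 columns
dim(d)   # also 3 rows and 2 columns
```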

In both matrices and data frames, you can access specific rows and columns using brackets after the name of the object, separating row numbers/names and column numbers/names with a comma: myData[row, col]. Reading the first four rows of data is the same as using the head() function:

head(myData, 4)
  u1 u2 u3        x1        x2        x3
1  1  0  2  0.573051 -0.175230 -1.339954
2  1  1  2 -0.577052  0.425472  0.179867
3  0  0  0 -0.694153 -0.766538  0.455033
4  0  0  0 -0.817974 -1.559255  0.579605
myData[1:4, ]
  u1 u2 u3        x1        x2        x3
1  1  0  2  0.573051 -0.175230 -1.339954
2  1  1  2 -0.577052  0.425472  0.179867
3  0  0  0 -0.694153 -0.766538  0.455033
4  0  0  0 -0.817974 -1.559255  0.579605

Notice that leaving the columns space empty means that you select all columns (likewise if you leave the rows space empty). You can select rows or columns by specifying the appropriate numbers or the corresponding names:

myData[1:4, c(1:3, 6)]
  u1 u2 u3        x3
1  1  0  2 -1.339954
2  1  1  2  0.179867
3  0  0  0  0.455033
4  0  0  0  0.579605
myData[1:4, c("u1", "u2", "u3", "x3")]
  u1 u2 u3        x3
1  1  0  2 -1.339954
2  1  1  2  0.179867
3  0  0  0  0.455033
4  0  0  0  0.579605

Data frames (but not matrices) are also lists (i.e., a list of column vectors), so their columns can also be selected using the dollar sign after the object name. (Note that only one column at a time can be specified this way.)

head(myData$x1)
[1]  0.573051 -0.577052 -0.694153 -0.817974  0.463916 -0.096545

Logical vectors can be used to specify rows that match certain criteria. Logical vectors consist of values that are either TRUE or FALSE (in all caps). For example, to see whether each value of x1 is non-negative (greater than or equal to zero):

head(myData$x1 >= 0, n = 25)
 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
[12] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
[23] FALSE FALSE FALSE

To compare two values, the possible commands are:

Commonly Used Commands and What They Do
Command    Interpretation
==         equal
!=         not equal
>          greater than
<          less than
>=         greater than or equal to
<=         less than or equal to
A %in% B   test whether each element in vector A is one of the elements in vector B

We can conduct logical tests using Boolean operators:

Boolean Operators and What They Do
Command  Interpretation  Example
&        and             T & T = T;  T & F = F;  F & T = F;  F & F = F
|        or              T | T = T;  T | F = T;  F | T = T;  F | F = F
!        not             !T = F;  !F = T
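
These operators work element by element on whole vectors, which is what makes them useful for selecting rows. A minimal sketch with made-up vectors a, b, and v:

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a & b      # TRUE only where both a and b are TRUE
a | b      # TRUE where at least one of a and b is TRUE
!a         # flips each value of a

v <- c(3, 0, 2, 5)
v >= 2             # compare each element of v with 2
v %in% c(0, 5)     # is each element of v equal to 0 or 5?
v >= 2 & v != 5    # combine two comparisons
```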

You can place a vector of logical values corresponding to some criterion (say, only levels 0 and 1 of the variable u1) in the rows bracket, and for every Nth element that is TRUE, that Nth row will be selected:

## in the first 15 rows, the 5th, 7th, 8th, 10th, 14th, and 15th are FALSE
head(myData$u1 %in% c(0, 1), n = 15)
 [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
[12]  TRUE  TRUE FALSE FALSE
## rows where the test is FALSE (e.g., the 5th, 7th, 8th, and 10th) are omitted
head(myData[myData$u1 %in% c(0, 1), ], n = 15)
   u1 u2 u3        x1        x2        x3
1   1  0  2  0.573051 -0.175230 -1.339954
2   1  1  2 -0.577052  0.425472  0.179867
3   0  0  0 -0.694153 -0.766538  0.455033
4   0  0  0 -0.817974 -1.559255  0.579605
6   0  0  1 -0.096545 -0.352276  0.253673
9   0  0  0  0.761720 -1.901134 -2.223851
11  1  1  0 -0.295120  0.881524  0.966334
12  1  1  1 -0.320148  0.297111  0.508650
13  1  1  1 -0.805411  0.766234  0.496932
19  1  1  1 -0.180285  0.271133 -0.162098
20  0  0  0 -1.394741 -0.969746 -2.090380
21  0  0  2  0.446963  0.080041 -0.838566
23  0  0  0 -0.064293 -0.935022 -0.971525
24  0  0  1 -1.530256  0.228302 -0.469326
25  0  0  0 -0.090957 -0.097848  0.167197

Add the additional condition that values of x1 are non-negative:

myRows <- myData$u1 %in% c(0, 1) & myData$x1 >= 0
head(myData[myRows, ], n = 10)
   u1 u2 u3       x1        x2        x3
1   1  0  2 0.573051 -0.175230 -1.339954
9   0  0  0 0.761720 -1.901134 -2.223851
21  0  0  2 0.446963  0.080041 -0.838566
30  1  0  3 0.361164  0.628857 -0.597038
44  1  0  1 0.147681  0.611826 -0.148284
47  1  0  1 0.211655 -0.799270 -0.773233
52  1  0  2 0.402807  0.365539 -0.894718
55  1  0  3 0.250665  0.868009 -0.406890
64  1  1  0 0.998745 -1.785965  1.313883
67  1  0  2 1.991547 -1.139637 -2.261151
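
The same kind of selection can be tried on a small made-up data frame, which makes it easy to check the result by eye. The sketch below also shows sum() and which() applied to a logical vector, and the base R function subset() as an alternative way of writing the selection; the object names d and keep are hypothetical.

```r
d <- data.frame(u1 = c(1, 1, 0, 2, 0),
                x1 = c(0.5, -0.6, -0.7, 0.4, 0.8))

keep <- d$u1 %in% c(0, 1) & d$x1 >= 0
sum(keep)      # how many rows satisfy the condition (TRUE counts as 1)
which(keep)    # the positions of those rows
d[keep, ]      # the selected rows

# subset() expresses the same selection without repeating the data frame name
subset(d, u1 %in% c(0, 1) & x1 >= 0)
```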

List of Functions/Operators

Name                              Description                                                             Example Usage
install.packages()                Installs packages                                                       install.packages("packageName")
library()                         Loads packages                                                          library(packageName)
c()                               Builds a vector                                                         x <- c(1, 2, 3)
list()                            Builds a list                                                           y <- list(1:3, 1, 2, 3)
data.frame()                      Builds data frame from variables/vectors                                myData <- data.frame(x, y, z)
:                                 Creates a vector with values from x : y                                 x <- 1:3
range()                           Returns the min and max elements in a vector                            range(x)
sum()                             Calculates the sum of elements in a vector                              sum(x)
mean()                            Calculates the mean of the elements in a vector                         mean(x)
prod()                            Calculates the product of elements in a vector                          prod(x)
cor()                             Calculates correlations of 2 vectors or all data                        cor(x, y); cor(myData)
getwd()                           Returns the current working directory                                   getwd()
setwd()                           Sets the current working directory                                      setwd("C:/mypath/data/")
dir()                             Returns files in the current working directory                          dir()
?                                 Launches the documentation for a function                               ?mean
read.table()                      Read data from files (in or relative to the current working directory)  myData <- read.table("x.dat")
read.csv()                        Read data from files (in or relative to the current working directory)  myData <- read.csv("x.csv")
read.delim()                      Read data from files (in or relative to the current working directory)  myData <- read.delim("x.dat")
names()                           Returns the names of variables in a data frame                          names(myData)
colnames()                        Returns the names of variables in a data frame                          colnames(myData)
colnames()                        Can also assign the column names                                        newNames <- c("x1", "x2", "y"); colnames(myData) <- newNames
head()                            Returns the first N (e.g., 6 or 10) rows in a data frame                head(myData, n = 10)
tail()                            Returns the last N (e.g., 6 or 10) rows in a data frame                 tail(myData, n = 6)
summary()                         Calculates summary statistics for variables in a data frame             summary(myData)
aggregate()                       Calculates summary statistics for variables in a data frame             aggregate(y ~ group, data = myData, mean)
Relational Operators              Comparison of values in vectors                                         x < y; x > y; x <= y; x >= y; x == y; x != y; x %in% y
Binary Arithmetic Operators       Perform arithmetic on numeric values                                    x + y; x - y; x * y; x / y; x ^ y or x ** y (power); x %% y (modulo); x %/% y (integer division)
Logical Operators                 Boolean tests of logical values                                         !x; x & y; x | y
Extractors                        Extract elements of a matrix or a data frame                            x[i] (for 1-dimensional vectors); x[i, j]; x$j (only for data frames)
Plot Functions                    Creates figures                                                         plot(y ~ x); hist(y); boxplot(y ~ x)
Frequency and Contingency Tables  View tables of counts/percentages of categorical variable(s)            table(x); addmargins(table(x)); prop.table(table(x, y), 1)

References

Crawley, M. J. (2007). The R Book. West Sussex, England: John Wiley & Sons, Ltd.

Hornik, K. (2009). The R FAQ. Retrieved from http://CRAN.R-project.org/doc/FAQ/R-FAQ.html

R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

Short, T. (2004, November 7). R Reference Card. Retrieved July 2010, from http://cran.r-project.org/doc/contrib/Short-refcard.pdf

Session Info

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.10

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] crmda_0.27

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7     digest_0.6.10   assertthat_0.1  plyr_1.8.4     
 [5] xtable_1.8-2    formatR_1.4     magrittr_1.5    evaluate_0.10  
 [9] stringi_1.1.2   openxlsx_3.0.0  rmarkdown_1.1   tools_3.3.2    
[13] stringr_1.1.0   kutils_0.42     yaml_2.1.13     htmltools_0.3.5
[17] knitr_1.14      tibble_1.2     

Available under Creative Commons license 3.0 CC BY