crmda@ku.edu

Abstract

This guide is designed to introduce brand new users of R to its basic functionality. It will cover installation and loading of packages, importing data, working with vectors, data frames, and functions, and summarizing data in tables and figures. It will not cover statistical analyses or more advanced topics such as simulating data, creating user-defined functions, or using control-flow operators such as`for`

loops or `if`

logic.
R is an open source software environment for statistical computing. One of the primary advantages to using R for statistical computing is that it is highly extensible—not only can you create your own functions and packages, but the R community as a whole builds and maintains thousands of such packages. For example, there are several packages designed for latent variable analysis including `lavaan`

, `OpenMx`

, `ltm`

, and `sem`

.

Another primary advantage for using R is that it is a programming language designed for statistical analysis. Using a programming language allows for very low-level customization (the analytical process tends to feel “hands on”) but can also increase the learning curve. In this guide we will examine the very basics of the R language, which are necessary to use the many packages for data analysis in R.

If you have not yet installed R, find it at this link: http://cran.r-project.org

We also have helpful step-by-step guides for installing R on Windows and Mac!

Using installation defaults should generally be sufficient for most users. To install packages (for instance, `lavaan`

), open R and type this at the command prompt:

`install.packages(“lavaan”)`

Select a CRAN mirror such as University of Kansas when prompted (any in the same continent as your computer should be sufficient). Load the package by typing the following:

`library(lavaan)`

Typically, it is best to learn programming languages by actually programming. So as you read through this guide, do the examples and play around by trying different things on your own. (Note: Type out the examples yourself, as copy/pasting from a PDF to the R terminal may not result in the expected plain-text format of certain characters, such as quotation marks.)

Have some fun! You will find cited at the end of the guide several resources I’ve found helpful (including a reference card). The following links are also basic introductions, but more thorough and extensive than what is provided here:

http://scc.stat.ucla.edu/page_attachments/0000/0141/10S-basicR.pdf

http://www.stat.auckland.ac.nz/~stat782/downloads/01-Basics.pdf

R is primarily an interpreted language, which means that syntax is executed and run through an interpreter (in this case the R console window) rather than compiled into machine code. Because you work in R by simply typing in commands and having them evaluated, it is best to use a text editor to save your commands so you can run, review, and change your code as necessary.

The RGui interface contains a basic text editor which you can open by selecting “New Script” under the “File” menu. From the text editor, you can evaluate lines of your code by hitting Ctrl- R.

In interpreted languages, what you type in at the prompt gets evaluated as a computation. When starting R, you are greeted with some text describing the version information, citation instructions, and instructions for help. Below that is the prompt: “>”

`>`

Expressions typed for computation at the prompt get evaluated immediately. To see this, type the simple mathematical expression 2 + 3 into the interpreter:

`> 2 + 3`

`[1] 5`

The result returned is 5.

In this guide, expressions typed by the user will be in the first (shaded) box, while the result returned by evaluating the expression will be displayed in the second (white) box, prefaced with [#].

Expressions typed by the user (in this guide) will not be prefaced with `>`

. Anything following the pound symbol (or hash mark) “#”, typed by the user, is a comment and is not evaluated.

Programming languages are built around data, and doing operations on that data. In many programming languages, data is stored in variables—just think of them as boxes. Each box has a name, and you access the contents of the box by referencing its name. In R, the basic data storage device is a vector, which can contain any number of values. You can think of a vector as a stack of boxes, each of which you can access by referencing its name.

A vector can be named any sequence of characters. To assign to (store things in) a vector, you use the assignment key `<-`

, which is obviously 2 characters: `<`

and `-`

(less than and dash).

The following are examples of creating vectors and storing things in them. Access the value you stored by typing the name of the vector you created into the interpreter.

```
A <- 3
A
```

`[1] 3`

```
B12 <- "a string"
B12
```

`[1] "a string"`

Vectors can also store lists of values.

```
x <- c(1, 2, 3)
x
```

`[1] 1 2 3`

```
y <- c("one", "two", "three")
y
```

`[1] "one" "two" "three"`

You can access individual items in lists using indexing.

`x[1]`

`[1] 1`

`y[2]`

`[1] "two"`

The second primary way you store data in R is the data frame, which is an object with rows and columns (a list of vectors). The rows are observations, the columns are variables. You can build a data frame by first creating several individual vectors and then combining them. In the following example, we create three different vectors of lists and then combine them into a data frame.

```
x <- 1:3 # Creates a vector of integers from n1:n2, count by 1
y <- seq(from = 4, to = 6, by = 1) # another vector from 4:6
z <- rep(7, times = 3) # a vector of 3 sevens
grp <- c(0, 1, NA) # “NA” indicates a missing value
dat <- data.frame(x, y, z, grp) # put each vector in a column
dat
```

```
x y z grp
1 1 4 7 0
2 2 5 7 1
3 3 6 7 NA
```

If each row is an observation and each column a variable, then observation 1 has a score of 1 on x, 4 on y, and 7 on z Later we will create data frames by reading a file, and describe how to do some basic operations on them.

In R, functions are the way you get work done. Often, they take a value, do something to it, and return a new value. They have the general form `function_name( argument(s) )`

. R has many built in functions:

`range(x) # range (min and max) of values in a vector`

`[1] 1 3`

`sum(x) # sum of all values in a vector`

`[1] 6`

`prod(x) # product of all values in one vector`

`[1] 6`

As you saw, the function applied itself to all the values in the vector. This also works for data frames:

`sum(dat)`

`[1] NA`

The `sum()`

function returns a missing value (`NA`

) because there are missing values in the data frame. To remove those missing values and calculate the sum of all observed values, include the `na.rm`

(NA-remove) argument:

`sum(dat, na.rm = TRUE)`

`[1] 43`

Functions typically have a large number of arguments that allow you to specify in more detail how the function runs, and what output it produces. To see these options for a function, type a question mark in front of the function name:

`?mean`

Using R requires you to think in a fundamentally different way about working with your files. Instead of opening a script or reading in data by browsing to that file, you can actually tell R to work within that directory itself. Essentially, you can think about it as moving R to your files instead of moving your files to R. To do this, first find your current working directory.

`getwd() # notice backslashes in Windows are changed to forward`

`[1] "/home/pauljohn/GIT/CRMDA/guides/20.IntroToR"`

Then set your new working directory (say, where you have your data files & R scripts), e.g.:

```
setwd(“C:/Users/Username/Documents/My_Project”)
setwd(“E:/SEM/Project_1”) # on Windows, must change “\” to “/”
```

Now you can open files by name in this directory rather than by file path. To show a list of filenames in the current directory you can use the following function:

`dir()`

To read in data, you will use one of the following functions. Each takes a filename as one of the arguments, and returns a data frame object.

```
dat <- read.table(“file”) # Reads any file in table format
PosNeg <- read.csv(“file”, header = TRUE) # Reads a *.csv file
```

If you don’t have a header (i.e., no variable names), set `header = FALSE`

. The `read.csv()`

function (CSV stands for “comma-separated values) is simply the `read.table()`

function with the argument `sep = “,”`

instead of the default `sep = “ ”`

(a space). If your data is separated by a tab instead of a space, set `sep = “\t”`

or use the `read.delim()`

function:

`myData <- read.delim(“filename”)`

All these functions have important options, including `sep`

, `quote`

, `row.names`

, `na.strings`

, and `skip`

. Particularly important is the option `na.strings`

, which tells R what to interpret missing values as. For instance, if the value −999 means it is a missing value, specify the argument `na.strings = “-999”`

. See how to use these options by typing

`?read.table`

For ease of use, stick to text files or *.csv, and use excel to convert file formats when necessary. Once you have read in your data, you can work with your data frame with some of the following functions/methods. This is an important step that helps you verify that the data were read in correctly.

```
names(dat) # or colnames(dat) for names of variables
head(dat, 10) # view the first 10 lines of the data.frame
summary(dat) # a statistical summary of each variable
```

It may be necessary or desirable to change the default names of the variables in the data frame after it has been read in. To do this, first create a vector of variable names –

`varnames <- c("name1", "name2", "name3")`

Then assign this vector of variable names to the variable names of the data frame. To do this, we get the original column names, and assign to it the new vector we created.

```
colnames(dat) <- varnames
head(dat)
```

```
name1 name2 name3 NA
1 1 4 7 0
2 2 5 7 1
3 3 6 7 NA
```

To plot univariate or bivariate categorical data, you can print a table or frequencies/counts. Just specify the two vectors you want to describe. Here is a univariate table of a variable `u1`

with 3 categories (coded 0, 1, and 2):

```
myData <- read.table("http://www.statmodel.com/usersguide/chap3/ex3.12.dat")
names(myData) <- c("u1", "u2", "u3", "x1", "x2", "x3")
table(myData$u1)
```

```
0 1 2
199 113 188
```

Here is a bivariate table of variables `u1`

& `u2`

with 3 and 2 categories, respectively:

`table(myData$u1, myData$u2)`

```
0 1
0 161 38
1 61 52
2 42 146
```

To add marginal counts to the table, save the table as an object, then use the `addmargins()`

function:

```
myTable <- table(myData$u1, myData$u2)
addmargins(myTable)
```

```
0 1 Sum
0 161 38 199
1 61 52 113
2 42 146 188
Sum 264 236 500
```

You can also see percentages instead of counts by using the `prop.table()`

function:

`prop.table(myTable)`

```
0 1
0 0.322 0.076
1 0.122 0.104
2 0.084 0.292
```

To specify row percentages or column percentages, specify the first or second dimension of the table:

`prop.table(myTable, 1)`

```
0 1
0 0.8090452 0.1909548
1 0.5398230 0.4601770
2 0.2234043 0.7765957
```

`prop.table(myTable, 2)`

```
0 1
0 0.6098485 0.1610169
1 0.2310606 0.2203390
2 0.1590909 0.6186441
```

The `aggregate()`

function can be used to make a table of continuous measures (e.g., mean or standard deviation) in each level of one or more categorical variables:

`aggregate(x1 ~ u1, data = myData, mean)`

```
u1 x1
1 0 -0.70744944
2 1 0.03680895
3 2 0.84773632
```

`aggregate(x1 ~ u1 + u2, data = myData, sd)`

```
u1 u2 x1
1 0 0 0.8881932
2 1 0 0.6171643
3 2 0 0.6392670
4 0 1 0.7372694
5 1 1 0.6219115
6 2 1 0.8917143
```

A corresponding boxplot can be made using `boxplot()`

function:

`boxplot(x1 ~ u1 + u2, data = myData)`

An appropriate univariate plot of a continuous variable is a histogram:

`hist(myData$x1, breaks = 30) # “breaks = __” is optional`

A bivariate relationship between two continuous variables (e.g., a correlation) can be represented with a scatterplot:

`plot(x2 ~ x1, data = myData, main = "My Scatterplot")`

A title can be added to any graph with `main`

, and labels can be added to the `x`

and `y`

axes using `xlab`

and `ylab`

.

There are many ways to customize graphs. See `?plot`

for more options.

Matrices are like vectors, but they are 2-dimensional. (It is also possible to have 3 or more dimensional arrays in R.) Like a vector, every element in a matrix must be of the same type (e.g., all numeric or all characters). A data frame, on the other hand, is a merely list of vectors, so each column/vector of a data frame can be a different type (one numeric, one character, etc.). Because all vectors in a data frame must have the same length, it resembles a matrix in that it is square (N rows by P columns, where each column in a data frame is really a separate vector).

In both matrices and data frames, you can access specific rows and columns using brackets after the name of the object, separating row numbers/names and column numbers/names with a comma: `myData[row, col]`

. Reading the first four rows of data is the same as using the `head()`

function:

`head(myData, 4)`

```
u1 u2 u3 x1 x2 x3
1 1 0 2 0.573051 -0.175230 -1.339954
2 1 1 2 -0.577052 0.425472 0.179867
3 0 0 0 -0.694153 -0.766538 0.455033
4 0 0 0 -0.817974 -1.559255 0.579605
```

`myData[1:4, ]`

```
u1 u2 u3 x1 x2 x3
1 1 0 2 0.573051 -0.175230 -1.339954
2 1 1 2 -0.577052 0.425472 0.179867
3 0 0 0 -0.694153 -0.766538 0.455033
4 0 0 0 -0.817974 -1.559255 0.579605
```

Notice that leaving the columns space empty means that you select all columns (likewise if you leave the rows space empty). You can select rows or columns by specifying the appropriate numbers or the corresponding names:

`myData[1:4, c(1:3, 6)]`

```
u1 u2 u3 x3
1 1 0 2 -1.339954
2 1 1 2 0.179867
3 0 0 0 0.455033
4 0 0 0 0.579605
```

`myData[1:4, c("u1", "u2", "u3", "x3")]`

```
u1 u2 u3 x3
1 1 0 2 -1.339954
2 1 1 2 0.179867
3 0 0 0 0.455033
4 0 0 0 0.579605
```

Data frames (but not matrices) are also lists (i.e., a list of column vectors), so their columns can also be selected using the dollar sign after the object name. (Note that only one column at a time can be specified this way)

`head(myData$x1)`

`[1] 0.573051 -0.577052 -0.694153 -0.817974 0.463916 -0.096545`

Logical vectors can be used to specify rows that match certain criteria. Logical vectors consist of values that are either TRUE or FALSE (in all caps). For example, to see whether each value of `x1`

is positive:

`head(myData$x1 >= 0, n = 25)`

```
[1] TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
[12] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
[23] FALSE FALSE FALSE
```

To compare two values, the possible commands are

Command | Interpretation |
---|---|

`==` |
equal |

`!=` |
not equal |

`>` |
greater than |

`<` |
less than |

`>=` |
greater than or equal to |

`<=` |
less than or equal to |

`A %in% B` |
test whether each element in vector `A` is one of the elements in vector `B` |

We can conduct logical tests using Boolean operators:

Command | Interpretation | Example |
---|---|---|

`&` |
and | `T & T = T` `T & F = F` `F & T = F` `F & F = F` |

`|` |
or | `T | T = T` `T | F = T` `F | T = T` `F | F = F` |

`!` |
not | `!T = F` `!F = T` |

You can place a vector of logical values corresponding to some criterion (say, only levels 0 and 1 of the variable `u1`

) in the rows bracket, and for every N^{th} element that is TRUE, that N^{th} row will be selected:

```
## the 5th, 7th, 8th, and 10th rows are FALSE
head(myData$u1 %in% c(0, 1), n = 15)
```

```
[1] TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
[12] TRUE TRUE FALSE FALSE
```

```
## the 5th, 7th, 8th, and 10th rows are omitted
head(myData[myData$u1 %in% c(0, 1), ], n = 15)
```

```
u1 u2 u3 x1 x2 x3
1 1 0 2 0.573051 -0.175230 -1.339954
2 1 1 2 -0.577052 0.425472 0.179867
3 0 0 0 -0.694153 -0.766538 0.455033
4 0 0 0 -0.817974 -1.559255 0.579605
6 0 0 1 -0.096545 -0.352276 0.253673
9 0 0 0 0.761720 -1.901134 -2.223851
11 1 1 0 -0.295120 0.881524 0.966334
12 1 1 1 -0.320148 0.297111 0.508650
13 1 1 1 -0.805411 0.766234 0.496932
19 1 1 1 -0.180285 0.271133 -0.162098
20 0 0 0 -1.394741 -0.969746 -2.090380
21 0 0 2 0.446963 0.080041 -0.838566
23 0 0 0 -0.064293 -0.935022 -0.971525
24 0 0 1 -1.530256 0.228302 -0.469326
25 0 0 0 -0.090957 -0.097848 0.167197
```

Add the additional condition that values of `x1`

are positive:

```
myRows <- myData$u1 %in% c(0, 1) & myData$x1 >= 0
head(myData[myRows, ], n = 10)
```

```
u1 u2 u3 x1 x2 x3
1 1 0 2 0.573051 -0.175230 -1.339954
9 0 0 0 0.761720 -1.901134 -2.223851
21 0 0 2 0.446963 0.080041 -0.838566
30 1 0 3 0.361164 0.628857 -0.597038
44 1 0 1 0.147681 0.611826 -0.148284
47 1 0 1 0.211655 -0.799270 -0.773233
52 1 0 2 0.402807 0.365539 -0.894718
55 1 0 3 0.250665 0.868009 -0.406890
64 1 1 0 0.998745 -1.785965 1.313883
67 1 0 2 1.991547 -1.139637 -2.261151
```

Name | Description | Example Usage |
---|---|---|

`install.packages()` |
Installs packages | `install.packages("packageName")` |

`library()` |
Loads packages | `library(packageName)` |

`c()` |
Builds a vector | `x <- c(1, 2, 3)` |

`list()` |
Builds a list | `y <- list(1:3, 1, 2, 3)` |

`data.frame()` |
Builds data frame from variables/vectors | `myData <- data.frame(x, y, z)` |

`:` |
Creates a vector with values from x : y | `x <- 1:3` |

`range()` |
Returns the min and max elements in a vector | `range(x)` |

`sum()` |
Calculates the sum of elements in a vector | `sum(x)` |

`mean()` |
Calculates the mean of the elements in a vector | `mean(x)` |

`prod()` |
Calculates the product of elements in a vector | `prod(x)` |

`cor()` |
Calculates correlations of 2 vectors or all data | `cor(x, y)` `cor(myData)` |

`getwd()` |
Returns the current working directory | `getwd()` |

`setwd()` |
Sets the current working directory | `setwd("C:/mypath/data/")` |

`dir()` |
Returns files in the current working directory | `dir()` |

`?` |
Launches the documentation for a function | `?mean` |

`read.table()` |
Read data from files (in or relative to the current working directory) | `myData <- read.table("x.dat"` |

`read.csv()` |
Read data from files (in or relative to the current working directory) | `myData <- read.csv("x.csv")` |

`read.delim()` |
Read data from files (in or relative to the current working directory) | `myData <- read.delim("x.dat")` |

`names()` |
Returns the names of variables in a data frame | `names(myData)` |

`colnames()` |
Returns the names of variables in a data frame | `colnames(myData)` |

`colnames()` |
Can also assign the column names | `newNames <- c("x1", "x2", "y")` `colnames(myData) <- newNames` |

`head()` |
Returns the first N (e.g., 6 or 10) rows in a data frame |
`head(myData, n = 10)` |

`tail()` |
Returns the last N (e.g., 6 or 10) rows in a data frame |
`tail(myData, n = 6)` |

`summary()` |
Calculates summary statistics for variables in a data frame | `summary(data)` |

`aggregate()` |
Calculates summary statistics for variables in a data frame | `aggregate(y ~ group, data = myData, mean)` |

Relational Operators | Comparison of values in vectors | `x < y` ; `x > y` ; `x <= y` ; `x >= y` ; `x == y` ; `x != y` ; `x %in% y` |

Binary Arithmetic Operators | Perform arithmetic on numeric | `x + y` ; `x - y` ; `x * y` ; `x / y` ; (`x ^ y` OR `x**y` [power]); `x %% y` (modulo); `x %/% y` (integer division) |

Logical Operators | Boolean tests of logical values | `!x` ; `x & y` ; `x | y` |

Extractors | Extract elements of a matrix or a data frame | `x[i]` (for 1-dimensional vectors); `x[i, j]` ; `x$j` (only for `data.frames` ) |

Plot Functions | Creates figures | `plot(y ~ x)` ; `hist(y)` ; `boxplot(y ~ x)` |

Frequency and Contingency Tables | View tables of counts/percentages of categorical variable(s) | `table(x)` ; `addmargins(table(x))` ; `prop.table(table(x), 1:2)` |

Crawley, M. J. (2007). *The R Book*. West Sussex, England: John Wiley & Sons, Ltd.

Hornick, K. (2009). *The R FAQ*. Retrieved from http://CRAN.R-project.org/doc/FAQ/R- FAQ.html

R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

Short, T. (2004, 11 07). *R Reference Card*. Retrieved 07 2010, from http://cran.r-project.org/doc/contrib/Short-refcard.pdf

```
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.10
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] crmda_0.27
loaded via a namespace (and not attached):
[1] Rcpp_0.12.7 digest_0.6.10 assertthat_0.1 plyr_1.8.4
[5] xtable_1.8-2 formatR_1.4 magrittr_1.5 evaluate_0.10
[9] stringi_1.1.2 openxlsx_3.0.0 rmarkdown_1.1 tools_3.3.2
[13] stringr_1.1.0 kutils_0.42 yaml_2.1.13 htmltools_0.3.5
[17] knitr_1.14 tibble_1.2
```

Available under Created Commons license 3.0