# data.frame Introduction

#### List

A list is a diverse collection of R objects. Any R object can be inserted in a list.

#### Data Frame {.bs-callout .bs-callout-red}

An R data.frame is an R list, but with one restriction: the number of rows in each element of the list must be identical.

A "spreadsheet" is the usual way to think of a data.frame. Each column is a variable and each row is an observation, such as a survey respondent or a participant in a study.
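The list relationship can be checked directly. This is only a minimal sketch: the object `people` and its columns are made up for illustration and are not part of the example data used below.

```r
# Hypothetical toy data: two columns of equal length
people <- data.frame(id = 1:3, name = c("ann", "bob", "cy"))

is.list(people)         # a data.frame is a list underneath
length(people)          # list length = number of columns
sapply(people, length)  # every element has the same number of rows

# Elements whose lengths cannot be reconciled are rejected with an error;
# tryCatch() captures the message instead of stopping the script
msg <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
msg
```

Note that data.frame() will recycle a shorter element when its length divides the longer one evenly, so the restriction applies to the finished object rather than to every input.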

## A data frame is a collection of variables.

Let's make some variables of different types:

N <- 100
x1 <- rnorm(N, mean = 0, sd = 10)               # continuous
x2 <- rpois(N, lambda = 7)                      # counts
x3 <- sample(letters[1:26], N, replace = TRUE)  # text
x4 <- gl(5, N/5, labels = c("low", "luke", "med", "warm", "hot"))  # factor
class(x1)
[1] "numeric"
class(x2)
[1] "integer"
class(x3)
[1] "character"
class(x4)
[1] "factor"

The data.frame() function will staple those together as columns:

dat <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)
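The stringsAsFactors argument decides whether text columns stay as character or are converted to factors. A small sketch of the difference, using a made-up vector `grp` (not part of the data above):

```r
# Hypothetical toy column of text values
grp <- c("a", "b", "a")

as_text   <- data.frame(grp, stringsAsFactors = FALSE)
as_factor <- data.frame(grp, stringsAsFactors = TRUE)

class(as_text$grp)     # "character"
class(as_factor$grp)   # "factor"
levels(as_factor$grp)  # "a" "b"
```

Since R 4.0.0 the default is FALSE, so writing the argument out mainly protects code that may also run on older versions of R.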

## See what’s in that data frame

#### 2. str

str(dat)
'data.frame':   100 obs. of  4 variables:
 $ x1: num  3.61 1.64 -14.59 -1.79 -5.68 ...
 $ x2: int  6 7 5 8 6 4 12 11 5 8 ...
 $ x3: chr  "u" "l" "k" "y" ...
 $ x4: Factor w/ 5 levels "low","luke","med",..: 1 1 1 1 1 1 1 1 1 1 ...

#### 3. Matrix row/Column syntax

Inspect some rows with the syntax dat[index, ], which works the same way as it does for matrices:

dat[c(1, 10:14, 99), ]
          x1 x2 x3  x4
1   3.613288  6  u low
10  5.620089  8  q low
11 15.111557 10  m low
12  9.137402  9  x low
13  4.314263  6  y low
14 15.040315 10  z low
99 -8.879234 11  c hot
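The row index does not have to be a vector of integers. As a hedged illustration on a made-up data frame `toy` (an assumption for this sketch, not the `dat` object above), rows can also be chosen with a logical condition:

```r
# Hypothetical toy data frame
toy <- data.frame(score = c(5, 12, 7, 15), grade = c("a", "b", "a", "c"))

by_position  <- toy[c(1, 3), ]                 # rows 1 and 3
by_condition <- toy[toy$score > 10, ]          # rows where score exceeds 10
a_scores     <- toy[toy$grade == "a", "score"] # rows and one column together

by_position
by_condition
a_scores
```

When a single column is requested along with the rows, as in the last line, the result drops down to a plain vector rather than a data frame.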

#### 4. Use View in the GUI

View opens up a table view of the data frame in the GUI:

View(dat)

#### 5. Extract Columns

Extract a column in any of 3.5 ways!

a. Take the 3rd column by integer index

dat[ , 3]
 [1] "u" "l" "k" "y" "w" "w" "u" "t" "m" "q" "m" "x" "y" "z" "v" "m"
[17] "x" "k" "n" "o" "l" "b" "w" "a" "t" "w" "e" "w" "a" "l" "h" "x"
[33] "e" "t" "g" "p" "k" "n" "d" "b" "b" "k" "k" "w" "b" "d" "r" "b"
[49] "c" "d" "t" "o" "k" "t" "n" "l" "f" "d" "q" "u" "k" "i" "b" "w"
[65] "n" "v" "c" "j" "g" "e" "w" "i" "t" "w" "q" "h" "t" "w" "f" "m"
[81] "v" "c" "h" "d" "x" "g" "b" "u" "w" "t" "i" "u" "f" "y" "d" "l"
[97] "q" "f" "c" "w"

b. Take the 3rd column by its name

dat[ , "x3"]
 [1] "u" "l" "k" "y" "w" "w" "u" "t" "m" "q" "m" "x" "y" "z" "v" "m"
[17] "x" "k" "n" "o" "l" "b" "w" "a" "t" "w" "e" "w" "a" "l" "h" "x"
[33] "e" "t" "g" "p" "k" "n" "d" "b" "b" "k" "k" "w" "b" "d" "r" "b"
[49] "c" "d" "t" "o" "k" "t" "n" "l" "f" "d" "q" "u" "k" "i" "b" "w"
[65] "n" "v" "c" "j" "g" "e" "w" "i" "t" "w" "q" "h" "t" "w" "f" "m"
[81] "v" "c" "h" "d" "x" "g" "b" "u" "w" "t" "i" "u" "f" "y" "d" "l"
[97] "q" "f" "c" "w"

c. Take the 3rd column by the $ "accessor" shortcut.

dat$x3
 [1] "u" "l" "k" "y" "w" "w" "u" "t" "m" "q" "m" "x" "y" "z" "v" "m"
[17] "x" "k" "n" "o" "l" "b" "w" "a" "t" "w" "e" "w" "a" "l" "h" "x"
[33] "e" "t" "g" "p" "k" "n" "d" "b" "b" "k" "k" "w" "b" "d" "r" "b"
[49] "c" "d" "t" "o" "k" "t" "n" "l" "f" "d" "q" "u" "k" "i" "b" "w"
[65] "n" "v" "c" "j" "g" "e" "w" "i" "t" "w" "q" "h" "t" "w" "f" "m"
[81] "v" "c" "h" "d" "x" "g" "b" "u" "w" "t" "i" "u" "f" "y" "d" "l"
[97] "q" "f" "c" "w"

d. Because a data.frame is, technically, also an R list, columns can be accessed in the same way that list elements are accessed.

Observe:

x3.1 <- dat["x3"]
class(x3.1)
[1] "data.frame"

Note that x3.1 is still a data.frame object, which has this weird-looking implication: the column has to be pulled out of it a second time.

x3.1$x3
 [1] "u" "l" "k" "y" "w" "w" "u" "t" "m" "q" "m" "x" "y" "z" "v" "m"
[17] "x" "k" "n" "o" "l" "b" "w" "a" "t" "w" "e" "w" "a" "l" "h" "x"
[33] "e" "t" "g" "p" "k" "n" "d" "b" "b" "k" "k" "w" "b" "d" "r" "b"
[49] "c" "d" "t" "o" "k" "t" "n" "l" "f" "d" "q" "u" "k" "i" "b" "w"
[65] "n" "v" "c" "j" "g" "e" "w" "i" "t" "w" "q" "h" "t" "w" "f" "m"
[81] "v" "c" "h" "d" "x" "g" "b" "u" "w" "t" "i" "u" "f" "y" "d" "l"
[97] "q" "f" "c" "w"

We probably did not want a data frame with just x3, so the double bracket comes in handy:

x3.2 <- dat[["x3"]]
class(x3.2)
[1] "character"
is.factor(x3.2)
[1] FALSE

## summary or summarize

summary(dat)
       x1                x2             x3               x4
 Min.   :-21.079   Min.   : 2.00   Length:100         low :20
 1st Qu.: -5.655   1st Qu.: 5.00   Class :character   luke:20
 Median :  1.434   Median : 7.00   Mode  :character   med :20
 Mean   :  2.092   Mean   : 7.16                      warm:20
 3rd Qu.:  9.885   3rd Qu.: 9.00                      hot :20
 Max.   : 25.535   Max.   :14.00

I think the output from the rockchalk summarize function is better:

library(rockchalk)
summarize(dat)
Numeric variables
              x1      x2
min       -21.08    2
med         1.43    7
max        25.53   14
mean        2.09    7.16
sd         10.86    2.71
skewness    0.10    0.22
kurtosis   -0.63   -0.79
nobs      100     100
nmissing    0       0

Nonnumeric variables
            x3                     x4
 w            :  12    low          :  20
 t            :   8    luke         :  20
 b            :   7    med          :  20
 k            :   7    warm         :  20
 (All Others) :  66    hot          :  20
 nobs         : 100    nobs         : 100
 nmiss        :   0    nmiss        :   0
 entropy      : 4.39   entropy      : 2.3
 normedEntropy: 0.94   normedEntropy: 1.0

# Rename data frame columns

#### 1 dimnames

1. Use the dimnames function to rename both rows and columns in one command. This is identical to the way it is done in an R matrix:

dimnames(dat) <- list(paste0("r", 1:100), paste0('a', 1:4))
head(dat)
           a1 a2 a3  a4
r1   3.613288  6  u low
r2   1.642283  7  l low
r3 -14.591250  5  k low
r4  -1.791002  8  y low
r5  -5.678591  6  w low
r6  -3.973017  4  w low

#### 2 colnames, rownames

1. The functions colnames() and rownames() can be used to retrieve names or set them, depending on whether they are followed by <-.

colnames(dat)
[1] "a1" "a2" "a3" "a4"
colnames(dat) <- c("x1", "x2", "x3", "x4")
head(dat)
           x1 x2 x3  x4
r1   3.613288  6  u low
r2   1.642283  7  l low
r3 -14.591250  5  k low
r4  -1.791002  8  y low
r5  -5.678591  6  w low
r6  -3.973017  4  w low

#### 3 names

1. Because a data.frame is also an R list, with the special quality that its elements have the same number of rows, column names can also be changed with the names() function.

# Re-calculate new variables

dat$x2log <- log(dat$x2)

# Interesting problem I ran into recently.

I usually think of a data.frame as a set of columns. I think most people do. However, that's just wrong. A data.frame object can have elements that are matrices or other data.frames. This often happens by accident. I do a calculation where I add a column to a data frame.

N <- 100
x1 <- rnorm(N, mean = 0, sd = 10)
x2 <- rpois(N, lambda = 7)
dat2 <- data.frame(x1, x2)

Here's a fitted regression:

m1 <- lm(x1 ~ x2, data = dat2)

Often, we might take predicted values or residuals, say

dat2$pred <- predict(m1)

That's OK. As you can see, we have a new column on the right side of the data frame:

head(dat2)
           x1 x2       pred
1   0.5811232  7 -0.1819824
2  21.2709273 10  1.6117017
3  16.8116932  2 -3.1714559
4  -5.8627918 11  2.2095964
5  -2.6706822  9  1.0138070
6 -14.5548998  7 -0.1819824

However, a bad accident can happen if the return from predict happens to be a matrix. Consider this:

dat2$otherpred <- predict(m1, interval = "confidence")

The thing "otherpred" is a matrix with 3 columns. However, R let me insert it into the data frame as if it were a single column. Now, accessing those elements will be SUPER-confusing.

head(dat2)
           x1 x2       pred otherpred.fit otherpred.lwr otherpred.upr
1   0.5811232  7 -0.1819824    -0.1819824    -1.9909256     1.6269608
2  21.2709273 10  1.6117017     1.6117017    -1.1127209     4.3361243
3  16.8116932  2 -3.1714559    -3.1714559    -6.9252780     0.5823663
4  -5.8627918 11  2.2095964     2.2095964    -1.0430134     5.4622062
5  -2.6706822  9  1.0138070     1.0138070    -1.2560569     3.2836709
6 -14.5548998  7 -0.1819824    -0.1819824    -1.9909256     1.6269608

summary(dat2)
       x1                 x2             pred
 Min.   :-25.1711   Min.   : 1.00   Min.   :-3.7694
 1st Qu.: -6.0451   1st Qu.: 5.00   1st Qu.:-1.3778
 Median :  0.4727   Median : 7.00   Median :-0.1820
 Mean   : -0.2179   Mean   : 6.94   Mean   :-0.2179
 3rd Qu.:  5.6533   3rd Qu.: 9.00   3rd Qu.: 1.0138
 Max.   : 21.2709   Max.   :15.00   Max.   : 4.6012
 otherpred.fit        otherpred.lwr        otherpred.upr
 Min.   :-3.769351    Min.   :-8.118528    Min.   : 0.579827
 1st Qu.:-1.377772    1st Qu.:-3.600259    1st Qu.: 0.844716
 Median :-0.181982    Median :-1.990926    Median : 1.626961
 Mean   :-0.217856    Mean   :-2.670808    Mean   : 2.235096
 3rd Qu.: 1.013807    3rd Qu.:-1.256057    3rd Qu.: 3.283671
 Max.   : 4.601175    Max.   :-1.016546    Max.   :10.264689

You'll run into trouble if you try to access the otherpred "columns" by the names shown in the printout. dat2$otherpred.fit returns NULL, because the data frame really holds one element, dat2$otherpred, and that element is the whole 3-column matrix.
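What R actually stored can be inspected on a smaller case. The following is a sketch under stated assumptions: the data frame `toy`, the model `fit`, and the element name `ci` are invented for illustration and stand in for dat2, m1, and otherpred.

```r
# Hypothetical toy data and model
toy <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit <- lm(y ~ x, data = toy)

# Assigning a 3-column matrix creates ONE element that holds the matrix
toy$ci <- predict(fit, interval = "confidence")

names(toy)           # three elements: x, y, ci (not five columns)
dim(toy$ci)          # the element itself has 5 rows and 3 columns
colnames(toy$ci)     # fit, lwr, upr
is.null(toy$ci.fit)  # TRUE: the printed name "ci.fit" is not a real column
toy$ci[ , "fit"]     # reach inside the matrix instead
```

Checking names() and dim() like this is a quick way to notice that a printout is showing composite names rather than real columns.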

In case you do want to add a multi-column thing to a data frame, the right way to do it involves either the R function cbind() or merge().

otherpred <- predict(m1, interval = "confidence")
dat3 <- cbind(dat2, otherpred)
head(dat3)
           x1 x2       pred otherpred.fit otherpred.lwr otherpred.upr
1   0.5811232  7 -0.1819824    -0.1819824    -1.9909256     1.6269608
2  21.2709273 10  1.6117017     1.6117017    -1.1127209     4.3361243
3  16.8116932  2 -3.1714559    -3.1714559    -6.9252780     0.5823663
4  -5.8627918 11  2.2095964     2.2095964    -1.0430134     5.4622062
5  -2.6706822  9  1.0138070     1.0138070    -1.2560569     3.2836709
6 -14.5548998  7 -0.1819824    -0.1819824    -1.9909256     1.6269608
         fit       lwr       upr
1 -0.1819824 -1.990926 1.6269608
2  1.6117017 -1.112721 4.3361243
3 -3.1714559 -6.925278 0.5823663
4  2.2095964 -1.043013 5.4622062
5  1.0138070 -1.256057 3.2836709
6 -0.1819824 -1.990926 1.6269608
dat4 <- merge(dat2, otherpred, by = "row.names")
head(dat4)
  Row.names          x1 x2       pred otherpred.fit otherpred.lwr
1         1   0.5811232  7 -0.1819824    -0.1819824    -1.9909256
2        10  -3.5546124  6 -0.7798771    -0.7798771    -2.6936340
3       100 -13.4315275  7 -0.1819824    -0.1819824    -1.9909256
4        11  -3.6873950  9  1.0138070     1.0138070    -1.2560569
5        12 -10.3701365  4 -1.9756665    -1.9756665    -4.6408554
6        13  14.6131337  8  0.4159123     0.4159123    -1.5254483
  otherpred.upr        fit       lwr       upr
1     1.6269608 -0.1819824 -1.990926 1.6269608
2     1.1338798 -0.7798771 -2.693634 1.1338798
3     1.6269608 -0.1819824 -1.990926 1.6269608
4     3.2836709  1.0138070 -1.256057 3.2836709
5     0.6895225 -1.9756665 -4.640855 0.6895225
6     2.3572729  0.4159123 -1.525448 2.3572729
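In the output above, both the otherpred.* names and the plain fit, lwr, upr columns appear, apparently because dat2 still held the matrix element when it was combined. A hedged sketch of the cbind() route on a clean data frame follows; `toy`, `fit`, `ci`, and `wide` are made-up names for this illustration.

```r
# Hypothetical toy data and model
toy <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit <- lm(y ~ x, data = toy)
ci  <- predict(fit, interval = "confidence")  # 5 x 3 matrix

# cbind() spreads the matrix into ordinary columns
wide <- cbind(toy, ci)
names(wide)              # x, y, fit, lwr, upr
sapply(wide, is.matrix)  # no matrix elements remain

# If a matrix element was already inserted by accident, drop it first
toy$ci <- ci
toy$ci <- NULL
names(toy)               # back to x, y
```

Removing the accidental element before combining keeps the result to one copy of each interval column.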

I prefer using merge because, in the olden days, it dealt with missing values in a more graceful way. Today, I don’t think it matters much. Unless I do the merge incorrectly.

I thought it would be easy to show those are identical, but I’m having some trouble. I think my merge is wrong.

Get rid of that first column in dat4

dat4[ , "Row.names"] <- NULL
all.equal(dat3, dat4)
 [1] "Attributes: < Component \"row.names\": Modes: character, numeric >"
 [2] "Attributes: < Component \"row.names\": target is character, current is numeric >"
 [3] "Component \"x1\": Mean relative difference: 1.329478"
 [4] "Component \"x2\": Mean relative difference: 0.4964789"
 [5] "Component \"pred\": Mean relative difference: 1.545756"
 [6] "Component \"otherpred\": Attributes: < Component \"dimnames\": Component 1: 89 string mismatches >"
 [7] "Component \"otherpred\": Mean relative difference: 0.9939953"
 [8] "Component \"fit\": Mean relative difference: 1.545756"
 [9] "Component \"lwr\": Mean relative difference: 0.7568096"
[10] "Component \"upr\": Mean relative difference: 0.9390171"
sum(dat3$fit - otherpred[ , "fit"])
[1] 0
sum(dat3$lwr - otherpred[ , "lwr"])
[1] 0
sum(abs(dat4$fit - otherpred[ , "fit"]))
[1] 168.6063
sum(abs(dat4$lwr - otherpred[ , "lwr"]))
[1] 158.6835
plot(dat4$fit, otherpred[ , "fit"])

Humphf!

dat4 <- merge(dat2, otherpred, by = "row.names")
head(dat4)
  Row.names          x1 x2       pred otherpred.fit otherpred.lwr
1         1   0.5811232  7 -0.1819824    -0.1819824    -1.9909256
2        10  -3.5546124  6 -0.7798771    -0.7798771    -2.6936340
3       100 -13.4315275  7 -0.1819824    -0.1819824    -1.9909256
4        11  -3.6873950  9  1.0138070     1.0138070    -1.2560569
5        12 -10.3701365  4 -1.9756665    -1.9756665    -4.6408554
6        13  14.6131337  8  0.4159123     0.4159123    -1.5254483
  otherpred.upr        fit       lwr       upr
1     1.6269608 -0.1819824 -1.990926 1.6269608
2     1.1338798 -0.7798771 -2.693634 1.1338798
3     1.6269608 -0.1819824 -1.990926 1.6269608
4     3.2836709  1.0138070 -1.256057 3.2836709
5     0.6895225 -1.9756665 -4.640855 0.6895225
6     2.3572729  0.4159123 -1.525448 2.3572729

Now I see what's wrong. Once again, I was bone-crushed by the merge function's decision to shuffle my rows.

dat4 <- dat4[order(as.numeric(dat4[ , "Row.names"])), ]
plot(dat4$fit, otherpred[ , "fit"])

Or, more simply:

dat4 <- merge(dat2, otherpred, by = "row.names", sort = FALSE)
plot(dat4$fit, otherpred[ , "fit"])
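The shuffle comes from merge() sorting the row names as text, so "10" and "100" land before "2". The help page for merge describes the row order under sort = FALSE as unspecified, so an explicit reorder is the more defensive choice. A minimal sketch, assuming made-up data frames `left` and `right` that share default row names:

```r
# Hypothetical toy data: 12 rows so that row names "10", "11", "12" exist
left  <- data.frame(x = 1:12)
right <- data.frame(z = letters[1:12])

merged <- merge(left, right, by = "row.names")
as.character(merged$Row.names)  # sorted as text, not as numbers

# Restore the original order from the numeric value of the key
key      <- as.integer(as.character(merged$Row.names))
restored <- merged[order(key), ]
restored$x                      # back in the order 1, 2, ..., 12
```

Converting Row.names to a number before ordering is the same idea as the order(as.numeric(...)) line above, only applied to a case small enough to check by eye.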