data.frame Introduction

List

A list is a diverse collection of R objects. Any R object can be inserted in a list.

Data Frame

An R data.frame is an R list, but with one restriction: the number of rows in each element in the list must be identical.

A “spreadsheet” is the usual way to think of a data.frame. Each column is a variable and each row is a survey respondent or participant in a study.

A data frame is a collection of variables.
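To make the relationship between a list and a data frame concrete, here is a minimal sketch. The object names (mylist, df, msg) are made up for illustration and are not used in the rest of this page.

```r
# A list can hold objects of different types and lengths.
mylist <- list(a = 1:3, b = "hello", c = c(TRUE, FALSE))

# A data frame is a list whose elements all have the same number of rows.
df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))
is.list(df)    # a data frame passes the list test
length(df)     # as a list, its length is the number of columns

# Elements whose lengths cannot be reconciled are rejected with an error.
# tryCatch() captures the error message instead of stopping the script.
msg <- tryCatch(
    data.frame(u = 1:3, v = 1:2),
    error = function(e) conditionMessage(e)
)
msg
```

Note that data.frame() will recycle a shorter vector when its length divides the longer one evenly, so the lengths 3 and 2 were chosen on purpose to trigger the error.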

Let’s make some variables of different types:

N <- 100
x1 <- rnorm(N, mean = 0, sd = 10)
x2 <- rpois(N, lambda = 7)
x3 <- sample(letters[1:26], N, replace = TRUE)
x4 <- gl(5, N/5, labels = c("low", "luke", "med", "warm", "hot"))
class(x1)

[1] "numeric"

class(x2)

[1] "integer"

class(x3)

[1] "character"

class(x4)

[1] "factor"

The data.frame() function will staple those together as columns:

dat <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)
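The stringsAsFactors argument decides whether character vectors are converted to factors when the data frame is built. Its default was TRUE in the R 3.x series used on this page and became FALSE in R 4.0.0, so spelling it out keeps the result the same across versions. A small sketch, using a made-up vector grp:

```r
grp <- c("a", "b", "a")

# Keep the strings as character data.
d_chr <- data.frame(grp, stringsAsFactors = FALSE)
# Convert the strings to a factor.
d_fac <- data.frame(grp, stringsAsFactors = TRUE)

class(d_chr$grp)
class(d_fac$grp)
levels(d_fac$grp)
```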

See what’s in that data frame

1. head

head(dat)

          x1 x2 x3  x4
1   3.613288  6  u low
2   1.642283  7  l low
3 -14.591250  5  k low
4  -1.791002  8  y low
5  -5.678591  6  w low
6  -3.973017  4  w low

2. str

str(dat)

'data.frame':   100 obs. of  4 variables:
 $ x1: num  3.61 1.64 -14.59 -1.79 -5.68 ...
 $ x2: int  6 7 5 8 6 4 12 11 5 8 ...
 $ x3: chr  "u" "l" "k" "y" ...
 $ x4: Factor w/ 5 levels "low","luke","med",..: 1 1 1 1 1 1 1 1 1 1 ...

3. Matrix row/column syntax

Inspect selected rows with the syntax dat[index, ], the same way rows of a matrix are selected:

dat[c(1, 10:14, 99), ]

          x1 x2 x3  x4
1   3.613288  6  u low
10  5.620089  8  q low
11 15.111557 10  m low
12  9.137402  9  x low
13  4.314263  6  y low
14 15.040315 10  z low
99 -8.879234 11  c hot
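The row index does not have to be a vector of integers. A logical condition works as well, which is probably the more common case in practice. This is a sketch on a small made-up data frame called toy, not on dat:

```r
toy <- data.frame(id = 1:6, score = c(4, 9, 12, 7, 15, 3))

# Rows by integer position, as in the example above.
toy[c(1, 3), ]

# Rows by logical condition: keep rows where score exceeds 8.
high <- toy[toy$score > 8, ]
high

# Wrapping the condition in which() converts it to integer positions.
# The difference matters when the condition contains NA values, because
# which() silently drops them instead of returning rows filled with NA.
high2 <- toy[which(toy$score > 8), ]
```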

4. Use View in the GUI

Use View

View(dat)

opens up a table view

5. Extract Columns

Extract a column in any of 3.5 ways!

a. Take the 3rd column by integer index

dat[ , 3]

  [1] "u" "l" "k" "y" "w" "w" "u" "t" "m" "q" "m" "x" "y" "z" "v" "m"
 [17] "x" "k" "n" "o" "l" "b" "w" "a" "t" "w" "e" "w" "a" "l" "h" "x"
 [33] "e" "t" "g" "p" "k" "n" "d" "b" "b" "k" "k" "w" "b" "d" "r" "b"
 [49] "c" "d" "t" "o" "k" "t" "n" "l" "f" "d" "q" "u" "k" "i" "b" "w"
 [65] "n" "v" "c" "j" "g" "e" "w" "i" "t" "w" "q" "h" "t" "w" "f" "m"
 [81] "v" "c" "h" "d" "x" "g" "b" "u" "w" "t" "i" "u" "f" "y" "d" "l"
 [97] "q" "f" "c" "w"

b. Take the 3rd column by its name

dat[ , "x3"]

  [1] "u" "l" "k" "y" "w" "w" "u" "t" "m" "q" "m" "x" "y" "z" "v" "m"
 [17] "x" "k" "n" "o" "l" "b" "w" "a" "t" "w" "e" "w" "a" "l" "h" "x"
 [33] "e" "t" "g" "p" "k" "n" "d" "b" "b" "k" "k" "w" "b" "d" "r" "b"
 [49] "c" "d" "t" "o" "k" "t" "n" "l" "f" "d" "q" "u" "k" "i" "b" "w"
 [65] "n" "v" "c" "j" "g" "e" "w" "i" "t" "w" "q" "h" "t" "w" "f" "m"
 [81] "v" "c" "h" "d" "x" "g" "b" "u" "w" "t" "i" "u" "f" "y" "d" "l"
 [97] "q" "f" "c" "w"

c. Take the 3rd column by the $ "accessor" shortcut.

dat$x3

  [1] "u" "l" "k" "y" "w" "w" "u" "t" "m" "q" "m" "x" "y" "z" "v" "m"
 [17] "x" "k" "n" "o" "l" "b" "w" "a" "t" "w" "e" "w" "a" "l" "h" "x"
 [33] "e" "t" "g" "p" "k" "n" "d" "b" "b" "k" "k" "w" "b" "d" "r" "b"
 [49] "c" "d" "t" "o" "k" "t" "n" "l" "f" "d" "q" "u" "k" "i" "b" "w"
 [65] "n" "v" "c" "j" "g" "e" "w" "i" "t" "w" "q" "h" "t" "w" "f" "m"
 [81] "v" "c" "h" "d" "x" "g" "b" "u" "w" "t" "i" "u" "f" "y" "d" "l"
 [97] "q" "f" "c" "w"

d. Because a data.frame is, technically, also an R list, its columns can be accessed in the same way that list elements are accessed.

Observe:

x3.1 <- dat["x3"]
class(x3.1)

[1] "data.frame"

Note that x3.1 is still a data.frame object, which has this weird-looking implication.

x3.1$x3

  [1] "u" "l" "k" "y" "w" "w" "u" "t" "m" "q" "m" "x" "y" "z" "v" "m"
 [17] "x" "k" "n" "o" "l" "b" "w" "a" "t" "w" "e" "w" "a" "l" "h" "x"
 [33] "e" "t" "g" "p" "k" "n" "d" "b" "b" "k" "k" "w" "b" "d" "r" "b"
 [49] "c" "d" "t" "o" "k" "t" "n" "l" "f" "d" "q" "u" "k" "i" "b" "w"
 [65] "n" "v" "c" "j" "g" "e" "w" "i" "t" "w" "q" "h" "t" "w" "f" "m"
 [81] "v" "c" "h" "d" "x" "g" "b" "u" "w" "t" "i" "u" "f" "y" "d" "l"
 [97] "q" "f" "c" "w"

We probably did not want a data frame with just x3, so the double bracket, which returns the column itself, comes in handy:

x3.2 <- dat[["x3"]]
class(x3.2)

[1] "character"

is.factor(x3.2)

[1] FALSE
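The extraction styles above fall into two groups: those that return the column itself and those that return a one-column data frame. The sketch below lines them up side by side on a made-up data frame toy. It also shows the drop = FALSE argument of the matrix-style bracket, which is not used elsewhere on this page.

```r
toy <- data.frame(id = 1:3, name = c("ann", "bob", "cy"),
                  stringsAsFactors = FALSE)

# These three return the column itself, a character vector.
v1 <- toy[ , "name"]
v2 <- toy[["name"]]
v3 <- toy$name

# These two return a data frame that has one column.
d1 <- toy["name"]
d2 <- toy[ , "name", drop = FALSE]

class(v1)
class(d1)
class(d2)
```

The drop = FALSE form is worth knowing when the columns are chosen by a program: without it, selecting a single column with the matrix-style bracket quietly changes the return type from data frame to vector.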

summary or summarize

summary(dat)

       x1                x2             x3               x4    
 Min.   :-21.079   Min.   : 2.00   Length:100         low :20  
 1st Qu.: -5.655   1st Qu.: 5.00   Class :character   luke:20  
 Median :  1.434   Median : 7.00   Mode  :character   med :20  
 Mean   :  2.092   Mean   : 7.16                      warm:20  
 3rd Qu.:  9.885   3rd Qu.: 9.00                      hot :20  
 Max.   : 25.535   Max.   :14.00

I think the output from rockchalk’s summarize() is better:

library(rockchalk)
summarize(dat)

Numeric variables
             x1       x2  
min        -21.08     2   
med          1.43     7   
max         25.53    14   
mean         2.09     7.16
sd          10.86     2.71
skewness     0.10     0.22
kurtosis    -0.63    -0.79
nobs       100      100   
nmissing     0        0   

Nonnumeric variables
            x3           x4            
 w           :  12   low  :  20        
 t           :   8   luke :  20        
 b           :   7   med  :  20        
 k           :   7   warm :  20        
 (All Others):  66   hot  :  20        
 nobs        : 100   nobs : 100        
 nmiss       :   0   nmiss:   0        
 entropy      : 4.39 entropy      : 2.3
 normedEntropy: 0.94 normedEntropy: 1.0

Rename data frame columns

1 dimnames

Use the dimnames function to rename both rows and columns in one command. This is identical to the way it is done in an R matrix:

dimnames(dat) <- list(paste0("r", 1:100), paste0('a', 1:4))
head(dat)

           a1 a2 a3  a4
r1   3.613288  6  u low
r2   1.642283  7  l low
r3 -14.591250  5  k low
r4  -1.791002  8  y low
r5  -5.678591  6  w low
r6  -3.973017  4  w low

2 colnames, rownames

The functions colnames() and rownames() can be used to retrieve names or set them, depending on whether they are followed by <-.

colnames(dat)

[1] "a1" "a2" "a3" "a4"

colnames(dat) <- c("x1", "x2", "x3", "x4")
head(dat)

           x1 x2 x3  x4
r1   3.613288  6  u low
r2   1.642283  7  l low
r3 -14.591250  5  k low
r4  -1.791002  8  y low
r5  -5.678591  6  w low
r6  -3.973017  4  w low

3 names

Because a data.frame is also an R list, with the special quality that its elements have the same number of rows, it is also allowed to change column names with the names() function.
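A minimal sketch of the names() approach, using a made-up data frame toy so that the column names of dat are left alone:

```r
toy <- data.frame(a1 = 1:3, a2 = c(2.5, 3.5, 4.5))
names(toy)

# Replace all of the column names at once.
names(toy) <- c("id", "score")

# Rename a single column by matching its old name.
names(toy)[names(toy) == "score"] <- "rating"
names(toy)

# For a data frame, names() and colnames() report the same thing.
identical(names(toy), colnames(toy))
```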

Re-calculate new variables

dat$x2log <- log(dat$x2)
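The $ assignment is one of a few equivalent ways to add a calculated column. As a sketch on a made-up data frame toy, the base R function transform() does the same job without repeating the data frame name inside the formula, and assigning NULL removes a column:

```r
toy <- data.frame(count = c(1, 10, 100))

# Add a column with the $ assignment.
toy$logcount <- log(toy$count)

# Add a column with transform(); count is looked up inside toy.
toy <- transform(toy, log10count = log10(count))

# Remove a column by assigning NULL to it.
toy$logcount <- NULL
names(toy)
```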

Interesting problem I ran into recently.

I usually think of a data.frame as a set of columns. I think most people do. However, that’s just wrong. A data.frame object can have elements that are matrices or other data.frames.

This often happens by accident. I do a calculation where I add a column to a data frame.

N <- 100
x1 <- rnorm(N, mean = 0, sd = 10)
x2 <- rpois(N, lambda = 7)
dat2 <- data.frame(x1, x2)

Here’s a fitted regression:

m1 <- lm(x1 ~ x2, data = dat2)

Often, we might take predicted values or residuals, say

dat2$pred <- predict(m1)

That’s OK, as you can see we have a new column on the right side of the data frame:

head(dat2)

           x1 x2       pred
1   0.5811232  7 -0.1819824
2  21.2709273 10  1.6117017
3  16.8116932  2 -3.1714559
4  -5.8627918 11  2.2095964
5  -2.6706822  9  1.0138070
6 -14.5548998  7 -0.1819824

However, a bad accident can happen if the return from predict happens to be a matrix. Consider this:

dat2$otherpred <- predict(m1, interval = "confidence")

The thing, “otherpred”, is a matrix with 3 columns. However, R let me insert it onto the data frame as if it were a column. Now, accessing those elements will be SUPER-confusing.

head(dat2)

           x1 x2       pred otherpred.fit otherpred.lwr otherpred.upr
1   0.5811232  7 -0.1819824    -0.1819824    -1.9909256     1.6269608
2  21.2709273 10  1.6117017     1.6117017    -1.1127209     4.3361243
3  16.8116932  2 -3.1714559    -3.1714559    -6.9252780     0.5823663
4  -5.8627918 11  2.2095964     2.2095964    -1.0430134     5.4622062
5  -2.6706822  9  1.0138070     1.0138070    -1.2560569     3.2836709
6 -14.5548998  7 -0.1819824    -0.1819824    -1.9909256     1.6269608

summary(dat2)

       x1                 x2             pred        
 Min.   :-25.1711   Min.   : 1.00   Min.   :-3.7694  
 1st Qu.: -6.0451   1st Qu.: 5.00   1st Qu.:-1.3778  
 Median :  0.4727   Median : 7.00   Median :-0.1820  
 Mean   : -0.2179   Mean   : 6.94   Mean   :-0.2179  
 3rd Qu.:  5.6533   3rd Qu.: 9.00   3rd Qu.: 1.0138  
 Max.   : 21.2709   Max.   :15.00   Max.   : 4.6012  
    otherpred.fit        otherpred.lwr        otherpred.upr   
 Min.   :-3.769351    Min.   :-8.118528    Min.   : 0.579827  
 1st Qu.:-1.377772    1st Qu.:-3.600259    1st Qu.: 0.844716  
 Median :-0.181982    Median :-1.990926    Median : 1.626961  
 Mean   :-0.217856    Mean   :-2.670808    Mean   : 2.235096  
 3rd Qu.: 1.013807    3rd Qu.:-1.256057    3rd Qu.: 3.283671  
 Max.   : 4.601175    Max.   :-1.016546    Max.   :10.264689

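Accessing the otherpred “column” is where the confusion starts. dat2$otherpred returns the whole 3-column matrix, while the names that head() prints are not real columns: dat2$otherpred.fit returns NULL, and dat2[ , "otherpred.fit"] fails with an “undefined columns selected” error.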
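One way to check whether a data frame is carrying a hidden matrix is to ask each element directly. The sketch below rebuilds the situation on a small made-up data frame (toy, with columns x and z and a matrix element ci) so that it runs on its own. The names are illustrative only.

```r
set.seed(1234)
toy <- data.frame(x = rnorm(20), z = rnorm(20))
fit <- lm(z ~ x, data = toy)

# predict() with an interval returns a matrix with columns fit, lwr, upr.
toy$ci <- predict(fit, interval = "confidence")

ncol(toy)                 # 3, although head(toy) prints 5 columns
sapply(toy, is.matrix)    # flags ci as the matrix element
is.null(toy$ci.fit)       # the printed name ci.fit is not a real column
head(toy$ci[ , "fit"])    # reach inside the matrix to get the values
```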

In case you do want to add a multi-column thing to a data frame, the right way to do it will involve either the R function cbind() or merge().

otherpred <- predict(m1, interval = "confidence")
dat3 <- cbind(dat2, otherpred)
head(dat3)

           x1 x2       pred otherpred.fit otherpred.lwr otherpred.upr
1   0.5811232  7 -0.1819824    -0.1819824    -1.9909256     1.6269608
2  21.2709273 10  1.6117017     1.6117017    -1.1127209     4.3361243
3  16.8116932  2 -3.1714559    -3.1714559    -6.9252780     0.5823663
4  -5.8627918 11  2.2095964     2.2095964    -1.0430134     5.4622062
5  -2.6706822  9  1.0138070     1.0138070    -1.2560569     3.2836709
6 -14.5548998  7 -0.1819824    -0.1819824    -1.9909256     1.6269608
         fit       lwr       upr
1 -0.1819824 -1.990926 1.6269608
2  1.6117017 -1.112721 4.3361243
3 -3.1714559 -6.925278 0.5823663
4  2.2095964 -1.043013 5.4622062
5  1.0138070 -1.256057 3.2836709
6 -0.1819824 -1.990926 1.6269608
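In the output above, cbind() was applied to dat2, which still carried the otherpred matrix, so the otherpred.* entries appear alongside the new fit, lwr and upr columns. As a sketch of the cleaner case, here is cbind() applied to a data frame that never had the matrix inserted. The data frame toy and the matrix ci are made up for illustration.

```r
set.seed(1234)
toy <- data.frame(x = rnorm(20), z = rnorm(20))
fit <- lm(z ~ x, data = toy)
ci <- predict(fit, interval = "confidence")

# cbind() splits the matrix into three ordinary columns.
toy2 <- cbind(toy, ci)
names(toy2)
sapply(toy2, is.matrix)    # no matrix elements remain
head(toy2$lwr)             # each interval column is directly accessible
```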

dat4 <- merge(dat2, otherpred, by = "row.names")
head(dat4)

  Row.names          x1 x2       pred otherpred.fit otherpred.lwr
1         1   0.5811232  7 -0.1819824    -0.1819824    -1.9909256
2        10  -3.5546124  6 -0.7798771    -0.7798771    -2.6936340
3       100 -13.4315275  7 -0.1819824    -0.1819824    -1.9909256
4        11  -3.6873950  9  1.0138070     1.0138070    -1.2560569
5        12 -10.3701365  4 -1.9756665    -1.9756665    -4.6408554
6        13  14.6131337  8  0.4159123     0.4159123    -1.5254483
  otherpred.upr        fit       lwr       upr
1     1.6269608 -0.1819824 -1.990926 1.6269608
2     1.1338798 -0.7798771 -2.693634 1.1338798
3     1.6269608 -0.1819824 -1.990926 1.6269608
4     3.2836709  1.0138070 -1.256057 3.2836709
5     0.6895225 -1.9756665 -4.640855 0.6895225
6     2.3572729  0.4159123 -1.525448 2.3572729

I prefer using merge because, in the olden days, it dealt with missing values in a more graceful way. Today, I don’t think it matters much. Unless I do the merge incorrectly.

I thought it would be easy to show those are identical, but I’m having some trouble. I think my merge is wrong.

Get rid of that first column in dat4

dat4[ , "Row.names"] <- NULL
all.equal(dat3, dat4)

 [1] "Attributes: < Component \"row.names\": Modes: character, numeric >"                                
 [2] "Attributes: < Component \"row.names\": target is character, current is numeric >"                  
 [3] "Component \"x1\": Mean relative difference: 1.329478"                                              
 [4] "Component \"x2\": Mean relative difference: 0.4964789"                                             
 [5] "Component \"pred\": Mean relative difference: 1.545756"                                            
 [6] "Component \"otherpred\": Attributes: < Component \"dimnames\": Component 1: 89 string mismatches >"
 [7] "Component \"otherpred\": Mean relative difference: 0.9939953"                                      
 [8] "Component \"fit\": Mean relative difference: 1.545756"                                             
 [9] "Component \"lwr\": Mean relative difference: 0.7568096"                                            
[10] "Component \"upr\": Mean relative difference: 0.9390171"

sum(abs(dat3$fit - otherpred[ , "fit"]))

[1] 0

sum(abs(dat3$lwr - otherpred[ , "lwr"]))

[1] 0

sum(abs(dat4$fit - otherpred[ , "fit"]))

[1] 168.6063

sum(abs(dat4$lwr - otherpred[ , "lwr"]))

[1] 158.6835

plot(dat4$fit, otherpred[ , "fit"])

Humphf!

dat4 <- merge(dat2, otherpred, by = "row.names")
head(dat4)

  Row.names          x1 x2       pred otherpred.fit otherpred.lwr
1         1   0.5811232  7 -0.1819824    -0.1819824    -1.9909256
2        10  -3.5546124  6 -0.7798771    -0.7798771    -2.6936340
3       100 -13.4315275  7 -0.1819824    -0.1819824    -1.9909256
4        11  -3.6873950  9  1.0138070     1.0138070    -1.2560569
5        12 -10.3701365  4 -1.9756665    -1.9756665    -4.6408554
6        13  14.6131337  8  0.4159123     0.4159123    -1.5254483
  otherpred.upr        fit       lwr       upr
1     1.6269608 -0.1819824 -1.990926 1.6269608
2     1.1338798 -0.7798771 -2.693634 1.1338798
3     1.6269608 -0.1819824 -1.990926 1.6269608
4     3.2836709  1.0138070 -1.256057 3.2836709
5     0.6895225 -1.9756665 -4.640855 0.6895225
6     2.3572729  0.4159123 -1.525448 2.3572729

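Now I see what’s wrong. Once again, I was bone-crushed by the merge function’s decision to shuffle my rows. By default merge() sorts the result on the by key, and row names are character strings, so they sort as "1", "10", "100", "11", and so on.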

dat4 <- dat4[order(as.numeric(dat4[ , "Row.names"])), ]

plot(dat4$fit, otherpred[ , "fit"])

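Or, more simply, ask merge() not to sort. The merge() documentation describes the row order with sort = FALSE as unspecified, so it is worth checking the result: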

dat4 <- merge(dat2, otherpred, by = "row.names", sort = FALSE)

plot(dat4$fit, otherpred[ , "fit"])
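The row shuffle can be reproduced on a tiny example, which makes it easier to see than in a scatterplot. This is a sketch with two made-up data frames, left and right, that share the default row names 1 through 12.

```r
left  <- data.frame(x = 1:12)
right <- data.frame(y = (1:12) * 10)

# Row names are character strings, so they sort like words, not numbers.
sort(as.character(1:12))

# merge() on row names returns the rows in that character order.
merged <- merge(left, right, by = "row.names")
head(merged)

# Restore the original order by sorting on the numeric value of the key.
merged <- merged[order(as.numeric(as.character(merged$Row.names))), ]
identical(merged$x, left$x)
```

The as.character() step is only a precaution in case the key column is not stored as a plain character vector. Comparing a merged column against its source with identical(), as in the last line, is a cheap check that the rows line up before the data are used.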

Session Info

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  base     

other attached packages:
[1] rockchalk_1.8.108 crmda_0.45       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12       knitr_1.17         magrittr_1.5      
 [4] kutils_1.21        splines_3.4.1      MASS_7.3-47       
 [7] xtable_1.8-2       lattice_0.20-35    minqa_1.2.4       
[10] stringr_1.2.0      car_2.1-5          plyr_1.8.4        
[13] tools_3.4.1        parallel_3.4.1     nnet_7.3-12       
[16] pbkrtest_0.4-7     grid_3.4.1         nlme_3.1-131      
[19] mgcv_1.8-22        quantreg_5.33      MatrixModels_0.4-1
[22] htmltools_0.3.6    yaml_2.1.14        lme4_1.1-13       
[25] rprojroot_1.2      digest_0.6.12      Matrix_1.2-11     
[28] nloptr_1.0.4       evaluate_0.10.1    rmarkdown_1.6     
[31] openxlsx_4.0.17    stringi_1.1.5      compiler_3.4.1    
[34] methods_3.4.1      backports_1.1.0    SparseM_1.77

Available under Creative Commons license 3.0 CC BY
