Paul Johnson, CRMDA <pauljohn@ku.edu>
Please visit http://pj.freefaculty.org/guides
Keywords: R,vectors
September 20 2017
Abstract
A data frame is a list, but with one special difference: the elements in a data.frame must all have the same number of items. Think of it as a rectangle; use “View” to see it.
A list is a diverse collection of R objects. Any R object can be inserted in a list.

#### Data Frame {.bs-callout .bs-callout-red}

An R data.frame is an R list, but with one restriction: the number of rows in each element of the list must be identical.
A “spreadsheet” is the usual way to think of a data.frame. Each column is a variable and each row is a survey respondent or participant in a study.
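To make the "list with a restriction" idea concrete, here is a minimal sketch. The object names (`df`, `res`) and the values are made up for illustration and are not part of the guide's example.

```r
# A small, made-up data frame (hypothetical names and values)
df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))

is.list(df)         # a data.frame is a list underneath
length(df)          # as a list, its length is the number of columns
sapply(df, length)  # every element holds the same number of items

# Elements with incompatible lengths are refused.
# try() keeps the error from stopping the script.
res <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(res, "try-error")
```

The last call should report that `data.frame()` raised an error, because 3 items and 2 items cannot form a rectangle.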
Let’s make some variables of different types:
N <- 100
x1 <- rnorm(N, mean = 0, sd = 10)
x2 <- rpois(N, lambda = 7)
x3 <- sample(letters[1:26], N, replace = TRUE)
x4 <- gl(5, N/5, labels = c("low", "luke", "med", "warm", "hot"))
class(x1)
[1] "numeric"
class(x2)
[1] "integer"
class(x3)
[1] "character"
class(x4)
[1] "factor"
The data.frame() function will staple those together as columns:
dat <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)
head(dat)
x1 x2 x3 x4
1 3.613288 6 u low
2 1.642283 7 l low
3 -14.591250 5 k low
4 -1.791002 8 y low
5 -5.678591 6 w low
6 -3.973017 4 w low
str(dat)
'data.frame': 100 obs. of 4 variables:
$ x1: num 3.61 1.64 -14.59 -1.79 -5.68 ...
$ x2: int 6 7 5 8 6 4 12 11 5 8 ...
$ x3: chr "u" "l" "k" "y" ...
$ x4: Factor w/ 5 levels "low","luke","med",..: 1 1 1 1 1 1 1 1 1 1 ...
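The `stringsAsFactors = FALSE` argument is what kept x3 as a character variable. The sketch below, with a made-up vector `chars`, shows both settings side by side. In R versions before 4.0.0 (including the 3.4.1 used for this guide) the default converted character vectors to factors; from R 4.0.0 on, the default is FALSE.

```r
# A made-up character vector (hypothetical example data)
chars <- c("a", "b", "a")

d_chr <- data.frame(chars, stringsAsFactors = FALSE)
d_fac <- data.frame(chars, stringsAsFactors = TRUE)

class(d_chr$chars)  # stays character
class(d_fac$chars)  # converted to factor
```

Setting the argument explicitly, as the guide does, gives the same result in old and new versions of R.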
Inspect some rows with the syntax dat[index, ], similar to matrices:
dat[c(1, 10:14, 99), ]
x1 x2 x3 x4
1 3.613288 6 u low
10 5.620089 8 q low
11 15.111557 10 m low
12 9.137402 9 x low
13 4.314263 6 y low
14 15.040315 10 z low
99 -8.879234 11 c hot
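The row index does not have to be a vector of positions. As with matrices, a logical vector or negative integers also work. A small sketch with a hypothetical data frame `toy`:

```r
# Hypothetical data, just to show the indexing styles
toy <- data.frame(x = c(5, -2, 9, -7), g = c("a", "b", "a", "b"))

by_position <- toy[c(1, 3), ]    # rows 1 and 3
by_condition <- toy[toy$x > 0, ] # rows where x is positive
without_first <- toy[-1, ]       # everything except row 1

by_condition
```

Note the comma: the space after it is the (empty) column index, meaning "all columns".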
Use View() to open a table view of the data frame:
View(dat)
Extract a column in any of 3.5 ways!
a. Take the 3rd column by integer index
dat[ , 3]
[1] "u" "l" "k" "y" "w" "w" "u" "t" "m" "q" "m" "x" "y" "z" "v" "m"
[17] "x" "k" "n" "o" "l" "b" "w" "a" "t" "w" "e" "w" "a" "l" "h" "x"
[33] "e" "t" "g" "p" "k" "n" "d" "b" "b" "k" "k" "w" "b" "d" "r" "b"
[49] "c" "d" "t" "o" "k" "t" "n" "l" "f" "d" "q" "u" "k" "i" "b" "w"
[65] "n" "v" "c" "j" "g" "e" "w" "i" "t" "w" "q" "h" "t" "w" "f" "m"
[81] "v" "c" "h" "d" "x" "g" "b" "u" "w" "t" "i" "u" "f" "y" "d" "l"
[97] "q" "f" "c" "w"
b. Take the 3rd column by its name
dat[ , "x3"]
[1] "u" "l" "k" "y" "w" "w" "u" "t" "m" "q" "m" "x" "y" "z" "v" "m"
[17] "x" "k" "n" "o" "l" "b" "w" "a" "t" "w" "e" "w" "a" "l" "h" "x"
[33] "e" "t" "g" "p" "k" "n" "d" "b" "b" "k" "k" "w" "b" "d" "r" "b"
[49] "c" "d" "t" "o" "k" "t" "n" "l" "f" "d" "q" "u" "k" "i" "b" "w"
[65] "n" "v" "c" "j" "g" "e" "w" "i" "t" "w" "q" "h" "t" "w" "f" "m"
[81] "v" "c" "h" "d" "x" "g" "b" "u" "w" "t" "i" "u" "f" "y" "d" "l"
[97] "q" "f" "c" "w"
c. Take the 3rd column by the $ "accessor" shortcut.
dat$x3
[1] "u" "l" "k" "y" "w" "w" "u" "t" "m" "q" "m" "x" "y" "z" "v" "m"
[17] "x" "k" "n" "o" "l" "b" "w" "a" "t" "w" "e" "w" "a" "l" "h" "x"
[33] "e" "t" "g" "p" "k" "n" "d" "b" "b" "k" "k" "w" "b" "d" "r" "b"
[49] "c" "d" "t" "o" "k" "t" "n" "l" "f" "d" "q" "u" "k" "i" "b" "w"
[65] "n" "v" "c" "j" "g" "e" "w" "i" "t" "w" "q" "h" "t" "w" "f" "m"
[81] "v" "c" "h" "d" "x" "g" "b" "u" "w" "t" "i" "u" "f" "y" "d" "l"
[97] "q" "f" "c" "w"
d. Because a data.frame is, technically, also an R list, columns can be accessed in the same way that list elements are accessed. Observe:
x3.1 <- dat["x3"]
class(x3.1)
[1] "data.frame"
Note that x3.1 is still a data.frame object, which has this weird-looking implication:
x3.1$x3
[1] "u" "l" "k" "y" "w" "w" "u" "t" "m" "q" "m" "x" "y" "z" "v" "m"
[17] "x" "k" "n" "o" "l" "b" "w" "a" "t" "w" "e" "w" "a" "l" "h" "x"
[33] "e" "t" "g" "p" "k" "n" "d" "b" "b" "k" "k" "w" "b" "d" "r" "b"
[49] "c" "d" "t" "o" "k" "t" "n" "l" "f" "d" "q" "u" "k" "i" "b" "w"
[65] "n" "v" "c" "j" "g" "e" "w" "i" "t" "w" "q" "h" "t" "w" "f" "m"
[81] "v" "c" "h" "d" "x" "g" "b" "u" "w" "t" "i" "u" "f" "y" "d" "l"
[97] "q" "f" "c" "w"
We probably did not want a data frame with just x3, so the double bracket comes in handy:
x3.2 <- dat[["x3"]]
class(x3.2)
[1] "character"
is.factor(x3.2)
[1] FALSE
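The difference between the single and the double bracket can be summarized in a few lines. This is a sketch on a made-up data frame `toy`, not on the guide's `dat`:

```r
# Hypothetical two-column data frame
toy <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

class(toy["y"])                   # single bracket: a one-column data.frame
class(toy[["y"]])                 # double bracket: the column itself
identical(toy[["y"]], toy$y)      # $ is shorthand for the double bracket
identical(toy[ , "y"], toy$y)     # matrix-style indexing also drops to a vector
class(toy[ , "y", drop = FALSE])  # unless drop = FALSE is requested
```

The `drop = FALSE` form is useful inside functions, where a result that is sometimes a vector and sometimes a data frame causes trouble.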
summary(dat)
x1 x2 x3 x4
Min. :-21.079 Min. : 2.00 Length:100 low :20
1st Qu.: -5.655 1st Qu.: 5.00 Class :character luke:20
Median : 1.434 Median : 7.00 Mode :character med :20
Mean : 2.092 Mean : 7.16 warm:20
3rd Qu.: 9.885 3rd Qu.: 9.00 hot :20
Max. : 25.535 Max. :14.00
I think the output from the rockchalk summarize() function is better:
library(rockchalk)
summarize(dat)
Numeric variables
x1 x2
min -21.08 2
med 1.43 7
max 25.53 14
mean 2.09 7.16
sd 10.86 2.71
skewness 0.10 0.22
kurtosis -0.63 -0.79
nobs 100 100
nmissing 0 0
Nonnumeric variables
x3 x4
w : 12 low : 20
t : 8 luke : 20
b : 7 med : 20
k : 7 warm : 20
(All Others): 66 hot : 20
nobs : 100 nobs : 100
nmiss : 0 nmiss: 0
entropy : 4.39 entropy : 2.3
normedEntropy: 0.94 normedEntropy: 1.0
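If rockchalk is not installed, a rough table for the numeric columns can be put together in base R. The following is only a sketch of the idea, not how rockchalk computes its output; the data frame `toy` and the choice of statistics are assumptions made for illustration.

```r
# Hypothetical data with two numeric columns and one character column
toy <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40),
                  g = c("a", "b", "a", "b"), stringsAsFactors = FALSE)

# Pick out the numeric columns
isnum <- sapply(toy, is.numeric)

# One column of statistics per numeric variable
stats <- sapply(toy[isnum], function(v) {
    c(mean = mean(v), sd = sd(v), nmissing = sum(is.na(v)))
})
stats
```

Because a data frame is a list, `sapply()` walks over its columns one at a time, which is what makes this short.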
Use the dimnames() function to rename both rows and columns in one command. This is identical to the way it is done in an R matrix:
dimnames(dat) <- list(paste0("r", 1:100), paste0("a", 1:4))
head(dat)
a1 a2 a3 a4
r1 3.613288 6 u low
r2 1.642283 7 l low
r3 -14.591250 5 k low
r4 -1.791002 8 y low
r5 -5.678591 6 w low
r6 -3.973017 4 w low
colnames() and rownames() can be used to retrieve names or set them, depending on whether they are followed by <-.
colnames(dat)
[1] "a1" "a2" "a3" "a4"
colnames(dat) <- c("x1", "x2", "x3", "x4")
head(dat)
x1 x2 x3 x4
r1 3.613288 6 u low
r2 1.642283 7 l low
r3 -14.591250 5 k low
r4 -1.791002 8 y low
r5 -5.678591 6 w low
r6 -3.973017 4 w low
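The same functions can change just one name by indexing the result before assigning. A short sketch with a hypothetical data frame `toy`:

```r
# Hypothetical data frame with default row names
toy <- data.frame(a = 1:2, b = 3:4)

# Rename only the column currently called "b"
colnames(toy)[colnames(toy) == "b"] <- "z"

# Replace all row names
rownames(toy) <- c("first", "second")

dimnames(toy)  # a list: row names first, then column names
```

Selecting the name by its current value, rather than by position, keeps the code correct if the column order changes later.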
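Column names are also available from the names() function.
A new column can be added by assigning to a name that does not exist yet:
dat$x2log <- log(dat$x2)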
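Adding and removing columns by assignment can be sketched on a small made-up data frame. The names `toy`, `xlog`, and `xsq` are hypothetical.

```r
# Hypothetical one-column data frame
toy <- data.frame(x = c(1, 10, 100))

toy$xlog <- log10(toy$x)   # add a column with $
toy[["xsq"]] <- toy$x^2    # the double bracket works the same way
toy$xsq <- NULL            # assigning NULL removes a column

names(toy)
```

After these steps only `x` and `xlog` remain.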
I usually think of a data.frame as a set of columns. I think most people do. However, that’s just wrong. A data.frame object can have elements that are matrices or other data.frames.
This often happens by accident. I do a calculation where I add a column to a data frame.
N <- 100
x1 <- rnorm(N, mean = 0, sd = 10)
x2 <- rpois(N, lambda = 7)
dat2 <- data.frame(x1, x2)
Here’s a fitted regression:
m1 <- lm(x1 ~ x2, data = dat2)
Often, we might take predicted values or residuals, say:
dat2$pred <- predict(m1)
That’s OK. As you can see, we have a new column on the right side of the data frame:
head(dat2)
x1 x2 pred
1 0.5811232 7 -0.1819824
2 21.2709273 10 1.6117017
3 16.8116932 2 -3.1714559
4 -5.8627918 11 2.2095964
5 -2.6706822 9 1.0138070
6 -14.5548998 7 -0.1819824
However, a bad accident can happen if the return from predict happens to be a matrix. Consider this:
dat2$otherpred <- predict(m1, interval = "confidence")
The thing “otherpred” is a matrix with 3 columns. However, R let me insert it into the data frame as if it were a column. Now, accessing those elements will be SUPER-confusing.
head(dat2)
x1 x2 pred otherpred.fit otherpred.lwr otherpred.upr
1 0.5811232 7 -0.1819824 -0.1819824 -1.9909256 1.6269608
2 21.2709273 10 1.6117017 1.6117017 -1.1127209 4.3361243
3 16.8116932 2 -3.1714559 -3.1714559 -6.9252780 0.5823663
4 -5.8627918 11 2.2095964 2.2095964 -1.0430134 5.4622062
5 -2.6706822 9 1.0138070 1.0138070 -1.2560569 3.2836709
6 -14.5548998 7 -0.1819824 -0.1819824 -1.9909256 1.6269608
summary(dat2)
x1 x2 pred
Min. :-25.1711 Min. : 1.00 Min. :-3.7694
1st Qu.: -6.0451 1st Qu.: 5.00 1st Qu.:-1.3778
Median : 0.4727 Median : 7.00 Median :-0.1820
Mean : -0.2179 Mean : 6.94 Mean :-0.2179
3rd Qu.: 5.6533 3rd Qu.: 9.00 3rd Qu.: 1.0138
Max. : 21.2709 Max. :15.00 Max. : 4.6012
otherpred.fit otherpred.lwr otherpred.upr
Min. :-3.769351 Min. :-8.118528 Min. : 0.579827
1st Qu.:-1.377772 1st Qu.:-3.600259 1st Qu.: 0.844716
Median :-0.181982 Median :-1.990926 Median : 1.626961
Mean :-0.217856 Mean :-2.670808 Mean : 2.235096
3rd Qu.: 1.013807 3rd Qu.:-1.256057 3rd Qu.: 3.283671
Max. : 4.601175 Max. :-1.016546 Max. :10.264689
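Trying to access the otherpred “column” by the names that head() prints will fail. The data frame really has one element named otherpred, which is the whole 3-column matrix, so dat2$otherpred.fit returns NULL and dat2[ , "otherpred.fit"] is an error. The values are reachable only through the matrix, as in dat2$otherpred[ , "fit"].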
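The confusion is easier to see on a tiny example. The sketch below uses a hypothetical data frame `toy` and a hand-made two-column matrix in place of the output from predict.

```r
# Hypothetical data frame with a matrix stored as one element
toy <- data.frame(x = 1:3)
toy$m <- cbind(fit = c(1, 2, 3), lwr = c(0, 1, 2))

names(toy)          # only "x" and "m", although three columns are printed
ncol(toy)           # 2: the matrix counts as a single element
is.matrix(toy$m)    # TRUE
is.null(toy$m.fit)  # TRUE: there is no element with the printed name

# The printed name is also rejected by bracket indexing
res <- try(toy[ , "m.fit"], silent = TRUE)
inherits(res, "try-error")

# The values have to be reached through the matrix
toy$m[ , "fit"]
```

The names shown by `head()` or `print()` are built for display only, which is why they cannot be used for indexing.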
In case you do want to add a multi-column thing to a data frame, the right way to do it will involve either the R function cbind() or merge().
otherpred <- predict(m1, interval = "confidence")
dat3 <- cbind(dat2, otherpred)
head(dat3)
x1 x2 pred otherpred.fit otherpred.lwr otherpred.upr
1 0.5811232 7 -0.1819824 -0.1819824 -1.9909256 1.6269608
2 21.2709273 10 1.6117017 1.6117017 -1.1127209 4.3361243
3 16.8116932 2 -3.1714559 -3.1714559 -6.9252780 0.5823663
4 -5.8627918 11 2.2095964 2.2095964 -1.0430134 5.4622062
5 -2.6706822 9 1.0138070 1.0138070 -1.2560569 3.2836709
6 -14.5548998 7 -0.1819824 -0.1819824 -1.9909256 1.6269608
fit lwr upr
1 -0.1819824 -1.990926 1.6269608
2 1.6117017 -1.112721 4.3361243
3 -3.1714559 -6.925278 0.5823663
4 2.2095964 -1.043013 5.4622062
5 1.0138070 -1.256057 3.2836709
6 -0.1819824 -1.990926 1.6269608
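What cbind does differently from the `$` assignment is that it turns each column of the matrix into a separate column of the data frame. A minimal sketch, using a hypothetical data frame `toy` and a hand-made matrix `m` rather than the predict output:

```r
# Hypothetical data frame and matrix with named columns
toy <- data.frame(x = 1:3)
m <- cbind(fit = c(1, 2, 3), lwr = c(0, 1, 2))

flat <- cbind(toy, m)

names(flat)  # each matrix column is now a real column
flat$fit     # and can be accessed by name
```

Here the names reported by `names()` agree with what is printed, so ordinary column access works.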
dat4 <- merge(dat2, otherpred, by = "row.names")
head(dat4)
Row.names x1 x2 pred otherpred.fit otherpred.lwr
1 1 0.5811232 7 -0.1819824 -0.1819824 -1.9909256
2 10 -3.5546124 6 -0.7798771 -0.7798771 -2.6936340
3 100 -13.4315275 7 -0.1819824 -0.1819824 -1.9909256
4 11 -3.6873950 9 1.0138070 1.0138070 -1.2560569
5 12 -10.3701365 4 -1.9756665 -1.9756665 -4.6408554
6 13 14.6131337 8 0.4159123 0.4159123 -1.5254483
otherpred.upr fit lwr upr
1 1.6269608 -0.1819824 -1.990926 1.6269608
2 1.1338798 -0.7798771 -2.693634 1.1338798
3 1.6269608 -0.1819824 -1.990926 1.6269608
4 3.2836709 1.0138070 -1.256057 3.2836709
5 0.6895225 -1.9756665 -4.640855 0.6895225
6 2.3572729 0.4159123 -1.525448 2.3572729
I prefer using merge because, in the olden days, it dealt with missing values in a more graceful way. Today, I don’t think it matters much. Unless I do the merge incorrectly.
I thought it would be easy to show those are identical, but I’m having some trouble. I think my merge is wrong.
Get rid of that first column in dat4
dat4[ , "Row.names"] <- NULL
all.equal(dat3, dat4)
[1] "Attributes: < Component \"row.names\": Modes: character, numeric >"
[2] "Attributes: < Component \"row.names\": target is character, current is numeric >"
[3] "Component \"x1\": Mean relative difference: 1.329478"
[4] "Component \"x2\": Mean relative difference: 0.4964789"
[5] "Component \"pred\": Mean relative difference: 1.545756"
[6] "Component \"otherpred\": Attributes: < Component \"dimnames\": Component 1: 89 string mismatches >"
[7] "Component \"otherpred\": Mean relative difference: 0.9939953"
[8] "Component \"fit\": Mean relative difference: 1.545756"
[9] "Component \"lwr\": Mean relative difference: 0.7568096"
[10] "Component \"upr\": Mean relative difference: 0.9390171"
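As the output shows, all.equal does not return FALSE when objects differ; it returns a character vector describing the differences. In code that needs a yes/no answer, the usual idiom is to wrap it in isTRUE. A sketch with two made-up data frames `a` and `b`:

```r
# Two hypothetical data frames with the same values in a different row order
a <- data.frame(x = c(1, 2, 3))
b <- data.frame(x = c(3, 2, 1))

same <- isTRUE(all.equal(a, a))       # TRUE
different <- isTRUE(all.equal(a, b))  # FALSE

# Without isTRUE, the result for differing objects is a description
is.character(all.equal(a, b))
```

Row order matters to all.equal, which is why a reshuffled merge result is reported as different even when it holds the same rows.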
sum(dat3$fit - otherpred[ , "fit"])
[1] 0
sum(dat3$lwr - otherpred[ , "lwr"])
[1] 0
sum(abs(dat4$fit - otherpred[ , "fit"]))
[1] 168.6063
sum(abs(dat4$lwr - otherpred[ , "lwr"]))
[1] 158.6835
plot(dat4$fit, otherpred[ , "fit"])
Humphf!
dat4 <- merge(dat2, otherpred, by = "row.names")
head(dat4)
Row.names x1 x2 pred otherpred.fit otherpred.lwr
1 1 0.5811232 7 -0.1819824 -0.1819824 -1.9909256
2 10 -3.5546124 6 -0.7798771 -0.7798771 -2.6936340
3 100 -13.4315275 7 -0.1819824 -0.1819824 -1.9909256
4 11 -3.6873950 9 1.0138070 1.0138070 -1.2560569
5 12 -10.3701365 4 -1.9756665 -1.9756665 -4.6408554
6 13 14.6131337 8 0.4159123 0.4159123 -1.5254483
otherpred.upr fit lwr upr
1 1.6269608 -0.1819824 -1.990926 1.6269608
2 1.1338798 -0.7798771 -2.693634 1.1338798
3 1.6269608 -0.1819824 -1.990926 1.6269608
4 3.2836709 1.0138070 -1.256057 3.2836709
5 0.6895225 -1.9756665 -4.640855 0.6895225
6 2.3572729 0.4159123 -1.525448 2.3572729
Now I see what’s wrong. Once again, I was bone-crushed by the merge function’s decision to shuffle my rows.
dat4 <- dat4[order(as.numeric(dat4[ , "Row.names"])), ]
plot(dat4$fit, otherpred[ , "fit"])
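Or, more simply, ask merge not to sort. The documentation for merge says that rows are in an unspecified order when sort = FALSE, so it is worth checking the result: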
dat4 <- merge(dat2, otherpred, by = "row.names", sort = FALSE)
plot(dat4$fit, otherpred[ , "fit"])
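The row shuffle can be reproduced on a small scale. In this sketch the data frames `left` and `right` are hypothetical. Row names are character strings, so merge sorts them as text, and "10" comes before "2".

```r
# Two hypothetical data frames with default row names "1" to "12"
left <- data.frame(a = 1:12)
right <- data.frame(b = (1:12) * 10)

m <- merge(left, right, by = "row.names")

# Row names are sorted as text, not as numbers
first_keys <- head(as.character(m$Row.names), 4)
first_keys

# Restore the original order by sorting on the numeric value of the key
m <- m[order(as.integer(as.character(m$Row.names))), ]
m$a
```

Converting the key to a number before calling `order()` is the same repair the guide applies to `dat4`.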
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets base
other attached packages:
[1] rockchalk_1.8.108 crmda_0.45
loaded via a namespace (and not attached):
[1] Rcpp_0.12.12 knitr_1.17 magrittr_1.5
[4] kutils_1.21 splines_3.4.1 MASS_7.3-47
[7] xtable_1.8-2 lattice_0.20-35 minqa_1.2.4
[10] stringr_1.2.0 car_2.1-5 plyr_1.8.4
[13] tools_3.4.1 parallel_3.4.1 nnet_7.3-12
[16] pbkrtest_0.4-7 grid_3.4.1 nlme_3.1-131
[19] mgcv_1.8-22 quantreg_5.33 MatrixModels_0.4-1
[22] htmltools_0.3.6 yaml_2.1.14 lme4_1.1-13
[25] rprojroot_1.2 digest_0.6.12 Matrix_1.2-11
[28] nloptr_1.0.4 evaluate_0.10.1 rmarkdown_1.6
[31] openxlsx_4.0.17 stringi_1.1.5 compiler_3.4.1
[34] methods_3.4.1 backports_1.1.0 SparseM_1.77
Available under Creative Commons license 3.0