Abstract
All about lists in R!
A list is a diverse collection of R objects. Any R object can be inserted in a list.
A list is highly flexible. In versatility, a list is the complete opposite of an R vector or a matrix.
Recall a vector or matrix must be made up of homogeneous elements. If we add an element in a vector (or matrix), it can happen that the entire vector (or matrix) changes as a result. (Recall inserting a character into a numeric vector?)
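As a reminder of that behavior, here is a minimal sketch contrasting what happens when a character value is added to a numeric vector and to a list. The object names (v, v2, l, l2) are made up for this illustration.

```r
# A numeric vector must stay homogeneous
v <- c(1, 2, 3)
class(v)

# Adding a character value coerces every element to character
v2 <- c(v, "four")
class(v2)
v2

# A list keeps each element's own type when something new is added
l <- list(1, 2, 3)
l2 <- c(l, "four")
class(l2[[1]])
class(l2[[4]])
```

The first three elements of l2 are still numeric, while every element of v2 has become a character string.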
Several methods of inserting elements into lists and extracting them are discussed below.
First, we create a small example list for inspection. This is a named list because I insert a name with each element.
mylist <- list("x" = c(1, 2, 3), "y" = matrix(rnorm(16), 4), "z" = "Paul")
names(mylist)
[1] "x" "y" "z"
length(mylist)
[1] 3
This is an unnamed list:
nonamelist <- list(c(1, 2, 3), matrix(rnorm(16), 4), "Paul")
length(nonamelist)
[1] 3
nonamelist
[[1]]
[1] 1 2 3
[[2]]
[,1] [,2] [,3] [,4]
[1,] 0.9078962 0.88526266 0.6956525 1.0344293
[2,] 1.0670807 -0.93670030 1.5539388 -0.9817459
[3,] -0.4885171 -0.01102196 1.7898020 0.2659287
[4,] -1.0072824 0.12531943 -0.7290195 -1.6771190
[[3]]
[1] "Paul"
You agree it has no names, right?
names(nonamelist)
NULL
The elements of a named list can be accessed either by their name or their index number, while an unnamed list allows access only by the index number.
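A small sketch of the access styles, using two throwaway lists (named and unnamed are illustrative names, not objects used elsewhere in this document):

```r
named <- list(x = c(1, 2, 3), z = "Paul")
unnamed <- list(c(1, 2, 3), "Paul")

# Three ways to reach the same element of a named list
named[["x"]]
named$x
named[[1]]

# In an unnamed list only the position works;
# asking for a name that does not exist returns NULL rather than an error
unnamed[[1]]
unnamed[["x"]]
unnamed$x
```

Because a failed name lookup on a list returns NULL silently, a misspelled name can go unnoticed, so it is worth checking names() when a result is unexpectedly NULL.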
One will find comments here and there in the literature to suggest that lists will be processed more quickly in R if they do not have named elements.
If you want to remove the names from an object, there are two ways.
unname(mylist)
[[1]]
[1] 1 2 3
[[2]]
[,1] [,2] [,3] [,4]
[1,] 1.0247571 0.2222761 0.5916056 0.17307361
[2,] -0.9177907 -0.8432722 0.4910802 -0.09515716
[3,] -0.1960174 1.0017147 0.9647557 -0.16671735
[4,] 1.0909467 -0.7307711 0.4965228 0.33657169
[[3]]
[1] "Paul"
or, to remove the names from mylist itself (unname() returns a copy and leaves mylist unchanged),
names(mylist) <- NULL
mylist
[[1]]
[1] 1 2 3
[[2]]
[,1] [,2] [,3] [,4]
[1,] 1.0247571 0.2222761 0.5916056 0.17307361
[2,] -0.9177907 -0.8432722 0.4910802 -0.09515716
[3,] -0.1960174 1.0017147 0.9647557 -0.16671735
[4,] 1.0909467 -0.7307711 0.4965228 0.33657169
[[3]]
[1] "Paul"
But the gosh darned names are needed for the rest of the presentation, so
names(mylist) <- c("x", "y", "z")
A single bracket [ extracts a subset of the list and keeps the result as a new list.
mylist2 <- mylist[c(1,3)]
mylist2
$x
[1] 1 2 3
$z
[1] "Paul"
class(mylist2)
[1] "list"
length(mylist2)
[1] 2
The double bracket [[ copies a single object out of the list. The result is not a list anymore; it has the class of the object that was stored.
I’ll access that element by name first:
mymat1 <- mylist[["y"]]
mymat1
[,1] [,2] [,3] [,4]
[1,] 1.0247571 0.2222761 0.5916056 0.17307361
[2,] -0.9177907 -0.8432722 0.4910802 -0.09515716
[3,] -0.1960174 1.0017147 0.9647557 -0.16671735
[4,] 1.0909467 -0.7307711 0.4965228 0.33657169
class(mymat1)
[1] "matrix"
Then I access that by list position with an integer index:
mymat2 <- mylist[[2]]
mymat2
[,1] [,2] [,3] [,4]
[1,] 1.0247571 0.2222761 0.5916056 0.17307361
[2,] -0.9177907 -0.8432722 0.4910802 -0.09515716
[3,] -0.1960174 1.0017147 0.9647557 -0.16671735
[4,] 1.0909467 -0.7307711 0.4965228 0.33657169
class(mymat2)
[1] "matrix"
identical(mymat1, mymat2)
[1] TRUE
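To put the two bracket styles side by side, here is a minimal sketch on a small stand-in list (demo, sub, and obj are names invented for this example):

```r
demo <- list(x = c(1, 2, 3), y = matrix(1:4, nrow = 2), z = "Paul")

# Single bracket: the result is still a list, here of length 1
sub <- demo["y"]
class(sub)
length(sub)

# Double bracket: the result is the stored object itself
obj <- demo[["y"]]
is.matrix(obj)
dim(obj)

# To use the matrix, the single-bracket result must be unwrapped first
identical(sub[[1]], obj)
```

A common mistake is to pass demo["y"] to a function that expects a matrix; the function receives a one-element list instead.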
There are two ways to create a new list that will be filled in later. The first is the common, easy way. The second is the faster, more structured way.
mylist1 <- list()
mylist2 <- vector(mode = "list", length = 6)
The major difference between the two types arises when we want to put the lists to use. In the case of mylist1, we are allowed to add items one by one, either by name or position in the list:
x1 <- c(1, 2, 3)
x2 <- matrix(rnorm(9), ncol = 3)
mylist1[[1]] <- x1
mylist1[["x1"]] <- x1
mylist1[[3]] <- x1
Note that, as far as mylist1 is concerned, the first item is [[1]], the second item can be found either as [[2]] or [["x1"]], and the third item is [[3]]:
mylist1
[[1]]
[1] 1 2 3
$x1
[1] 1 2 3
[[3]]
[1] 1 2 3
mylist1[["x1"]]
[1] 1 2 3
mylist1[[2]]
[1] 1 2 3
The list had only 3 elements, but if we insert a 6th element, R fills positions 4 and 5 with NULL:
mylist1[[6]] <- x2
mylist1
[[1]]
[1] 1 2 3
$x1
[1] 1 2 3
[[3]]
[1] 1 2 3
[[4]]
NULL
[[5]]
NULL
[[6]]
[,1] [,2] [,3]
[1,] -0.2335816 0.6475242 -1.0499809
[2,] 1.4948868 -0.6139656 -0.9746328
[3,] -0.6158786 0.3947409 -1.0132724
Remember that the absence of an element in a list is referred to by the symbol NULL, not NA (as for vectors and matrices).
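One practical consequence is that empty positions in a list are found with is.null(), not is.na(). The sketch below uses a hypothetical four-slot list called slots and applies is.null() to each element with vapply(), which returns one TRUE or FALSE per element:

```r
# A preallocated list holds NULL in every unfilled slot
slots <- vector(mode = "list", length = 4)
slots[[2]] <- c(1, 2, 3)

# is.null() applied to each element flags the empty positions
empty <- vapply(slots, is.null, FUN.VALUE = logical(1))
empty
which(empty)

# In an atomic vector a missing value is NA, which still occupies a position
v <- c(1, NA, 3)
is.na(v)
length(v)
```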
The difference with mylist2 is that we cannot insert named elements into the middle of the list in the same way. Because the list was allocated with elements 1 through 6 as NULL, inserting a named element "x1" adds a 7th element to the list:
mylist2[[1]] <- x1
mylist2[["x1"]] <- x1
mylist2[[3]] <- x1
mylist2
[[1]]
[1] 1 2 3
[[2]]
NULL
[[3]]
[1] 1 2 3
[[4]]
NULL
[[5]]
NULL
[[6]]
NULL
$x1
[1] 1 2 3
If we want to insert the matrix in the 6th element we can, of course:
mylist2[[6]] <- x2
mylist2
[[1]]
[1] 1 2 3
[[2]]
NULL
[[3]]
[1] 1 2 3
[[4]]
NULL
[[5]]
NULL
[[6]]
[,1] [,2] [,3]
[1,] -0.2335816 0.6475242 -1.0499809
[2,] 1.4948868 -0.6139656 -0.9746328
[3,] -0.6158786 0.3947409 -1.0132724
$x1
[1] 1 2 3
If we decide we want the elements to be named, we can do so with the names function:
## only insert names for 6th and 7th items:
names(mylist2)[6:7] <- c("x1", "x2")
mylist2
[[1]]
[1] 1 2 3
[[2]]
NULL
[[3]]
[1] 1 2 3
[[4]]
NULL
[[5]]
NULL
$x1
[,1] [,2] [,3]
[1,] -0.2335816 0.6475242 -1.0499809
[2,] 1.4948868 -0.6139656 -0.9746328
[3,] -0.6158786 0.3947409 -1.0132724
$x2
[1] 1 2 3
names(mylist2)
[1] "" "" "" "" "" "x1" "x2"
Conclusion: If you are going to generate a lot of objects for a list, it is best to allocate the whole list first and fill in the elements with [[index_number]] <- ...
If you want a more flexible list, in which you can insert things with names as you go, the simple approach is to initiate the list with list(), but insertion of items is slower.
Allocation of memory is slow, so one argument in favor of the second strategy is that we allocate storage in one step. This is more efficient.
I wondered if it really is more efficient. The right thing would be to formalize this as a microbenchmark experiment, but the system.time function gives a quick snapshot:
alist <- list()
system.time(
for(i in 1:10000){
alist[[i]] <- matrix(rnorm(9), ncol = 3)
})
user system elapsed
0.284 0.008 0.293
alist2 <- vector("list", 10000)
system.time(
for(i in 1:10000){
alist2[[i]] <- matrix(rnorm(9), ncol = 3)
})
user system elapsed
0.056 0.000 0.053
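Short of a full microbenchmark experiment, one step up from a single system.time call is to repeat each timing a few times and compare medians. The following is only a sketch in base R: the helper names grow, prealloc, and time_it are invented here, and the numbers will vary with the machine and the R version.

```r
# Grow a list one element at a time
grow <- function(n) {
  out <- list()
  for (i in seq_len(n)) out[[i]] <- matrix(rnorm(9), ncol = 3)
  out
}

# Allocate all n slots first, then fill them in
prealloc <- function(n) {
  out <- vector("list", n)
  for (i in seq_len(n)) out[[i]] <- matrix(rnorm(9), ncol = 3)
  out
}

# Repeat each timing several times and keep the elapsed seconds
time_it <- function(f, n = 10000, reps = 5) {
  replicate(reps, system.time(f(n))[["elapsed"]])
}

grow_times <- time_it(grow)
prealloc_times <- time_it(prealloc)

# Compare medians rather than relying on a single run
median(grow_times)
median(prealloc_times)
```

Wrapping each strategy in a function keeps the two loops identical except for the allocation step, so any difference in the medians can be attributed to that step.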
There is a middle ground with the second style. We can create a list with 10 elements and then name them. If we do that, then we can insert things by name. For example, create a list with 10 named slots for 10 models:
mylist3 <- vector(mode = "list", length = 10)
names(mylist3) <- paste0("mod", 1:10)
mylist3
$mod1
NULL
$mod2
NULL
$mod3
NULL
$mod4
NULL
$mod5
NULL
$mod6
NULL
$mod7
NULL
$mod8
NULL
$mod9
NULL
$mod10
NULL
Now let's run a data-generator 10 times and fill those in:
set.seed(234234)
mdg <- function(N = 100, beta = c(0.1, 0.3, 0.1), stde = 7)
{
e <- rnorm(N, mean = 0, sd = stde)
## oops, don't know parm for predictors
x1 <- rnorm(N, mean = 40, sd = 10)
x2 <- rnorm(N, mean = 20, sd = 40)
y <- beta[1] + beta[2] * x1 + beta[3] * x2 + e
invisible(data.frame(x1, x2, y))
}
for (i in 1:10){
adf <- mdg()
amodel <- lm(y ~ x1 + x2, data = adf)
mylist3[[paste0("mod", i)]] <- summary(amodel)
}
It is pretty easy to verify that each element in this list is a summary object from the fitted regression.
mylist3[[7]]
Call:
lm(formula = y ~ x1 + x2, data = adf)
Residuals:
Min 1Q Median 3Q Max
-20.1051 -5.7792 -0.0997 3.9366 17.7399
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.84084 2.95784 -0.284 0.777
x1 0.31784 0.07355 4.322 3.75e-05 ***
x2 0.11070 0.01814 6.103 2.13e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.424 on 97 degrees of freedom
Multiple R-squared: 0.3831, Adjusted R-squared: 0.3704
F-statistic: 30.12 on 2 and 97 DF, p-value: 6.666e-11
class(mylist3[[7]])
[1] "summary.lm"
A function, such as “class” or “print”, can be applied to each element in the list in this way.
lapply(mylist3, class)
$mod1
[1] "summary.lm"
$mod2
[1] "summary.lm"
$mod3
[1] "summary.lm"
$mod4
[1] "summary.lm"
$mod5
[1] "summary.lm"
$mod6
[1] "summary.lm"
$mod7
[1] "summary.lm"
$mod8
[1] "summary.lm"
$mod9
[1] "summary.lm"
$mod10
[1] "summary.lm"
For practical purposes, that is the same as “looping” over the elements like this:
for(i in seq_along(mylist3)){
print(class(mylist3[[i]]))
}
[1] "summary.lm"
[1] "summary.lm"
[1] "summary.lm"
[1] "summary.lm"
[1] "summary.lm"
[1] "summary.lm"
[1] "summary.lm"
[1] "summary.lm"
[1] "summary.lm"
[1] "summary.lm"
(The “print()” is needed because, without it, the for loop does not display the output from commands).
Watch out about using for loops. There is social stigma! If you go to StackExchange or the “r-help” list with example code that uses a for loop, you will often be shouted at because for loops are slow in R.
While this is a slight exaggeration, there are cases where clever use of the lapply() iteration structure is faster. Generally, the reason is that R can look at the request and plan ahead for its calculations, while the for loop hides the long-run details from R, so chores like memory allocation cannot be managed as efficiently. Another fact is that “[” and “[[” are decidedly slow operators. We are forcing R to talk back and forth between the R runtime, which is written in C, and the user workspace, which is slowed down by the fact that it is interactive.
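The memory-allocation chore can be made visible in a small sketch. With a for loop we allocate and name the result list ourselves; with lapply() that bookkeeping is done for us. The data list dats and the result names are hypothetical, chosen only for this illustration.

```r
set.seed(1)
dats <- list(a = rnorm(20), b = rnorm(20), c = rnorm(20))

# for loop: allocate the result first, copy the names, then fill by position
res_for <- vector("list", length(dats))
names(res_for) <- names(dats)
for (i in seq_along(dats)) {
  res_for[[i]] <- mean(dats[[i]])
}

# lapply: allocation, naming, and iteration are handled for us
res_lapply <- lapply(dats, mean)

# Both approaches produce the same named list
identical(res_for, res_lapply)
```

Whatever the speed difference turns out to be on a given system, the lapply() version has fewer places for an indexing mistake to hide.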
One reason we use lapply is not simply to print things, but to create a new list that has the result of calculations, with each list element treated one-by-one.
coeflist <- lapply(mylist3, coef)
coeflist[1:3]
$mod1
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.5402023 2.24498675 -1.131500 2.606344e-01
x1 0.3434758 0.05184432 6.625138 1.944652e-09
x2 0.1061137 0.01379968 7.689579 1.218261e-11
$mod2
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3238934 3.04772533 0.1062738 9.155846e-01
x1 0.3101999 0.07297289 4.2508923 4.899589e-05
x2 0.1032876 0.02125323 4.8598538 4.518067e-06
$mod3
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.76945876 2.93551549 -1.965399 5.222797e-02
x1 0.48123950 0.07103721 6.774471 9.673084e-10
x2 0.07685268 0.01726574 4.451166 2.282910e-05
Somebody said they only want to keep the P values.
pvallist <- lapply(mylist3, function(x){
mycoefs <- coef(x)
pvals <- mycoefs[ , "Pr(>|t|)"]
pvals
})
pvallist
$mod1
(Intercept) x1 x2
2.606344e-01 1.944652e-09 1.218261e-11
$mod2
(Intercept) x1 x2
9.155846e-01 4.899589e-05 4.518067e-06
$mod3
(Intercept) x1 x2
5.222797e-02 9.673084e-10 2.282910e-05
$mod4
(Intercept) x1 x2
7.628776e-01 6.832719e-05 8.846921e-07
$mod5
(Intercept) x1 x2
9.459785e-01 1.007897e-05 7.730988e-09
$mod6
(Intercept) x1 x2
5.351649e-01 2.337192e-04 9.783522e-06
$mod7
(Intercept) x1 x2
7.768047e-01 3.750110e-05 2.134851e-08
$mod8
(Intercept) x1 x2
3.175908e-01 3.725406e-05 1.386217e-05
$mod9
(Intercept) x1 x2
2.135304e-02 2.609510e-09 1.827298e-13
$mod10
(Intercept) x1 x2
1.459893e-01 6.679944e-03 5.533486e-08
sapply and vapply
The return from that is a series of vectors, but we might like to have it as a matrix instead. Many authors suggest the use of R’s “sapply” for that:
sapply(mylist3, function(x){
mycoefs <- coef(x)
pvals <- mycoefs[ , "Pr(>|t|)"]
pvals
})
mod1 mod2 mod3 mod4
(Intercept) 2.606344e-01 9.155846e-01 5.222797e-02 7.628776e-01
x1 1.944652e-09 4.899589e-05 9.673084e-10 6.832719e-05
x2 1.218261e-11 4.518067e-06 2.282910e-05 8.846921e-07
mod5 mod6 mod7 mod8
(Intercept) 9.459785e-01 5.351649e-01 7.768047e-01 3.175908e-01
x1 1.007897e-05 2.337192e-04 3.750110e-05 3.725406e-05
x2 7.730988e-09 9.783522e-06 2.134851e-08 1.386217e-05
mod9 mod10
(Intercept) 2.135304e-02 1.459893e-01
x1 2.609510e-09 6.679944e-03
x2 1.827298e-13 5.533486e-08
IMPORTANT: Note the return is a 3 x 10 matrix, one column for each list element. Did you expect that? I expected the transpose.
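If one row per model is what you want, the result can be transposed, or the vectors can be stacked as rows directly. This sketch uses a small made-up list pv in place of the real p-values, so the numbers are placeholders:

```r
# Hypothetical stand-in for a list of named p-value vectors
pv <- list(mod1 = c("(Intercept)" = 0.26, x1 = 0.001, x2 = 0.002),
           mod2 = c("(Intercept)" = 0.92, x1 = 0.003, x2 = 0.004))

# sapply binds each returned vector as a column
bycol <- sapply(pv, function(x) x)
dim(bycol)

# t() gives one row per model instead
byrow <- t(bycol)
dim(byrow)
rownames(byrow)

# do.call(rbind, ...) builds the row layout directly from the list
byrow2 <- do.call(rbind, pv)
all.equal(byrow, byrow2)
```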
Although sapply() is widely used, Hadley Wickham suggests in Advanced R that we focus instead on learning to use vapply():
vapply(mylist3, function(x){
mycoefs <- coef(x)
pvals <- mycoefs[ , "Pr(>|t|)"]
pvals
}, FUN.VALUE = numeric(3))
mod1 mod2 mod3 mod4
(Intercept) 2.606344e-01 9.155846e-01 5.222797e-02 7.628776e-01
x1 1.944652e-09 4.899589e-05 9.673084e-10 6.832719e-05
x2 1.218261e-11 4.518067e-06 2.282910e-05 8.846921e-07
mod5 mod6 mod7 mod8
(Intercept) 9.459785e-01 5.351649e-01 7.768047e-01 3.175908e-01
x1 1.007897e-05 2.337192e-04 3.750110e-05 3.725406e-05
x2 7.730988e-09 9.783522e-06 2.134851e-08 1.386217e-05
mod9 mod10
(Intercept) 2.135304e-02 1.459893e-01
x1 2.609510e-09 6.679944e-03
x2 1.827298e-13 5.533486e-08
Note the difference is the argument FUN.VALUE, where we specify the structure of an individual returned element.
vapply() is preferred because it is less likely to give us a result we don’t expect. We told it we think each iteration should return a numeric vector with 3 elements, so R knew what to watch for. If the return did not match that criterion, we would have received an error.
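To see that safeguard in action, here is a minimal sketch with a deliberately irregular input. The list ragged is hypothetical; its second element is shorter than the others, and tryCatch() is used only so that the error message can be inspected instead of halting the script.

```r
# Hypothetical input: element "b" has 2 values, the others have 3
ragged <- list(a = c(1, 2, 3), b = c(4, 5), c = c(6, 7, 8))

# sapply cannot simplify unequal lengths, so it quietly returns a list
s <- sapply(ragged, function(x) x * 2)
class(s)

# vapply stops with an error because element "b" is not a numeric(3)
v <- tryCatch(
  vapply(ragged, function(x) x * 2, FUN.VALUE = numeric(3)),
  error = function(e) conditionMessage(e)
)
v
```

With sapply() the change from matrix to list would only surface later, when some downstream code tried to index the result as a matrix.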
Admittedly, the documentation for vapply is poor and I would never have understood the point of this function without reading Advanced R.
rsq <- vapply(mylist3, function(x){
x$r.squared
}, FUN.VALUE = numeric(1))
rsq
mod1 mod2 mod3 mod4 mod5 mod6 mod7
0.5416602 0.2731373 0.3990317 0.3123826 0.4085839 0.3177255 0.3831451
mod8 mod9 mod10
0.2881965 0.5535587 0.3094512
hist(rsq, main = "R Square is the only thing I care about",
xlab = expression(R^2), prob = TRUE)
If a list is a collection of vectors, unlist will take them apart:
alist <- list(1:4, 32:44, rnorm(10))
avec <- unlist(alist)
avec
[1] 1.00000000 2.00000000 3.00000000 4.00000000 32.00000000
[6] 33.00000000 34.00000000 35.00000000 36.00000000 37.00000000
[11] 38.00000000 39.00000000 40.00000000 41.00000000 42.00000000
[16] 43.00000000 44.00000000 0.26628675 1.64484304 -0.91627126
[21] 0.41936098 -0.23667887 -1.88187556 -1.57610338 -0.19895519
[26] 1.17037463 -0.07369298
class(avec)
[1] "numeric"
alist <- list(1:4, 32:44, c("Paul", "Joe"))
avec <- unlist(alist)
avec
[1] "1" "2" "3" "4" "32" "33" "34" "35" "36"
[10] "37" "38" "39" "40" "41" "42" "43" "44" "Paul"
[19] "Joe"
class(avec)
[1] "character"
Sometimes unlisting is more aggressive than we expect. Run unlist(mylist3) and you’ll see what 10 regressions look like when all of their numbers are flattened into a single vector.
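The same flattening can be studied on a much smaller nested list. The sketch below uses a hypothetical list called nested and shows two arguments of unlist(), recursive and use.names, that control how far the flattening goes and whether names are generated:

```r
nested <- list(a = 1:2, b = list(c = 3, d = 4:5))

# Full flattening: names are pasted together and numbered
flat <- unlist(nested)
flat
names(flat)

# recursive = FALSE removes only one level of nesting, so the result is a list
onelevel <- unlist(nested, recursive = FALSE)
class(onelevel)
names(onelevel)

# use.names = FALSE drops the generated names
nonames <- unlist(nested, use.names = FALSE)
nonames
```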
To remove an element from a list, it must be assigned the NULL value:
nonamelist[[3]] <- NULL
nonamelist
[[1]]
[1] 1 2 3
[[2]]
[,1] [,2] [,3] [,4]
[1,] 0.9078962 0.88526266 0.6956525 1.0344293
[2,] 1.0670807 -0.93670030 1.5539388 -0.9817459
[3,] -0.4885171 -0.01102196 1.7898020 0.2659287
[4,] -1.0072824 0.12531943 -0.7290195 -1.6771190
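Removing an element is different from storing NULL in a slot, and the bracket style decides which one happens. A minimal sketch, using a throwaway list called items:

```r
items <- list(a = 1, b = "two", c = 3)

# Assigning NULL with [[ ]] deletes the element and shortens the list
removed <- items
removed[["b"]] <- NULL
length(removed)
names(removed)

# Assigning list(NULL) with [ ] keeps the slot and stores NULL in it
kept <- items
kept["b"] <- list(NULL)
length(kept)
is.null(kept[["b"]])

# Several elements can be dropped at once with [ ]
several <- items
several[c("a", "c")] <- NULL
names(several)
```

After a removal the later elements move up one position, so deleting by index inside a loop is safer when done from the end of the list toward the start.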
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] knitr_1.14 rmarkdown_1.0
loaded via a namespace (and not attached):
[1] magrittr_1.5 formatR_1.4 tools_3.3.1 htmltools_0.3.5
[5] yaml_2.1.13 Rcpp_0.12.6 stringi_1.1.1 stringr_1.1.0
[9] digest_0.6.10 evaluate_0.9
Available under Creative Commons license 3.0