Abstract

The most common recoding chore that comes with multilevel models is creating the group mean and individual deviations within group variables. This describes how that can be done with tools provided in the base of R, as well as in some addon packages.

Multilevel modelers need group means

Multilevel modelers might need group-level summary information for various reasons. They often want to change a multilevel regression from this

\[\begin{equation} y_{ji} = \beta_0 + \beta_1 x_{ji} + \epsilon_{ji} \label{eq:eq1} \end{equation}\]

\[ y_{ji} = \beta_0 + \beta_1 \bar{x}_{j.} + \beta_2 \widetilde{x}_{ji} + \epsilon_{ji} \label{eq:eq2} \]

where \(\bar{x}_{j.}\) is the mean of \(x_{ji}\) within group \(j\) and \(\widetilde{x}_{ji} = x_{ji} -\bar{x}_{j.}\). The variable \(\widetilde{x}_{ji}\) is the “group mean-centered” value of \(x_{ji}\).

Changing the specification from the first equation to the second equation is a way of checking for the model’s specification. If the effect of changing the within group mean is markedly different from changing the value of the individual’s score, then the ecological fallacy might be in evidence. This specification is widely known as “Mundlak’s suggestion” for exploring the effect of a numeric predictor at the contextual and individual levels.

There is an extended discussion of this coding in Snijders and Bosker(). It turns out that the preceeding model is statistically equivalent to this,

\[ y_{ji} = \beta_0 + \beta_3 \bar{x}_{j.} + \beta_4 x_{ji} + \epsilon_{ji} \label{eq:eq3} \]

This model appears to be more simple because it includes the individual variable, \(x_{ji}\), as a predictor, rather than \(\widetilde{x}_{ji}\). However, this fit is statistically equivalent. The fitted regression lines are identical and the difference between the fits is a bit of an illusion. All of the information contained in one is actually present in the other.

\[\begin{align} y_{ji} & = \beta_0 + \beta_1 \bar{x}_{j.} + \beta_2 \widetilde{x}_{ji} + \epsilon_{ji}\\ & = \beta_0 + \beta_1 \bar{x}_{j.} + \beta_2( x_{ji} -\bar{x}_{j.}) + \epsilon_{ji}\\ & = \beta_0 + (\beta_1-\beta_2)\bar{x}_{j.} + \beta_2(x_{ji}) + \epsilon_{ji} \end{align}\]

Clearly, the “individual level effect” is the same in both models, \(\beta_2 = \beta_4\). And, equally clearly, the superficial difference we see in the parameter estimates is easily understood as \(\beta_3 = (\beta_1-\beta_2)\).

While these are equivalent models, estimating a model with \(\widetilde{x}_{ji}\), the individual deviations within groups, is considered to be suggestive by some authors. It is a way of explicitly estimationg a “big fish in a small pond” effect, as if to say that individual variations about the group mean are somehow more interesting that individual variations themselves.

In Raudenbush and Bryk (), the variable \(\widetilde{x}_{ji}\) is given a random intercept. That is an entirely different specification, of course.

In any case, the group means are necessary for the analysis and, for some purposes, the “group mean centered” individual data is thought-provoking.

Example data

The High School and Beyond data set summarizes mathematical and English language achievement for students. I have instructions about downloading and importing that data on this web page:

http://pj.freefaculty.org/guides/stat/DataSets/HSB/00-README.txt

The following will download a copy of the data if one is not already available in the working directory.

if (file.exists("hsb.rds")){
    hsb <- readRDS("hsb.rds")
} else {
    library(foreign)
    hsb.url <- "http://www.stata-press.com/data/mlmus3/hsb.dta"
    hsb <- read.dta(hsb.url)
    saveRDS(hsb, "hsb.rds")
}

hsb$schoolidf <- factor(hsb$schoolid)

Stata offers egen

If a variable is named ses and we want to calculate the mean for each school, we run

egen ses_mn = mean(ses), by(schoolid)

The deviation within group is easy to calculate after that.

gen ses_dev = ses - ses_mn

In Stata, one is accustomed to the idea that the data frame will be altered by the addition of a new variable.

In the Rabe-Hesketh and Skrondal book, a method is describe to use a for loop to iterate through many variables and create group mean columns for them.

Here we can practice by getting the group means for the variables ses and female.

foreach var of varlist ses female {
    egen `var'_mn = mean(`var'), by(schoolid)
}

foreach var of varlist ses  female {
    gen `var'_dev = `var' - `var'_mn
}

How to get the same work done with R

R offers many functions in the base that can produce the group means. The more difficult challenge is to “duplicate” the means back onto the data frame so that we can put them to use.

Base functions

aggregate and by are convenience functions in the R base that are intended to simplify the use of tapply.

ses_mn1 <- aggregate(hsb[, "ses", drop=FALSE], by = list(schoolid = hsb$schoolidf), mean, na.rm = TRUE)
head(ses_mn1)

  schoolid         ses
1     1224 -0.43438298
2     1288  0.12159999
3     1296 -0.42550000
4     1308  0.52800000
5     1317  0.34533333
6     1358 -0.01966667

str(ses_mn1)

'data.frame':   160 obs. of  2 variables:
 $ schoolid: Factor w/ 160 levels "1224","1288",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ ses     : num  -0.434 0.122 -0.426 0.528 0.345 ...

class(ses_mn1)

[1] "data.frame"

ses_mn2 <- by(hsb$ses, hsb$schoolidf, mean, na.rm = TRUE)
head(ses_mn2)

hsb$schoolidf
       1224        1288        1296        1308        1317 
-0.43438298  0.12159999 -0.42550000  0.52800000  0.34533333 
       1358 
-0.01966667

str(ses_mn2)

 by [1:160(1d)] -0.434 0.122 -0.426 0.528 0.345 ...
 - attr(*, "dimnames")=List of 1
  ..$ hsb$schoolidf: chr [1:160] "1224" "1288" "1296" "1308" ...
 - attr(*, "call")= language by.default(data = hsb$ses, INDICES = hsb$schoolidf, FUN = mean,      na.rm = TRUE)

class(ses_mn2)

[1] "by"

ses_mn3 <- tapply(hsb$ses, hsb$schoolidf, mean, na.rm = TRUE)
head(ses_mn3)

       1224        1288        1296        1308        1317 
-0.43438298  0.12159999 -0.42550000  0.52800000  0.34533333 
       1358 
-0.01966667

str(ses_mn3)

 num [1:160(1d)] -0.434 0.122 -0.426 0.528 0.345 ...
 - attr(*, "dimnames")=List of 1
  ..$ : chr [1:160] "1224" "1288" "1296" "1308" ...

class(ses_mn3)

[1] "array"

Using any of these methods, we are able to retrieve the means of the groups.

The challenge is to put those means back onto the data frame. It is easy to get this wrong because the results from tapply do not come back in the expected order. Here is a toy example to demonstrate that.

toy <- data.frame(x1 = rnorm(13),
                  x2 = c(rep("Z", 3), rep("P", 2), rep("A", 3), rep("C", 5)),
                  stringsAsFactors = FALSE)
toy

            x1 x2
1   1.68896279  Z
2  -1.70367563  Z
3   0.05274744  Z
4  -0.38545901  P
5  -0.68566177  P
6  -1.14125006  A
7  -0.43040873  A
8   1.95247009  A
9   0.38286307  C
10 -0.61537067  C
11  0.57555937  C
12  0.02824677  C
13 -1.74853165  C

x1_mn <- tapply(toy$x1, toy$x2, mean)
x1_mn

         A          C          P          Z 
 0.1269371 -0.2754466 -0.5355604  0.0126782

Note that the ordering from top to bottom in the data frame has groups Z, P, A and C, the ordering of tapply is different. tapply creates a factor variable, which, by default, uses the available values in alphabetical order as the levels. Hence, a decision to manufacture a vector by the obvious sort of trick will result in a ghastly error.

To put the data “back onto” the data frame requires the careful use of index subscripts or the merge function.

x1_mn <- aggregate(toy[ , "x1", drop = FALSE], by = list("x2" = toy$x2), mean, na.rm = TRUE)
x1_mn

  x2         x1
1  A  0.1269371
2  C -0.2754466
3  P -0.5355604
4  Z  0.0126782

toy2 <- merge(toy, x1_mn, by = "x2", suffix = c("", "_mn"), sort = FALSE)
toy2

   x2          x1      x1_mn
1   Z  1.68896279  0.0126782
2   Z -1.70367563  0.0126782
3   Z  0.05274744  0.0126782
4   P -0.38545901 -0.5355604
5   P -0.68566177 -0.5355604
6   A -1.14125006  0.1269371
7   A -0.43040873  0.1269371
8   A  1.95247009  0.1269371
9   C  0.38286307 -0.2754466
10  C -0.61537067 -0.2754466
11  C  0.57555937 -0.2754466
12  C  0.02824677 -0.2754466
13  C -1.74853165 -0.2754466

You usually want more than one variable

Variable “mmses” is the mean of ses, within schools, which we verify here Create a vector of variable names for which we need school level means

vars <- c("ses", "female")
ses_mn <- aggregate(hsb[ , vars],
                          by = list("schoolidf" = hsb$schoolidf),  mean, na.rm = TRUE)

hsb2 <- merge(hsb, ses_mn, by = "schoolidf", suffix = c("", "_mn"))
for(i in vars){
    hsb2[ , paste0(i, "_dev")] <- hsb2[ , i] - hsb2[ , paste0(i, "_mn")]
}

An R function like Stata’s egen

##' Generate group summaries and individual deviations within groups
##'
##' Similar to Stata egen, except more versatile and fun! Will
##' create a new data frame with 2 columns. Rows match
##' the rows of the original data frame. 
##'
##' Now I return only the new columns
##' @param dframe a data frame
##' @param x Variable names or a vector of variable names
##' @param by A grouping variable name or a vector of grouping names
##' @param FUN Defaults to the mean, have not tested alternatives
##' @param suffix The suffixes to be added to column 1 and column 2
##' @return new data frame with "x_mn" and "x_dev" added as variables
##' @author Paul Johnson
##' @examples
##' Suppose you get the MLMUS hsb data frame, somehow
##' xx1 <- gmd(hsb, "ses", "schoolidf")
##' xx2 <- gmd(hsb, c("ses", "female"), "schoolidf")
##' xx3 <- gmd(hsb, c("ses", "female"), c("schoolidf", "sector"))
##' xx4 <- gmd(hsb, c("ses", "female"),
##'                    c("schoolidf"), FUN = median)
gmd <- function(dframe, x, by, FUN = mean, suffix = c("_mn", "_dev")){
    xmean <- aggregate(dframe[ , x, drop = FALSE],
                       dframe[ , by, drop = FALSE], FUN,
                       na.rm = TRUE)
    df2 <- merge(dframe, xmean, by = by, suffix = c("", suffix[1]))
    for(i in x){
        df2[ , paste0(i, suffix[2])] <- df2[ , i] - df2[ , paste0(i, suffix[1])]
    }
    df2[ , colnames(df2)[!colnames(df2) %in% colnames(dframe)]]
}


xx1 <- gmd(hsb, "ses", "schoolidf")
head(xx1)

     ses_mn     ses_dev
1 -0.434383 -1.09361702
2 -0.434383 -0.15361702
3 -0.434383 -0.09361702
4 -0.434383 -0.23361701
5 -0.434383  0.27638297
6 -0.434383  0.45638298

## Could combine with hsb
## hsb <- cbind(hsb, xx1)
## head(hsb[ , c("schoolidf", "ses", "ses_mn", "ses_dev")])
xx2 <- gmd(hsb, c("ses", "female"), "schoolidf")
head(xx2)

     ses_mn female_mn     ses_dev female_dev
1 -0.434383 0.5957447 -1.09361702  0.4042553
2 -0.434383 0.5957447 -0.15361702  0.4042553
3 -0.434383 0.5957447 -0.09361702 -0.5957447
4 -0.434383 0.5957447 -0.23361701 -0.5957447
5 -0.434383 0.5957447  0.27638297 -0.5957447
6 -0.434383 0.5957447  0.45638298 -0.5957447

hsb <- cbind(hsb, xx2)
head(hsb[ , c("schoolidf", "ses", "ses_mn", "ses_dev", "female", "female_mn", "female_dev")])

  schoolidf    ses    ses_mn     ses_dev female female_mn female_dev
1      1224 -1.528 -0.434383 -1.09361702      1 0.5957447  0.4042553
2      1224 -0.588 -0.434383 -0.15361702      1 0.5957447  0.4042553
3      1224 -0.528 -0.434383 -0.09361702      0 0.5957447 -0.5957447
4      1224 -0.668 -0.434383 -0.23361701      0 0.5957447 -0.5957447
5      1224 -0.158 -0.434383  0.27638297      0 0.5957447 -0.5957447
6      1224  0.022 -0.434383  0.45638298      0 0.5957447 -0.5957447

Why do I bother with base functions?

This function, which uses only R base tools, is not as fast as possible. The data set is is only 7185 rows with 26 columns, not big data by any sense of the imagination.

If we could simply insert a column into a data.frame without causing a full memory re-write of the data.frame within R, it would be faster. The fear of excess data copying is what fuels innovations during the past 5-10 years, such as the packages data.table plyr and dplyr.

All of these packages sit on top of the famous R package “Rcpp”, which is rightly praised as a revolutionary advance in R for high performance problems. Unfortunately, it is also one of the sources of periodic project failures because certain data sets expose bugs (which are fixed eventually, usually quickly).

Simply put, relying on more contributed packages can cause project failures. Recently, we have had 2 projects come to a halt because of little bugs in packages that were relied upon by other packages that we were using. These problems get fixed, eventually, but why complicate the workflow unnecessarily.

If you are willing to use data.table

When data sets are truly immense, the data.table package offers the only tools I’ve found to quickly sort, merge, and calculate-by-subgroup. data.table introduces an entirely new style of R programming, touted by its advocates as more “elegant” and “understandable”. The CRMDA has a data.table guide available in our collection.

Here is an example of data.table in action.

library(data.table)
hsbdt <- as.data.table(hsb)
setkey(hsbdt, schoolidf)
hsbdt[ , ses_mn2 := mean(ses, na.rm = TRUE), by = schoolidf]

      minority female    ses mathach size sector pracad disclim
   1:        0      1 -1.528   5.876  842      0   0.35   1.597
   2:        0      1 -0.588  19.708  842      0   0.35   1.597
   3:        0      0 -0.528  20.349  842      0   0.35   1.597
   4:        0      0 -0.668   8.781  842      0   0.35   1.597
   5:        0      0 -0.158  17.898  842      0   0.35   1.597
  ---                                                          
7181:        0      1  1.512  20.402  262      1   1.00  -2.416
7182:        0      1 -0.038  14.794  262      1   1.00  -2.416
7183:        0      1  1.332  19.641  262      1   1.00  -2.416
7184:        0      1 -0.008  16.241  262      1   1.00  -2.416
7185:        0      1  0.792  22.733  262      1   1.00  -2.416
      himinty schoolid      mean       sd    sdalt       junk
   1:       0     1224  9.715446 7.592785 6.256328  45.710770
   2:       0     1224  9.715446 7.592785 6.256328  49.999413
   3:       0     1224  9.715446 7.592785 6.256328  59.475361
   4:       0     1224  9.715446 7.592785 6.256328  14.868533
   5:       0     1224  9.715446 7.592785 6.256328  27.678404
  ---                                                        
7181:       0     9586 14.863695 6.415999 6.256328  60.295639
7182:       0     9586 14.863695 6.415999 6.256328   4.652761
7183:       0     9586 14.863695 6.415999 6.256328  49.056393
7184:       0     9586 14.863695 6.415999 6.256328  12.988998
7185:       0     9586 14.863695 6.415999 6.256328 101.929749
        sdalt2 num       se     sealt    sealt2       t2    t2alt
   1: 48.39363  47 1.107522 0.9125792 1.0147176 6.958498 8.289523
   2: 48.39363  47 1.107522 0.9125792 1.0147176 6.958498 8.289523
   3: 48.39363  47 1.107522 0.9125792 1.0147176 6.958498 8.289523
   4: 48.39363  47 1.107522 0.9125792 1.0147176 6.958498 8.289523
   5: 48.39363  47 1.107522 0.9125792 1.0147176 6.958498 8.289523
  ---                                                            
7181: 48.39363  59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7182: 48.39363  59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7183: 48.39363  59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7184: 48.39363  59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7185: 48.39363  59 0.835292 0.8145045 0.9056661 7.106490 6.044989
      pickone      mmses      mnses       xb       resid schoolidf
   1:       1 -0.4343830 -0.4343830 10.13759 -4.26158857      1224
   2:       0 -0.4343830 -0.4343830 10.13759  9.57041168      1224
   3:       0 -0.4343830 -0.4343830 10.13759 10.21141243      1224
   4:       0 -0.4343830 -0.4343830 10.13759 -1.35658836      1224
   5:       0 -0.4343830 -0.4343830 10.13759  7.76041222      1224
  ---                                                             
7181:       0  0.6211525  0.6211525 16.32676  4.07523918      9586
7182:       0  0.6211525  0.6211525 16.32676 -1.53276157      9586
7183:       0  0.6211525  0.6211525 16.32676  3.31423950      9586
7184:       0  0.6211525  0.6211525 16.32676 -0.08576202      9586
7185:       0  0.6211525  0.6211525 16.32676  6.40623856      9586
          ses_mn female_mn     ses_dev female_dev    ses_mn2
   1: -0.4343830 0.5957447 -1.09361702  0.4042553 -0.4343830
   2: -0.4343830 0.5957447 -0.15361702  0.4042553 -0.4343830
   3: -0.4343830 0.5957447 -0.09361702 -0.5957447 -0.4343830
   4: -0.4343830 0.5957447 -0.23361701 -0.5957447 -0.4343830
   5: -0.4343830 0.5957447  0.27638297 -0.5957447 -0.4343830
  ---                                                       
7181:  0.6211525 1.0000000  0.89084742  0.0000000  0.6211525
7182:  0.6211525 1.0000000 -0.65915254  0.0000000  0.6211525
7183:  0.6211525 1.0000000  0.71084748  0.0000000  0.6211525
7184:  0.6211525 1.0000000 -0.62915254  0.0000000  0.6211525
7185:  0.6211525 1.0000000  0.17084745  0.0000000  0.6211525

hsbdt[ , ses_dev2 := ses - ses_mn2]

      minority female    ses mathach size sector pracad disclim
   1:        0      1 -1.528   5.876  842      0   0.35   1.597
   2:        0      1 -0.588  19.708  842      0   0.35   1.597
   3:        0      0 -0.528  20.349  842      0   0.35   1.597
   4:        0      0 -0.668   8.781  842      0   0.35   1.597
   5:        0      0 -0.158  17.898  842      0   0.35   1.597
  ---                                                          
7181:        0      1  1.512  20.402  262      1   1.00  -2.416
7182:        0      1 -0.038  14.794  262      1   1.00  -2.416
7183:        0      1  1.332  19.641  262      1   1.00  -2.416
7184:        0      1 -0.008  16.241  262      1   1.00  -2.416
7185:        0      1  0.792  22.733  262      1   1.00  -2.416
      himinty schoolid      mean       sd    sdalt       junk
   1:       0     1224  9.715446 7.592785 6.256328  45.710770
   2:       0     1224  9.715446 7.592785 6.256328  49.999413
   3:       0     1224  9.715446 7.592785 6.256328  59.475361
   4:       0     1224  9.715446 7.592785 6.256328  14.868533
   5:       0     1224  9.715446 7.592785 6.256328  27.678404
  ---                                                        
7181:       0     9586 14.863695 6.415999 6.256328  60.295639
7182:       0     9586 14.863695 6.415999 6.256328   4.652761
7183:       0     9586 14.863695 6.415999 6.256328  49.056393
7184:       0     9586 14.863695 6.415999 6.256328  12.988998
7185:       0     9586 14.863695 6.415999 6.256328 101.929749
        sdalt2 num       se     sealt    sealt2       t2    t2alt
   1: 48.39363  47 1.107522 0.9125792 1.0147176 6.958498 8.289523
   2: 48.39363  47 1.107522 0.9125792 1.0147176 6.958498 8.289523
   3: 48.39363  47 1.107522 0.9125792 1.0147176 6.958498 8.289523
   4: 48.39363  47 1.107522 0.9125792 1.0147176 6.958498 8.289523
   5: 48.39363  47 1.107522 0.9125792 1.0147176 6.958498 8.289523
  ---                                                            
7181: 48.39363  59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7182: 48.39363  59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7183: 48.39363  59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7184: 48.39363  59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7185: 48.39363  59 0.835292 0.8145045 0.9056661 7.106490 6.044989
      pickone      mmses      mnses       xb       resid schoolidf
   1:       1 -0.4343830 -0.4343830 10.13759 -4.26158857      1224
   2:       0 -0.4343830 -0.4343830 10.13759  9.57041168      1224
   3:       0 -0.4343830 -0.4343830 10.13759 10.21141243      1224
   4:       0 -0.4343830 -0.4343830 10.13759 -1.35658836      1224
   5:       0 -0.4343830 -0.4343830 10.13759  7.76041222      1224
  ---                                                             
7181:       0  0.6211525  0.6211525 16.32676  4.07523918      9586
7182:       0  0.6211525  0.6211525 16.32676 -1.53276157      9586
7183:       0  0.6211525  0.6211525 16.32676  3.31423950      9586
7184:       0  0.6211525  0.6211525 16.32676 -0.08576202      9586
7185:       0  0.6211525  0.6211525 16.32676  6.40623856      9586
          ses_mn female_mn     ses_dev female_dev    ses_mn2
   1: -0.4343830 0.5957447 -1.09361702  0.4042553 -0.4343830
   2: -0.4343830 0.5957447 -0.15361702  0.4042553 -0.4343830
   3: -0.4343830 0.5957447 -0.09361702 -0.5957447 -0.4343830
   4: -0.4343830 0.5957447 -0.23361701 -0.5957447 -0.4343830
   5: -0.4343830 0.5957447  0.27638297 -0.5957447 -0.4343830
  ---                                                       
7181:  0.6211525 1.0000000  0.89084742  0.0000000  0.6211525
7182:  0.6211525 1.0000000 -0.65915254  0.0000000  0.6211525
7183:  0.6211525 1.0000000  0.71084748  0.0000000  0.6211525
7184:  0.6211525 1.0000000 -0.62915254  0.0000000  0.6211525
7185:  0.6211525 1.0000000  0.17084745  0.0000000  0.6211525
         ses_dev2
   1: -1.09361702
   2: -0.15361702
   3: -0.09361702
   4: -0.23361701
   5:  0.27638297
  ---            
7181:  0.89084742
7182: -0.65915254
7183:  0.71084748
7184: -0.62915254
7185:  0.17084745

hsbnew <- as.data.frame(hsbdt)
head(hsbnew[ , c("schoolidf", "ses", "ses_mn2", "ses_dev2")])

  schoolidf    ses   ses_mn2    ses_dev2
1      1224 -1.528 -0.434383 -1.09361702
2      1224 -0.588 -0.434383 -0.15361702
3      1224 -0.528 -0.434383 -0.09361702
4      1224 -0.668 -0.434383 -0.23361701
5      1224 -0.158 -0.434383  0.27638297
6      1224  0.022 -0.434383  0.45638298

The beauty of the data.table is that memory is used efficiently. The new columns, ses_mn2 and ses_dev2, are added without re-copying the rest of the data. In fact, we could snug these 2 column creations into one command, thus reducing the memory allocation to a single step.

The peril of transform()

The base R offers a function called transform which seems, at least in theory, to offer an avenue to efficiently append the column results onto the original data frame. This one is actually not more efficient, it is simply different syntax. It seems to some to offer the best of both worlds.

xxx <- transform(hsb, ses_mn = xx1$ses_mn, ses_dev = xx1$ses_dev)

What’s the problem? Well, it is not faster, and it is dangerous. There is a stern warning about the transform function. It is among the most clear cautions in all of the R documentation. Run “?transform”, look for “Warning”. The help page says

“This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting arithmetic functions, and in particular the non-standard evaluation of argument ‘transform’ can have unanticipated consequences.”

Speaking sarcastically, except for that warning, the transform function looks like a good way to go at this.

Enter mutate

Hadley Wickham has offered a number of R packages and new idioms. Like data.table, his `plyr package, and its successor dplyr, are intended to make R code both more “elegant” (in the eyes of the author, at least), faster, and less error prone. The plyr package implements an approach that is described as the “split, apply, combine” paradigm. It is summarized in this article:

Wickham, H. (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1 - 29. doi:http://dx.doi.org/10.18637/jss.v040.i01

(https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf)

The plyr package supplies a function called mutate, a replacement for transform.

library(plyr)
hsb2 <- ddply(hsb, "schoolidf", mutate, ses_mn = mean(ses, na.rm = TRUE), ses_dev = ses - ses_mn)
head(hsb2)

  minority female    ses mathach size sector pracad disclim himinty
1        0      1 -1.528   5.876  842      0   0.35   1.597       0
2        0      1 -0.588  19.708  842      0   0.35   1.597       0
3        0      0 -0.528  20.349  842      0   0.35   1.597       0
4        0      0 -0.668   8.781  842      0   0.35   1.597       0
5        0      0 -0.158  17.898  842      0   0.35   1.597       0
6        0      0  0.022   4.583  842      0   0.35   1.597       0
  schoolid     mean       sd    sdalt     junk   sdalt2 num       se
1     1224 9.715446 7.592785 6.256328 45.71077 48.39363  47 1.107522
2     1224 9.715446 7.592785 6.256328 49.99941 48.39363  47 1.107522
3     1224 9.715446 7.592785 6.256328 59.47536 48.39363  47 1.107522
4     1224 9.715446 7.592785 6.256328 14.86853 48.39363  47 1.107522
5     1224 9.715446 7.592785 6.256328 27.67840 48.39363  47 1.107522
6     1224 9.715446 7.592785 6.256328 64.86649 48.39363  47 1.107522
      sealt   sealt2       t2    t2alt pickone     mmses     mnses
1 0.9125792 1.014718 6.958498 8.289523       1 -0.434383 -0.434383
2 0.9125792 1.014718 6.958498 8.289523       0 -0.434383 -0.434383
3 0.9125792 1.014718 6.958498 8.289523       0 -0.434383 -0.434383
4 0.9125792 1.014718 6.958498 8.289523       0 -0.434383 -0.434383
5 0.9125792 1.014718 6.958498 8.289523       0 -0.434383 -0.434383
6 0.9125792 1.014718 6.958498 8.289523       0 -0.434383 -0.434383
        xb     resid schoolidf    ses_mn female_mn     ses_dev
1 10.13759 -4.261589      1224 -0.434383 0.5957447 -1.09361702
2 10.13759  9.570412      1224 -0.434383 0.5957447 -0.15361702
3 10.13759 10.211412      1224 -0.434383 0.5957447 -0.09361702
4 10.13759 -1.356588      1224 -0.434383 0.5957447 -0.23361701
5 10.13759  7.760412      1224 -0.434383 0.5957447  0.27638297
6 10.13759 -5.554588      1224 -0.434383 0.5957447  0.45638298
  female_dev
1  0.4042553
2  0.4042553
3 -0.5957447
4 -0.5957447
5 -0.5957447
6 -0.5957447

If one enjoys the “pipe” notation that is supported by the newer dplyr (which I do not), the same is written as

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

hsb3 <- hsb %>% group_by(schoolidf) %>% mutate(ses_mn = mean(ses, na.rm=TRUE), ses_dev = ses - ses_mn)
hsb3[ , c('schoolidf', 'ses', 'ses_mn', 'ses_dev')]

Source: local data frame [7,185 x 4]
Groups: schoolidf [160]

   schoolidf    ses    ses_mn     ses_dev
      <fctr>  <dbl>     <dbl>       <dbl>
1       1224 -1.528 -0.434383 -1.09361702
2       1224 -0.588 -0.434383 -0.15361702
3       1224 -0.528 -0.434383 -0.09361702
4       1224 -0.668 -0.434383 -0.23361701
5       1224 -0.158 -0.434383  0.27638297
6       1224  0.022 -0.434383  0.45638298
7       1224 -0.618 -0.434383 -0.18361699
8       1224 -0.998 -0.434383 -0.56361705
9       1224 -0.888 -0.434383 -0.45361703
10      1224 -0.458 -0.434383 -0.02361703
# ... with 7,175 more rows

table(hsb3$ses_mn == hsb$ses_mn)


TRUE 
7185

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.10

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] dplyr_0.5.0   knitr_1.15.1  rmarkdown_1.3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9       digest_0.6.12     rprojroot_1.2    
 [4] assertthat_0.1    R6_2.2.0          plyr_1.8.4       
 [7] DBI_0.5-1         backports_1.0.5   magrittr_1.5     
[10] evaluate_0.10     stringi_1.1.2     lazyeval_0.2.0   
[13] data.table_1.10.4 tools_3.3.2       stringr_1.1.0    
[16] yaml_2.1.14       htmltools_0.3.5   tibble_1.2

Available under Created Commons license 3.0 CC BY

Group Means, Group Mean Centered Observations

Multilevel recoding chores #1

Paul Johnson

CRMDA Guide #43. Please visit http://crmda.ku.edu/guides for updates.

Tags: guides, R, multilevel models

2017 March 01