Abstract
The most common recoding chore that comes with multilevel models is creating the group mean and individual deviations within group variables. This describes how that can be done with tools provided in the base of R, as well as in some addon packages.Multilevel modelers might need group-level summary information for various reasons. They often want to change a multilevel regression from this
\[\begin{equation} y_{ji} = \beta_0 + \beta_1 x_{ji} + \epsilon_{ji} \label{eq:eq1} \end{equation}\]to
\[ y_{ji} = \beta_0 + \beta_1 \bar{x}_{j.} + \beta_2 \widetilde{x}_{ji} + \epsilon_{ji} \label{eq:eq2} \]
where \(\bar{x}_{j.}\) is the mean of \(x_{ji}\) within group \(j\) and \(\widetilde{x}_{ji} = x_{ji} -\bar{x}_{j.}\). The variable \(\widetilde{x}_{ji}\) is the “group mean-centered” value of \(x_{ji}\).
Changing the specification from the first equation to the second equation is a way of checking for the model’s specification. If the effect of changing the within group mean is markedly different from changing the value of the individual’s score, then the ecological fallacy might be in evidence. This specification is widely known as “Mundlak’s suggestion” for exploring the effect of a numeric predictor at the contextual and individual levels.
There is an extended discussion of this coding in Snijders and Bosker(). It turns out that the preceeding model is statistically equivalent to this,
\[ y_{ji} = \beta_0 + \beta_3 \bar{x}_{j.} + \beta_4 x_{ji} + \epsilon_{ji} \label{eq:eq3} \]
This model appears to be more simple because it includes the individual variable, \(x_{ji}\), as a predictor, rather than \(\widetilde{x}_{ji}\). However, this fit is statistically equivalent. The fitted regression lines are identical and the difference between the fits is a bit of an illusion. All of the information contained in one is actually present in the other.
\[\begin{align} y_{ji} & = \beta_0 + \beta_1 \bar{x}_{j.} + \beta_2 \widetilde{x}_{ji} + \epsilon_{ji}\\ & = \beta_0 + \beta_1 \bar{x}_{j.} + \beta_2( x_{ji} -\bar{x}_{j.}) + \epsilon_{ji}\\ & = \beta_0 + (\beta_1-\beta_2)\bar{x}_{j.} + \beta_2(x_{ji}) + \epsilon_{ji} \end{align}\]Clearly, the “individual level effect” is the same in both models, \(\beta_2 = \beta_4\). And, equally clearly, the superficial difference we see in the parameter estimates is easily understood as \(\beta_3 = (\beta_1-\beta_2)\).
While these are equivalent models, estimating a model with \(\widetilde{x}_{ji}\), the individual deviations within groups, is considered to be suggestive by some authors. It is a way of explicitly estimationg a “big fish in a small pond” effect, as if to say that individual variations about the group mean are somehow more interesting that individual variations themselves.
In Raudenbush and Bryk (), the variable \(\widetilde{x}_{ji}\) is given a random intercept. That is an entirely different specification, of course.
In any case, the group means are necessary for the analysis and, for some purposes, the “group mean centered” individual data is thought-provoking.
The High School and Beyond data set summarizes mathematical and English language achievement for students. I have instructions about downloading and importing that data on this web page:
http://pj.freefaculty.org/guides/stat/DataSets/HSB/00-README.txt
The following will download a copy of the data if one is not already available in the working directory.
if (file.exists("hsb.rds")){
hsb <- readRDS("hsb.rds")
} else {
library(foreign)
hsb.url <- "http://www.stata-press.com/data/mlmus3/hsb.dta"
hsb <- read.dta(hsb.url)
saveRDS(hsb, "hsb.rds")
}
hsb$schoolidf <- factor(hsb$schoolid)
If a variable is named ses
and we want to calculate the mean for each school, we run
egen ses_mn = mean(ses), by(schoolid)
The deviation within group is easy to calculate after that.
gen ses_dev = ses - ses_mn
In Stata, one is accustomed to the idea that the data frame will be altered by the addition of a new variable.
In the Rabe-Hesketh and Skrondal book, a method is describe to use a for loop to iterate through many variables and create group mean columns for them.
Here we can practice by getting the group means for the variables ses and female.
foreach var of varlist ses female {
egen `var'_mn = mean(`var'), by(schoolid)
}
foreach var of varlist ses female {
gen `var'_dev = `var' - `var'_mn
}
R offers many functions in the base that can produce the group means. The more difficult challenge is to “duplicate” the means back onto the data frame so that we can put them to use.
aggregate
and by
are convenience functions in the R base that are intended to simplify the use of tapply
.
ses_mn1 <- aggregate(hsb[, "ses", drop=FALSE], by = list(schoolid = hsb$schoolidf), mean, na.rm = TRUE)
head(ses_mn1)
schoolid ses
1 1224 -0.43438298
2 1288 0.12159999
3 1296 -0.42550000
4 1308 0.52800000
5 1317 0.34533333
6 1358 -0.01966667
str(ses_mn1)
'data.frame': 160 obs. of 2 variables:
$ schoolid: Factor w/ 160 levels "1224","1288",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ses : num -0.434 0.122 -0.426 0.528 0.345 ...
class(ses_mn1)
[1] "data.frame"
ses_mn2 <- by(hsb$ses, hsb$schoolidf, mean, na.rm = TRUE)
head(ses_mn2)
hsb$schoolidf
1224 1288 1296 1308 1317
-0.43438298 0.12159999 -0.42550000 0.52800000 0.34533333
1358
-0.01966667
str(ses_mn2)
by [1:160(1d)] -0.434 0.122 -0.426 0.528 0.345 ...
- attr(*, "dimnames")=List of 1
..$ hsb$schoolidf: chr [1:160] "1224" "1288" "1296" "1308" ...
- attr(*, "call")= language by.default(data = hsb$ses, INDICES = hsb$schoolidf, FUN = mean, na.rm = TRUE)
class(ses_mn2)
[1] "by"
ses_mn3 <- tapply(hsb$ses, hsb$schoolidf, mean, na.rm = TRUE)
head(ses_mn3)
1224 1288 1296 1308 1317
-0.43438298 0.12159999 -0.42550000 0.52800000 0.34533333
1358
-0.01966667
str(ses_mn3)
num [1:160(1d)] -0.434 0.122 -0.426 0.528 0.345 ...
- attr(*, "dimnames")=List of 1
..$ : chr [1:160] "1224" "1288" "1296" "1308" ...
class(ses_mn3)
[1] "array"
Using any of these methods, we are able to retrieve the means of the groups.
The challenge is to put those means back onto the data frame. It is easy to get this wrong because the results from tapply do not come back in the expected order. Here is a toy example to demonstrate that.
toy <- data.frame(x1 = rnorm(13),
x2 = c(rep("Z", 3), rep("P", 2), rep("A", 3), rep("C", 5)),
stringsAsFactors = FALSE)
toy
x1 x2
1 1.68896279 Z
2 -1.70367563 Z
3 0.05274744 Z
4 -0.38545901 P
5 -0.68566177 P
6 -1.14125006 A
7 -0.43040873 A
8 1.95247009 A
9 0.38286307 C
10 -0.61537067 C
11 0.57555937 C
12 0.02824677 C
13 -1.74853165 C
x1_mn <- tapply(toy$x1, toy$x2, mean)
x1_mn
A C P Z
0.1269371 -0.2754466 -0.5355604 0.0126782
Note that the ordering from top to bottom in the data frame has groups Z, P, A and C, the ordering of tapply is different. tapply creates a factor variable, which, by default, uses the available values in alphabetical order as the levels. Hence, a decision to manufacture a vector by the obvious sort of trick will result in a ghastly error.
To put the data “back onto” the data frame requires the careful use of index subscripts or the merge function.
x1_mn <- aggregate(toy[ , "x1", drop = FALSE], by = list("x2" = toy$x2), mean, na.rm = TRUE)
x1_mn
x2 x1
1 A 0.1269371
2 C -0.2754466
3 P -0.5355604
4 Z 0.0126782
toy2 <- merge(toy, x1_mn, by = "x2", suffix = c("", "_mn"), sort = FALSE)
toy2
x2 x1 x1_mn
1 Z 1.68896279 0.0126782
2 Z -1.70367563 0.0126782
3 Z 0.05274744 0.0126782
4 P -0.38545901 -0.5355604
5 P -0.68566177 -0.5355604
6 A -1.14125006 0.1269371
7 A -0.43040873 0.1269371
8 A 1.95247009 0.1269371
9 C 0.38286307 -0.2754466
10 C -0.61537067 -0.2754466
11 C 0.57555937 -0.2754466
12 C 0.02824677 -0.2754466
13 C -1.74853165 -0.2754466
Variable “mmses” is the mean of ses, within schools, which we verify here Create a vector of variable names for which we need school level means
vars <- c("ses", "female")
ses_mn <- aggregate(hsb[ , vars],
by = list("schoolidf" = hsb$schoolidf), mean, na.rm = TRUE)
hsb2 <- merge(hsb, ses_mn, by = "schoolidf", suffix = c("", "_mn"))
for(i in vars){
hsb2[ , paste0(i, "_dev")] <- hsb2[ , i] - hsb2[ , paste0(i, "_mn")]
}
##' Generate group summaries and individual deviations within groups
##'
##' Similar to Stata egen, except more versatile and fun! Will
##' create a new data frame with 2 columns. Rows match
##' the rows of the original data frame.
##'
##' Now I return only the new columns
##' @param dframe a data frame
##' @param x Variable names or a vector of variable names
##' @param by A grouping variable name or a vector of grouping names
##' @param FUN Defaults to the mean, have not tested alternatives
##' @param suffix The suffixes to be added to column 1 and column 2
##' @return new data frame with "x_mn" and "x_dev" added as variables
##' @author Paul Johnson
##' @examples
##' Suppose you get the MLMUS hsb data frame, somehow
##' xx1 <- gmd(hsb, "ses", "schoolidf")
##' xx2 <- gmd(hsb, c("ses", "female"), "schoolidf")
##' xx3 <- gmd(hsb, c("ses", "female"), c("schoolidf", "sector"))
##' xx4 <- gmd(hsb, c("ses", "female"),
##' c("schoolidf"), FUN = median)
gmd <- function(dframe, x, by, FUN = mean, suffix = c("_mn", "_dev")){
xmean <- aggregate(dframe[ , x, drop = FALSE],
dframe[ , by, drop = FALSE], FUN,
na.rm = TRUE)
df2 <- merge(dframe, xmean, by = by, suffix = c("", suffix[1]))
for(i in x){
df2[ , paste0(i, suffix[2])] <- df2[ , i] - df2[ , paste0(i, suffix[1])]
}
df2[ , colnames(df2)[!colnames(df2) %in% colnames(dframe)]]
}
xx1 <- gmd(hsb, "ses", "schoolidf")
head(xx1)
ses_mn ses_dev
1 -0.434383 -1.09361702
2 -0.434383 -0.15361702
3 -0.434383 -0.09361702
4 -0.434383 -0.23361701
5 -0.434383 0.27638297
6 -0.434383 0.45638298
## Could combine with hsb
## hsb <- cbind(hsb, xx1)
## head(hsb[ , c("schoolidf", "ses", "ses_mn", "ses_dev")])
xx2 <- gmd(hsb, c("ses", "female"), "schoolidf")
head(xx2)
ses_mn female_mn ses_dev female_dev
1 -0.434383 0.5957447 -1.09361702 0.4042553
2 -0.434383 0.5957447 -0.15361702 0.4042553
3 -0.434383 0.5957447 -0.09361702 -0.5957447
4 -0.434383 0.5957447 -0.23361701 -0.5957447
5 -0.434383 0.5957447 0.27638297 -0.5957447
6 -0.434383 0.5957447 0.45638298 -0.5957447
hsb <- cbind(hsb, xx2)
head(hsb[ , c("schoolidf", "ses", "ses_mn", "ses_dev", "female", "female_mn", "female_dev")])
schoolidf ses ses_mn ses_dev female female_mn female_dev
1 1224 -1.528 -0.434383 -1.09361702 1 0.5957447 0.4042553
2 1224 -0.588 -0.434383 -0.15361702 1 0.5957447 0.4042553
3 1224 -0.528 -0.434383 -0.09361702 0 0.5957447 -0.5957447
4 1224 -0.668 -0.434383 -0.23361701 0 0.5957447 -0.5957447
5 1224 -0.158 -0.434383 0.27638297 0 0.5957447 -0.5957447
6 1224 0.022 -0.434383 0.45638298 0 0.5957447 -0.5957447
This function, which uses only R base tools, is not as fast as possible. The data set is is only 7185 rows with 26 columns, not big data by any sense of the imagination.
If we could simply insert a column into a data.frame without causing a full memory re-write of the data.frame within R, it would be faster. The fear of excess data copying is what fuels innovations during the past 5-10 years, such as the packages data.table
plyr
and dplyr
.
All of these packages sit on top of the famous R package “Rcpp”, which is rightly praised as a revolutionary advance in R for high performance problems. Unfortunately, it is also one of the sources of periodic project failures because certain data sets expose bugs (which are fixed eventually, usually quickly).
Simply put, relying on more contributed packages can cause project failures. Recently, we have had 2 projects come to a halt because of little bugs in packages that were relied upon by other packages that we were using. These problems get fixed, eventually, but why complicate the workflow unnecessarily.
When data sets are truly immense, the data.table
package offers the only tools I’ve found to quickly sort, merge, and calculate-by-subgroup. data.table
introduces an entirely new style of R programming, touted by its advocates as more “elegant” and “understandable”. The CRMDA has a data.table
guide available in our collection.
Here is an example of data.table in action.
library(data.table)
hsbdt <- as.data.table(hsb)
setkey(hsbdt, schoolidf)
hsbdt[ , ses_mn2 := mean(ses, na.rm = TRUE), by = schoolidf]
minority female ses mathach size sector pracad disclim
1: 0 1 -1.528 5.876 842 0 0.35 1.597
2: 0 1 -0.588 19.708 842 0 0.35 1.597
3: 0 0 -0.528 20.349 842 0 0.35 1.597
4: 0 0 -0.668 8.781 842 0 0.35 1.597
5: 0 0 -0.158 17.898 842 0 0.35 1.597
---
7181: 0 1 1.512 20.402 262 1 1.00 -2.416
7182: 0 1 -0.038 14.794 262 1 1.00 -2.416
7183: 0 1 1.332 19.641 262 1 1.00 -2.416
7184: 0 1 -0.008 16.241 262 1 1.00 -2.416
7185: 0 1 0.792 22.733 262 1 1.00 -2.416
himinty schoolid mean sd sdalt junk
1: 0 1224 9.715446 7.592785 6.256328 45.710770
2: 0 1224 9.715446 7.592785 6.256328 49.999413
3: 0 1224 9.715446 7.592785 6.256328 59.475361
4: 0 1224 9.715446 7.592785 6.256328 14.868533
5: 0 1224 9.715446 7.592785 6.256328 27.678404
---
7181: 0 9586 14.863695 6.415999 6.256328 60.295639
7182: 0 9586 14.863695 6.415999 6.256328 4.652761
7183: 0 9586 14.863695 6.415999 6.256328 49.056393
7184: 0 9586 14.863695 6.415999 6.256328 12.988998
7185: 0 9586 14.863695 6.415999 6.256328 101.929749
sdalt2 num se sealt sealt2 t2 t2alt
1: 48.39363 47 1.107522 0.9125792 1.0147176 6.958498 8.289523
2: 48.39363 47 1.107522 0.9125792 1.0147176 6.958498 8.289523
3: 48.39363 47 1.107522 0.9125792 1.0147176 6.958498 8.289523
4: 48.39363 47 1.107522 0.9125792 1.0147176 6.958498 8.289523
5: 48.39363 47 1.107522 0.9125792 1.0147176 6.958498 8.289523
---
7181: 48.39363 59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7182: 48.39363 59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7183: 48.39363 59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7184: 48.39363 59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7185: 48.39363 59 0.835292 0.8145045 0.9056661 7.106490 6.044989
pickone mmses mnses xb resid schoolidf
1: 1 -0.4343830 -0.4343830 10.13759 -4.26158857 1224
2: 0 -0.4343830 -0.4343830 10.13759 9.57041168 1224
3: 0 -0.4343830 -0.4343830 10.13759 10.21141243 1224
4: 0 -0.4343830 -0.4343830 10.13759 -1.35658836 1224
5: 0 -0.4343830 -0.4343830 10.13759 7.76041222 1224
---
7181: 0 0.6211525 0.6211525 16.32676 4.07523918 9586
7182: 0 0.6211525 0.6211525 16.32676 -1.53276157 9586
7183: 0 0.6211525 0.6211525 16.32676 3.31423950 9586
7184: 0 0.6211525 0.6211525 16.32676 -0.08576202 9586
7185: 0 0.6211525 0.6211525 16.32676 6.40623856 9586
ses_mn female_mn ses_dev female_dev ses_mn2
1: -0.4343830 0.5957447 -1.09361702 0.4042553 -0.4343830
2: -0.4343830 0.5957447 -0.15361702 0.4042553 -0.4343830
3: -0.4343830 0.5957447 -0.09361702 -0.5957447 -0.4343830
4: -0.4343830 0.5957447 -0.23361701 -0.5957447 -0.4343830
5: -0.4343830 0.5957447 0.27638297 -0.5957447 -0.4343830
---
7181: 0.6211525 1.0000000 0.89084742 0.0000000 0.6211525
7182: 0.6211525 1.0000000 -0.65915254 0.0000000 0.6211525
7183: 0.6211525 1.0000000 0.71084748 0.0000000 0.6211525
7184: 0.6211525 1.0000000 -0.62915254 0.0000000 0.6211525
7185: 0.6211525 1.0000000 0.17084745 0.0000000 0.6211525
hsbdt[ , ses_dev2 := ses - ses_mn2]
minority female ses mathach size sector pracad disclim
1: 0 1 -1.528 5.876 842 0 0.35 1.597
2: 0 1 -0.588 19.708 842 0 0.35 1.597
3: 0 0 -0.528 20.349 842 0 0.35 1.597
4: 0 0 -0.668 8.781 842 0 0.35 1.597
5: 0 0 -0.158 17.898 842 0 0.35 1.597
---
7181: 0 1 1.512 20.402 262 1 1.00 -2.416
7182: 0 1 -0.038 14.794 262 1 1.00 -2.416
7183: 0 1 1.332 19.641 262 1 1.00 -2.416
7184: 0 1 -0.008 16.241 262 1 1.00 -2.416
7185: 0 1 0.792 22.733 262 1 1.00 -2.416
himinty schoolid mean sd sdalt junk
1: 0 1224 9.715446 7.592785 6.256328 45.710770
2: 0 1224 9.715446 7.592785 6.256328 49.999413
3: 0 1224 9.715446 7.592785 6.256328 59.475361
4: 0 1224 9.715446 7.592785 6.256328 14.868533
5: 0 1224 9.715446 7.592785 6.256328 27.678404
---
7181: 0 9586 14.863695 6.415999 6.256328 60.295639
7182: 0 9586 14.863695 6.415999 6.256328 4.652761
7183: 0 9586 14.863695 6.415999 6.256328 49.056393
7184: 0 9586 14.863695 6.415999 6.256328 12.988998
7185: 0 9586 14.863695 6.415999 6.256328 101.929749
sdalt2 num se sealt sealt2 t2 t2alt
1: 48.39363 47 1.107522 0.9125792 1.0147176 6.958498 8.289523
2: 48.39363 47 1.107522 0.9125792 1.0147176 6.958498 8.289523
3: 48.39363 47 1.107522 0.9125792 1.0147176 6.958498 8.289523
4: 48.39363 47 1.107522 0.9125792 1.0147176 6.958498 8.289523
5: 48.39363 47 1.107522 0.9125792 1.0147176 6.958498 8.289523
---
7181: 48.39363 59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7182: 48.39363 59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7183: 48.39363 59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7184: 48.39363 59 0.835292 0.8145045 0.9056661 7.106490 6.044989
7185: 48.39363 59 0.835292 0.8145045 0.9056661 7.106490 6.044989
pickone mmses mnses xb resid schoolidf
1: 1 -0.4343830 -0.4343830 10.13759 -4.26158857 1224
2: 0 -0.4343830 -0.4343830 10.13759 9.57041168 1224
3: 0 -0.4343830 -0.4343830 10.13759 10.21141243 1224
4: 0 -0.4343830 -0.4343830 10.13759 -1.35658836 1224
5: 0 -0.4343830 -0.4343830 10.13759 7.76041222 1224
---
7181: 0 0.6211525 0.6211525 16.32676 4.07523918 9586
7182: 0 0.6211525 0.6211525 16.32676 -1.53276157 9586
7183: 0 0.6211525 0.6211525 16.32676 3.31423950 9586
7184: 0 0.6211525 0.6211525 16.32676 -0.08576202 9586
7185: 0 0.6211525 0.6211525 16.32676 6.40623856 9586
ses_mn female_mn ses_dev female_dev ses_mn2
1: -0.4343830 0.5957447 -1.09361702 0.4042553 -0.4343830
2: -0.4343830 0.5957447 -0.15361702 0.4042553 -0.4343830
3: -0.4343830 0.5957447 -0.09361702 -0.5957447 -0.4343830
4: -0.4343830 0.5957447 -0.23361701 -0.5957447 -0.4343830
5: -0.4343830 0.5957447 0.27638297 -0.5957447 -0.4343830
---
7181: 0.6211525 1.0000000 0.89084742 0.0000000 0.6211525
7182: 0.6211525 1.0000000 -0.65915254 0.0000000 0.6211525
7183: 0.6211525 1.0000000 0.71084748 0.0000000 0.6211525
7184: 0.6211525 1.0000000 -0.62915254 0.0000000 0.6211525
7185: 0.6211525 1.0000000 0.17084745 0.0000000 0.6211525
ses_dev2
1: -1.09361702
2: -0.15361702
3: -0.09361702
4: -0.23361701
5: 0.27638297
---
7181: 0.89084742
7182: -0.65915254
7183: 0.71084748
7184: -0.62915254
7185: 0.17084745
hsbnew <- as.data.frame(hsbdt)
head(hsbnew[ , c("schoolidf", "ses", "ses_mn2", "ses_dev2")])
schoolidf ses ses_mn2 ses_dev2
1 1224 -1.528 -0.434383 -1.09361702
2 1224 -0.588 -0.434383 -0.15361702
3 1224 -0.528 -0.434383 -0.09361702
4 1224 -0.668 -0.434383 -0.23361701
5 1224 -0.158 -0.434383 0.27638297
6 1224 0.022 -0.434383 0.45638298
The beauty of the data.table is that memory is used efficiently. The new columns, ses_mn2 and ses_dev2, are added without re-copying the rest of the data. In fact, we could snug these 2 column creations into one command, thus reducing the memory allocation to a single step.
The base R offers a function called transform
which seems, at least in theory, to offer an avenue to efficiently append the column results onto the original data frame. This one is actually not more efficient, it is simply different syntax. It seems to some to offer the best of both worlds.
xxx <- transform(hsb, ses_mn = xx1$ses_mn, ses_dev = xx1$ses_dev)
What’s the problem? Well, it is not faster, and it is dangerous. There is a stern warning about the transform function. It is among the most clear cautions in all of the R documentation. Run “?transform”, look for “Warning”. The help page says
“This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting arithmetic functions, and in particular the non-standard evaluation of argument ‘transform’ can have unanticipated consequences.”
Speaking sarcastically, except for that warning, the transform function looks like a good way to go at this.
Hadley Wickham has offered a number of R packages and new idioms. Like data.table
, his `plyr
package, and its successor dplyr
, are intended to make R code both more “elegant” (in the eyes of the author, at least), faster, and less error prone. The plyr
package implements an approach that is described as the “split, apply, combine” paradigm. It is summarized in this article:
Wickham, H. (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1 - 29. doi:http://dx.doi.org/10.18637/jss.v040.i01
(https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf)
The plyr package supplies a function called mutate
, a replacement for transform.
library(plyr)
hsb2 <- ddply(hsb, "schoolidf", mutate, ses_mn = mean(ses, na.rm = TRUE), ses_dev = ses - ses_mn)
head(hsb2)
minority female ses mathach size sector pracad disclim himinty
1 0 1 -1.528 5.876 842 0 0.35 1.597 0
2 0 1 -0.588 19.708 842 0 0.35 1.597 0
3 0 0 -0.528 20.349 842 0 0.35 1.597 0
4 0 0 -0.668 8.781 842 0 0.35 1.597 0
5 0 0 -0.158 17.898 842 0 0.35 1.597 0
6 0 0 0.022 4.583 842 0 0.35 1.597 0
schoolid mean sd sdalt junk sdalt2 num se
1 1224 9.715446 7.592785 6.256328 45.71077 48.39363 47 1.107522
2 1224 9.715446 7.592785 6.256328 49.99941 48.39363 47 1.107522
3 1224 9.715446 7.592785 6.256328 59.47536 48.39363 47 1.107522
4 1224 9.715446 7.592785 6.256328 14.86853 48.39363 47 1.107522
5 1224 9.715446 7.592785 6.256328 27.67840 48.39363 47 1.107522
6 1224 9.715446 7.592785 6.256328 64.86649 48.39363 47 1.107522
sealt sealt2 t2 t2alt pickone mmses mnses
1 0.9125792 1.014718 6.958498 8.289523 1 -0.434383 -0.434383
2 0.9125792 1.014718 6.958498 8.289523 0 -0.434383 -0.434383
3 0.9125792 1.014718 6.958498 8.289523 0 -0.434383 -0.434383
4 0.9125792 1.014718 6.958498 8.289523 0 -0.434383 -0.434383
5 0.9125792 1.014718 6.958498 8.289523 0 -0.434383 -0.434383
6 0.9125792 1.014718 6.958498 8.289523 0 -0.434383 -0.434383
xb resid schoolidf ses_mn female_mn ses_dev
1 10.13759 -4.261589 1224 -0.434383 0.5957447 -1.09361702
2 10.13759 9.570412 1224 -0.434383 0.5957447 -0.15361702
3 10.13759 10.211412 1224 -0.434383 0.5957447 -0.09361702
4 10.13759 -1.356588 1224 -0.434383 0.5957447 -0.23361701
5 10.13759 7.760412 1224 -0.434383 0.5957447 0.27638297
6 10.13759 -5.554588 1224 -0.434383 0.5957447 0.45638298
female_dev
1 0.4042553
2 0.4042553
3 -0.5957447
4 -0.5957447
5 -0.5957447
6 -0.5957447
If one enjoys the “pipe” notation that is supported by the newer dplyr (which I do not), the same is written as
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
hsb3 <- hsb %>% group_by(schoolidf) %>% mutate(ses_mn = mean(ses, na.rm=TRUE), ses_dev = ses - ses_mn)
hsb3[ , c('schoolidf', 'ses', 'ses_mn', 'ses_dev')]
Source: local data frame [7,185 x 4]
Groups: schoolidf [160]
schoolidf ses ses_mn ses_dev
<fctr> <dbl> <dbl> <dbl>
1 1224 -1.528 -0.434383 -1.09361702
2 1224 -0.588 -0.434383 -0.15361702
3 1224 -0.528 -0.434383 -0.09361702
4 1224 -0.668 -0.434383 -0.23361701
5 1224 -0.158 -0.434383 0.27638297
6 1224 0.022 -0.434383 0.45638298
7 1224 -0.618 -0.434383 -0.18361699
8 1224 -0.998 -0.434383 -0.56361705
9 1224 -0.888 -0.434383 -0.45361703
10 1224 -0.458 -0.434383 -0.02361703
# ... with 7,175 more rows
table(hsb3$ses_mn == hsb$ses_mn)
TRUE
7185
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.10
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] dplyr_0.5.0 knitr_1.15.1 rmarkdown_1.3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 digest_0.6.12 rprojroot_1.2
[4] assertthat_0.1 R6_2.2.0 plyr_1.8.4
[7] DBI_0.5-1 backports_1.0.5 magrittr_1.5
[10] evaluate_0.10 stringi_1.1.2 lazyeval_0.2.0
[13] data.table_1.10.4 tools_3.3.2 stringr_1.1.0
[16] yaml_2.1.14 htmltools_0.3.5 tibble_1.2
Available under Created Commons license 3.0