---
title: "Data Table Guide"
subtitle: "Basic Usage and Advanced Topics"
author:
- name: Jeremy Burnison
affiliation: CRMDA, KU
email: jburnison@ku.edu
- name: Paul Johnson
affiliation: CRMDA, KU
email: pauljohn@ku.edu
advertise: >
Please visit [http://crmda.ku.edu/guides](http://crmda.ku.edu/guides)
keywords: guide, data.table, SQL
abstract: >
This is an introduction to the data.table package in R.
data.table is a high performance replacement for R's built-in
data.frame. There are significant differences in the syntax for
usage. Proponents of data.table prefer its simpler notation and ease
of use. While the style differences are noteworthy, for most who
consider adopting data.table, the major advantage of data.table over
the standard data.frame format is speed in processing. This makes
data.table desirable for 'big data' analysis. While data.table is an
alternative to data.frame, it is intended to be compatible with most
functions that expect to receive a data.frame object. That is to
say, data.frame-centric code will work with a data.table, but it will
run faster! Included in this tutorial are examples of indexing and
aggregating data variables using the data.table syntax. The first
example dataset used is not large so that you can see what happens to
the data with the syntax used.
checked_by: "Paul Johnson"
Note to Authors: please_dont_change_the_next 4 lines!
date: "`r format(Sys.time(), '%b. %e %Y')`"
output:
stationery::crmda_html_document:
toc: true
toc_depth: 2
highlight: haddock
theme: default
css: theme/kutils.css
template: theme/guide-boilerplate.html
guide_number: 40
logoleft: theme/logoleft.png
logoright: theme/logo-vert.png
---
```{r setup, include=FALSE}
##This Invisible Chunk is required in all CRMDA documents
outdir <- paste0("tmpout")
if (!file.exists(outdir)) dir.create(outdir, recursive = TRUE)
knitr::opts_chunk$set(echo=TRUE, comment=NA, fig.path=paste0(outdir, "/p-"))
opts.orig <- options()
par.orig <- par(no.readonly=TRUE)
options(width = 70)
```
```{r themecopy, include = FALSE}
library(stationery)
## If theme directory does not have required images or TeX files
## we need to retrieve them and put them in "theme" directory.
logos <- c(logoleft = "logoleft.png",
logoright = "logo-vert.png")
getFiles(logos, pkg = "crmda")
```
## Install and Load the Data.Table package in R
The data.table package is available on CRAN, the
default package repository. Installation can be accomplished with
`install.packages("data.table")`.
Load the data.table package into a session, just like any
other package. The data.table package prints out a
brief advertisement, as we will see.
```{r load_package}
library(data.table)
```
For replication purposes, we will set the random seed as
```{r seed}
set.seed(2345321)
```
The R session information is included the end of this guide.
## Syntax: Pooltable Example
This section will introduce the syntax used to generate a new
data.table, new variables, indexing the dataset, and creating
summaries of subsets and the whole dataset.
### Generate a data.table
Here we will create a dataset describing a set of billiard balls. The balls are
numbered 1 through 15, the weight is generated by a random normal distribution,
and the diameter is set by discrete values. Here we only define 3 values for the
diameter but recycling will fill the data.table. By defining number and weight
with 15 values we have set the length of the data.table. Defining another
variable in instantiation of a data table with fewer than this 15 length will
signal to data.table to recycle these numbers until 15 values are entered for
that variable. You can see these three numbers repeated 5 times to fill up the
diameter variable with 15 values. Try setting number to 1:16 or another range.
The syntax to create a data.table is identical to the syntax that
would be required for a data.frame, except that we replace
`data.frame` with `data.table`.
```{r make_pb_dt}
poolballs <- data.table(
number = 1:15,
weight = rnorm(15, 45.7, 0.8),
diameter = c(3, 2.9, 3.1) #shows recyling
)
poolballs
```
For comparison with the data.table version of the billiard ball
data, we'll make a data.frame that has the same information.
This is called `pooldf` for clarity.
```{r make_pb_df}
set.seed(2345321)
pooldf <- data.frame(
number = 1:15,
weight = rnorm(15, 45.7, 0.8),
diameter = c(3, 2.9, 3.1)
)
pooldf
```
We could achieve the same effect by using the function `as.data.frame`
supplied by the data.table package.
Note that the data.table object seems to have column names and row
names in the same way that a data.frame would:
```{r colnam}
colnames(poolballs)
rownames(poolballs)
```
The data.table *seems to* have column and row names as usual, but they
cannot be used in the ways we expect. This is surely one of the most
interesting (and frustrating) differences between data.tables and
data.frames. This difference, which we will illustrate with examples,
is a major source of confusion among new data.table users.
### Data Table indexing
Our poolballs object is a data.table. Using the syntax
`poolballs[i, j, by = k]` we can define what rows -- `i` -- and
columns -- `j` -- we want from our data. The third argument has many
uses. Here, we have a third argument `by`, which signals that we want
to group the results according to a variable `k`. There are many
other keywords we can use in that third position, such as `mult` and
`with`.
#### Grabbing specific rows {.bs-callout .bs-callout-orange}
With the data.table, one can ask for rows by row number. We
can retrieve rows 1 through 3 in the expected way
```{r grab_rows1}
poolballs[1:3,]
```
We can achieve the same effect by choosing rows by a logical
condition. In this case, we want the rows for which the value of
the variable `number` is less than 4. The following is equivalent.
```{r grab_rows2}
poolballs[number < 4,]
```
Users who are familiar with data.frames will be surprised that the
column name `number` does not need to be quoted. The data.frame
equivalent request would be rather more complicated in appearance:
```{r}
pooldf[pooldf$number < 4, ]
```
These examples indicate that the data.table's row-choosing square
brackets allow either integer row indices or an expression to define
a logical condition. The syntax for the data.frame alternative is
either more clear or more tedious, depending on your taste. The fact
the data.table `poolballs` is able to find the column `number` within
itself is a significant syntactical difference.
Choosing elements by rowname causes an error in a data.table. Recall
that the rownames were retrieved above, they are "1" through "15", as
characters.
```{r, eval=F}
> poolballs["1", ]
```
```
Error in `[.data.table`(poolballs, "1", ) :
When i is a data.table (or character vector),
x must be keyed (i.e. sorted, and, marked as sorted) so
data.table knows which columns to join to and take advantage
of x being sorted. Call setkey(x,...) first, see ?setkey.
```
#### Grabbing specific variables (columns) {.bs-callout .bs-callout-blue}
One of the most notable differences between a data.table and a
data.frame is that the column names must be used without quotation
marks in a data.table. In a data.frame object named `pooldf` we would
choose the column named "weight" by
```{r, eval=F}
pooldf[ , "weight"]
```
which is equivalent to
```{r, eval=F}
pooldf$weight
```
The data.table class will allow the following. Look, *no quotes*!:
```{r}
poolballs[ , weight]
```
There is also a different style to extract a colletion of columns from
a data.table object. In the case of a data.frame, we choose two
columns by name in this way
```{r, eval=F}
pooldf[ , c("weight", "diameter")]
```
while the data.table does not expect either a vector or quoted strings
as arguments. The data.table expects a list.
```{r, eval=F}
poolballs[ , list(weight, diameter)]
```
or the shorthand `.(` for the list. This syntax is adopted
from the package dplyr by Hadley Wickham.
```{r}
poolballs[ , .(weight, diameter)]
```
Grabing columns by numbered index wors as it would with data.frame,
but only in recent version of data.table.
```{r grab_cols1}
poolballs[, 1]
```
Data frames and data.tables allow access by column number, but there are good
practical arguments against that kind of coding. Fewer unexpected results
happen if we access the columns by their full name. Using variable names instead
of column index numbers encourages flexible code development and reduces the
number of user errors (i.e. when importing several datasets and one of the
columns is not in the correct order a column name call will return the correct
data while calling column 5 may return errors or include data from the incorrect
variable).
The first column is named "number" in this example.
```{r grab_cols2}
poolballs[, number]
```
The columns in a data.table can be renamed with the `setnames`
function from the data.table package. If `DT` is a data.table, we could
change the names of the columns `V1`, `V2`, and `V3` as follows.
```{r, eval=F}
setnames(DT,c("V1","V2","V3"), c("name1", "name2", "name3"))
```
Using data.frame DF changing columns names would required the following syntax
```{r, eval=F}
colnames(DT) <- c("name1", "name2", "name3")
```
This leaves the programmer or code reader blind to the previous colnames that
are being overwritten. The data.frame approach rewrites the entire data.frame
with the new colnames to memory. This highlights the major efficiency with
data.table and it's conservative use of memory. We'll explore the processing speed differences between these two structures later in this guide.
### Generating new variables within the data.table
In an ordinary R data.frame, adding a variable causes some seemingly
unnecessary re-calculations. Adding a column causes a complete
rewrite of all of the data.frame's columns. It seems more efficient
to leave the old columns where they were and simply add a new column.
This seemingly obvious efficiency was one of the motivating ideas
behind the data.table project in the first place.
#### New assignment symbol ":=" {.bs-callout .bs-callout-red}
The syntax of data.table allows the assignment of one new
variable in `DT` with
```{r, eval=F}
DT[ , Xnew := any_function(x)]
```
The symbol `:=` indicates we want to create a variable within `DT`.
`any_function` represents any `R` function acting on a variable `x`, that
already exists in `DT`. The new variable name `Xnew` must be a valid `R` column
name. Here we put a comma inside the square brackets for clarity, `[ ,` , to
remind ourselves that we are creating this new variable for all rows. Then
define the variable name and use `:=` to define the variable's value based on a
calculation of other `DT` variables. We can manufacture several variables at
once using a list syntax that will be illustrated below.
#### Example creating a new variable
Here we will define a new categorical variable in `poolballs` data.table
called `stripes`, which indicates if a certain ball is a `solid` or
`stripe`. We assign the values with the operator `:=`.
```{r new_stripes, results='hide'}
poolballs[ , stripes := ifelse(number%%2, 'solid', 'stripe')]
head(poolballs)
```
We used the `R` base `ifelse` function that lets us check a logical
condition and assign one of two values if our first argument is true
or false. The argument we use is modulo(*%%*) 2 of the `number` variable. So
every other ball will be set as a solid or strip as is typical in
billiard ball design.
It is possible to do any R computation
We'll use the `head` function to look at just the first 5 rows.
```{r new_density, results ='hide' }
poolballs[ , density := weight/(1.333*pi*diameter^3)]
```
```{r}
head(poolballs,5)
```
You can generate multiple new variables in one line by using
`list()` or the shorthand `.()`. You may noticed how often shorthand
is used along with data.table as those building and using this package
tend to write code as breif as possible. Below we use `:=` to
assign multiple variables in one line and add them to the dataset.
```{r, results = 'hide'}
poolballs[ , `:=` (Id = sample(1:250, size=15), pickles = 'whatever')]
```
```{r}
head(poolballs,5)
```
Assigning new variables in data.frames requires the `$` and assignment object
`<-`. i.e. `DF$NewVarName <- DF$weight/(1.333*pi*DF$diameter^3)` This isn't much
more work than data.table but you don't have the convenience of the `by =`
statement to create calculations based on a grouping variable. You may likely
have to include a `for()` loop to get a similar result. See below for examples
of the advantages of the `by =` statement.
#### Using ":=" to removie variables from a data.table {.bs-callout .bs-callout-green}
Unwanted or erroneous variables can be eliminated from the data by using `NULL`.
We assign this value to the variable names in the **j** expression
```{r}
poolballs[ , c("Id", "pickles") := NULL]
```
This line simply removes the data from data.table memory. Doing the equivalent
with data.frame would re-write the entire object to memory, sans the deleted
variables. Here again is an example of efficient operation that is going on
behind the scenes with data.table.
### Sorting data
`setkey` is a function that will arrange the data based on the values of
your sorting variable.
```{r}
setkey(poolballs, diameter)
poolballs
```
The setkey also allows a *lexicographic* sort according to several
varaibles. Here the data will first be sorted by `stripes` and then by
`weight`. After using setkey the data.table printout reflects the sort.
```{r sort1}
setkey(poolballs, stripes, weight)
poolballs
```
#### Aggregating variables with "by =" {.bs-callout .bs-callout-gray}
We start with a simple benefit of the data.table design. It is
very easy to calculate the average weight of pool balls within
subgroups. This works best if we set the key as `stripes`
```{r sort2}
setkey(poolballs, stripes)
```
And then we will ask for the average weight of pool balls within
the different values of stripes:
```{r rows_weight1}
poolballs[ , mean(weight), by = stripes]
```
Note that the newly created data.table object has one row
for solids and one row for stripes, while the newly created
average column is simply labeled `V1`. We are allowed to
name that column if we wrap the calculation in a list, as
in
```{r rows_weight2}
poolballs[ , list(weight.mean = mean(weight)), by = stripes]
```
The aggregation `by` stripes can be combined with a row-selection
of poolballs. For example, we want the mean of balls that are
numbered one through four, this will work:
```{r rows_weight4}
poolballs[number < 5 , .(weight.mean = mean(weight)), by = stripes]
```
Again we have used the shorthand `.(` instead of `list()`.
Many who have worked with R data frames have used functions like
`tapply`, `by`, and `aggregate` to achieve the same purpose. All of
these will summarize information about subgroups in a data set. The
output from these things is slightly different, but it is mostly
understandable.
```{r}
pooldf$stripes <- ifelse(pooldf$number %% 2, 'solid', 'stripe')
tapply(pooldf$weight, list(pooldf$stripes), mean)
```
or
```{r}
aggregate(pooldf$weight, by = list(diameter = pooldf$stripes), mean)
```
The return objects are structured differently, but the numerical
values are the same.
It may be a little tricky if one wants
to put the group means back "onto" the original data frame.
It is easy with data.table, however, if we use `:=`.
```{r rows_weight10}
poolballs[ , weight.mean := mean(weight), by = stripes]
```
From there, it is easy to calculate "deviations about the mean within
groups".
```{r rowsweight11}
poolballs[ , weight.dev := weight - weight.mean]
```
```{r}
head(poolballs)
```
### More on using `i`, `j` - expressions and `by = k`
Here is more detail on the data.table syntax. There are special
symbolic variables that are used to simplify calculations.
1. `.N` is a symbol referring to the number of rows within
each value of the grouping variable
2. `.SD` is a symbol used to refer to all of the columns.
We will explore these in this section.
#### Accessing blocks of cases by row index {.bs-callout .bs-callout-orange}
You may want various statistics or calculations of your data but only
of certain subsets. You can use an `i` expression to define rows you
want. In addition, a grouping variable `by` can combine rows. If you
have several different groups of rows you'd like to do calculations
on, using a `j` expression is useful.
These useages generally create new data tables, including, for
example, the average values of the variables for each subgroup. In
these cases, we can include row selection in the `i-expression`,
column selection with the `.(j-expression)`, as well as
a `k-expression` that represents a grouping variable.
#### Example with stipes and solids {.bs-callout .bs-callout-blue}
Here is an example of finding the total weight of stripes and solid
balls. We will define the total weight variable as `tot_weight` but
notice this just defines the var in the temporary output and doesn't
define a new permanent variable in `poolballs`. For defining a
permanent addition to `poolballs` we'd use `:=`.
```{r total_weight}
poolballs[ , .(weight_tot = sum(weight)), by = stripes]
```
What about the average weight of a solid or stripe ball? We can use a
handy trick: the `.N` value, which returns the number of rows in each
grouping.
```{r mean_calc}
poolballs[ , .(weight_avg = sum(weight)/.N), by = stripes]
```
Verify that is the same as R's built in mean function that was
used above.
By putting multiple variables, `stripes` and `diameter` in the `by`
statement, you create groups that match up in every combination of
unique `stripes` and `diameter` values.
```{r mean_weight}
poolballs[ , .(weight.mean = mean(weight)), by = .(stripes,diameter)]
```
The `setkey` function makes some short hand possible for row
extraction. Suppose we set the sorting key as `stripes`.
```{r value_index1}
setkey(poolballs, stripes)
```
After that, then when we specify a value in the `i` argument, then that value is
matched against the key variable. Note it is not necessary to specify that the
value "solid" is going to be found in the variable named `stripes` in the
following.
```{r value_index2}
poolballs['solid', ]
```
The output includes all of the balls. The data.table package includes
a keyword that can be used for the `k` argument, `mult`. That is set
as "all" by default, but it can be set as "first" or "last" as well:
```{r value_index3}
poolballs['solid', mult = 'first']
```
```{r value_index4}
poolballs['solid', mult = 'last']
```
### Calculations for the entire data set: .SD
There may be instances where you'd like to compute something using an
entire subset of the data.table instead of just one or two
variables. When using `by = ` the `.SD` reference will supply all
variables besides the one in the `by` statement. Below we return the
variance for all columns of poolballs by stripes grouping. This is
handy for passing the data to a function.
```{r}
poolballs[, lapply(.SD, var), by = stripes]
```
You may want this calculation for all but a few variables in a
dataset. You can define the scope of `.SD` by using `.SDcols` after the `by`
statement. Then you may have the option of using the whole dataset but
could allow a user to define which columns to include.
```{r}
poolballs[ , lapply(.SD, var), by = stripes, .SDcols = c("weight", "diameter")]
```
## Speed DF vs. DT {.tabset .tabset-fade}
This section is partially adapted from the data.table
vignette that is available with the package. Run `vignette("datatable-intro-vignette")`.
#### Setting up a Huge data example {.bs-callout .bs-callout-red}
First we will create some *hyyuuuuge* datasets by repeating the
`poolball` dataset along with some new variables: `manufacturers` and
`price`. The original data and `manufacturers` will be recycled until
we've repeated enough poolballs sets to meet the number of prices
we've just generated, **15e6**
```{r, echo=TRUE, include = TRUE, results='hide', eval=TRUE}
DF <- data.frame(rep(poolballs),
manufacturers=c('James', 'Smith', 'HardBalls', 'Billiard Buddies'),
price = round(rnorm(15e6, 1.65, 2.06), 2))
DT <- data.table(DF)
```
After copying our `DF` into a data.table, `DT`, we can now compare the
timing of some operations with these two identical sets of
data. Defining your dataset as a data.table or data.frame should take
equivalent time.
### Selecting a subset
For a data.table we first need to use `setkey()` to sort or order the
data. This takes a little time but will be well worth the speed
increase in subsequent calculations. Below data will first be sorted
by `manufacturers` and then sorted by `price` second. For these sorted
columns, we can pass a list of values that to get the rows we want.
```{r subset_DT}
system.time(setkey(DT, manufacturers, price)) #initial sorting needed for DT
```
```{r}
system.time(ans1 <- DT[.("Smith",1.69)])
```
This subset assignment happens almost instantly. See how much slower `DF` would be
doing the same thing. A few subset selections and it would take almost as long
as doing the multi variable sorting.
```{r subset_DF}
system.time(ans2 <- DF[DF$manufacturers == c("Smith") & DF$price==1.69,])
```
Note the differences in elapsed time. Let's check that we got the
identical rows both ways (`tail` shows us the last rows, here I've
specified the last *3*)
```{r}
tail(ans1,3)
tail(ans2,3)
```
### Computing a summary of data
Here we find the average price of a poolball by manufacturer.
Using *data.table* we get `ans3` pretty quickly
```{r dt_ballprice}
system.time(ans3 <- DT[,.(manu_mean = mean(price)), by = manufacturers])
ans3
```
Using the *data.frame* is much more cumbersome so we will just
compute the mean for two manufacturers. Again, note the difference in
elapsed time.
```{r man_mean_DF}
system.time(Smithmean <- mean(DF[DF$manufacturers == "Smith",]$price),
Jamesmean <- mean(DF[DF$manufacturers == "James",]$price))
```
Now we can just check they are the same as above.
```{r, results = 'hide'}
Jamesmean
Smithmean
```
Using aggregate we can get the same results from one line with the
DF but the results come at a big computing cost.
```{r, results = 'hold'}
system.time(ans4 <- aggregate(DF$price, by = list(Manu_mean = DF$manufacturers), mean))
ans4
```
Using `tapply` with Data.Frame's is much more efficient.
```{r, results='hold'}
system.time(ans5 <- tapply(DF$price, list(Manu_mean = DF$manufacturers), mean))
ans5
```
### Temporary values
You can execute multiple calculations in one j-expression using `{ }`. Here we can
even create a temporary variable, `cost`, and compute another value using the
temporary variable. Here we find the profit by manufacturer and stripes,
assuming the cost is 3 cents per gram weight. Notice that subsequent calculations
should be separated with a `;` not a `,`. This approach will make more
complex calculations easier to read and take much less time.
```{r}
system.time(DT[ , {cost = weight*0.03; profit = mean(price-cost)},
by = .(stripes, manufacturers)])
```
```{r, echo = FALSE}
DT[ , {cost = weight*0.03; mean(price-cost)},
by = .(stripes, manufacturers)]
```
This specific example can be be done in one line with *data.frame*,
using `tapply`. If you were making multiple processing steps, using multiple
functions tapply may not work or you'd have to start imbedding them and it
would get messy. But for this example, if you're a fan of data.frames, `$` and
`tapply` then you can do what data.table does too.
```{r}
system.time(
tapply((DF$price-DF$weight*0.03), list(DF$stripes,DF$manufacturers),
mean))
```
## Chaining
Another way to accomplish a whole lot in one line is by
'Chaining'. This method utilizes the fact that every `DT[ ]` call
returns the data or subset of data defined by the `i`, `j` and `k`
expressions. With this new subset you may again call new `i`, `j` and `k`
expressions but adding another `[i,j,k=by ]` after the first
`DT[i,j,k=by]`.
The example below attempts to find the most expensive average cost poolball per
manufacturer.
```{r chaining}
setkey(DT,number, stripes, manufacturers)
```
```{r chaining2, eval = FALSE}
DT[,.(meancost = mean(price-weight*0.03)), by=.(stripes, manufacturers)][ ,
.(number, stripes,expensive = max(meancost)),by = manufacturers]
```
***This returns an error***
We attempted to get the `number` of the ball that was most expensive
but because we are looking for the `number` variable in the second
`[ ]` it isn't there. In the second `[ ]` we are working with a subset of
the original data.table, generated by the first `[ ]`, which has
excluded `number` and only includes `stripes`, `manufactureres` and
`meancost`. We get the most expensive ball for each
manufacturer/stripes below.
```{r}
DT[,.(meancost = mean(price-weight*0.03)), by=.(manufacturers,stripes)][ ,
.(expensive = max(meancost)), by = manufacturers]
```
## Conclusions
Data.table allows for some very intuitive indexing and subset
calculations that take far less time than data frame on very large
datasets. The approaches shown here will aid in flexible, easier to
read programming that can be accomplished in fewer lines of code. The
same calculations using data.frame will often take much
longer. Functions like sorting and summarizing data are done much
faster. Complex calculations of datasets are easily acquired and can
often be done in one line with chaining or calling functions in the
j-expressions. It's cool, try it.
## Additional Resources
* Vignettes included with the data.table package (see the list in
output from `help(package = "data.table")`
* StackOverflow has a very active discussion area for data.table:
http://stackoverflow.com/questions/tagged/data.table
* Data Analysis the Data.Table Way: The official Cheat Sheet for the
DataCamp course
[cheat sheet](https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+sheet.pdf)
* Advanced tips and tricks with data.table (by Andrew
Brooks)
[advanced data.table](http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/)
[//]: (All guides must have this as the final stanza)
## R Session Info
```{r sessionInfo, echo = FALSE}
sessionInfo()
```
```{r warnings, echo = FALSE}
if(!is.null(warnings())){
print("Warnings:")
warnings()
}
```
```{r RoptionsRestore, echo=FALSE, include=FALSE}
## Don't delete this. It puts the interactive session options
## back the way they were. If this is compiled within a session
## it is vital to do this.
options(opts.orig)
par(par.orig)
```
Available under
[Created Commons license 3.0 ](http://creativecommons.org/licenses/by/3.0/)