Install and Load the Data.Table package in R

The data.table package is available on CRAN, the default package repository. Installation can be accomplished with install.packages("data.table").

Load the data.table package into a session, just like any other package. The data.table package prints out a brief advertisement, as we will see.

  library(data.table)

For replication purposes, we will set the random seed as

set.seed(2345321)

The R session information is included the end of this guide.

Syntax: Pooltable Example

This section will introduce the syntax used to generate a new data.table, new variables, indexing the dataset, and creating summaries of subsets and the whole dataset.

Generate a data.table

Here we will create a dataset describing a set of billiard balls. The balls are numbered 1 through 15, the weight is generated by a random normal distribution, and the diameter is set by discrete values. Here we only define 3 values for the diameter but recycling will fill the data.table. By defining number and weight with 15 values we have set the length of the data.table. Defining another variable in instantiation of a data table with fewer than this 15 length will signal to data.table to recycle these numbers until 15 values are entered for that variable. You can see these three numbers repeated 5 times to fill up the diameter variable with 15 values. Try setting number to 1:16 or another range.

The syntax to create a data.table is identical to the syntax that would be required for a data.frame, except that we replace data.frame with data.table.

poolballs <- data.table(
    number = 1:15,
    weight = rnorm(15, 45.7, 0.8),
    diameter = c(3, 2.9, 3.1) #shows recyling
) 
poolballs
    number   weight diameter
 1:      1 45.36666      3.0
 2:      2 45.58336      2.9
 3:      3 44.51368      3.1
 4:      4 44.16628      3.0
 5:      5 44.91709      2.9
 6:      6 45.43478      3.1
 7:      7 45.88063      3.0
 8:      8 44.46412      2.9
 9:      9 45.00104      3.1
10:     10 46.36696      3.0
11:     11 46.43844      2.9
12:     12 45.92276      3.1
13:     13 46.89205      3.0
14:     14 46.53843      2.9
15:     15 46.03637      3.1

For comparison with the data.table version of the billiard ball data, we’ll make a data.frame that has the same information. This is called pooldf for clarity.

set.seed(2345321)
pooldf <- data.frame(
    number = 1:15,
    weight = rnorm(15, 45.7, 0.8),
    diameter = c(3, 2.9, 3.1)
) 
pooldf
   number   weight diameter
1       1 45.36666      3.0
2       2 45.58336      2.9
3       3 44.51368      3.1
4       4 44.16628      3.0
5       5 44.91709      2.9
6       6 45.43478      3.1
7       7 45.88063      3.0
8       8 44.46412      2.9
9       9 45.00104      3.1
10     10 46.36696      3.0
11     11 46.43844      2.9
12     12 45.92276      3.1
13     13 46.89205      3.0
14     14 46.53843      2.9
15     15 46.03637      3.1

We could achieve the same effect by using the function as.data.frame supplied by the data.table package.

Note that the data.table object seems to have column names and row names in the same way that a data.frame would:

colnames(poolballs)
[1] "number"   "weight"   "diameter"
rownames(poolballs)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13"
[14] "14" "15"

The data.table seems to have column and row names as usual, but they cannot be used in the ways we expect. This is surely one of the most interesting (and frustrating) differences between data.tables and data.frames. This difference, which we will illustrate with examples, is a major source of confusion among new data.table users.

Data Table indexing

Our poolballs object is a data.table. Using the syntax poolballs[i, j, by = k] we can define what rows – i – and columns – j – we want from our data. The third argument has many uses. Here, we have a third argument by, which signals that we want to group the results according to a variable k. There are many other keywords we can use in that third position, such as mult and with.

Grabbing specific rows

With the data.table, one can ask for rows by row number. We can retrieve rows 1 through 3 in the expected way

poolballs[1:3,]
   number   weight diameter
1:      1 45.36666      3.0
2:      2 45.58336      2.9
3:      3 44.51368      3.1

We can achieve the same effect by choosing rows by a logical condition. In this case, we want the rows for which the value of the variable number is less than 4. The following is equivalent.

poolballs[number < 4,]
   number   weight diameter
1:      1 45.36666      3.0
2:      2 45.58336      2.9
3:      3 44.51368      3.1

Users who are familiar with data.frames will be surprised that the column name number does not need to be quoted. The data.frame equivalent request would be rather more complicated in appearance:

pooldf[pooldf$number < 4, ]
  number   weight diameter
1      1 45.36666      3.0
2      2 45.58336      2.9
3      3 44.51368      3.1

These examples indicate that the data.table’s row-choosing square brackets allow either integer row indices or an expression to define a logical condition. The syntax for the data.frame alternative is either more clear or more tedious, depending on your taste. The fact the data.table poolballs is able to find the column number within itself is a significant syntactical difference.

Choosing elements by rowname causes an error in a data.table. Recall that the rownames were retrieved above, they are “1” through “15”, as characters.

> poolballs["1", ]
Error in `[.data.table`(poolballs, "1", ) : 
When i is a data.table (or character vector),
x must be keyed (i.e. sorted, and, marked as sorted) so
data.table knows which columns to join to and take advantage 
of x being sorted. Call setkey(x,...) first, see ?setkey.

Grabbing specific variables (columns)

One of the most notable differences between a data.table and a data.frame is that the column names must be used without quotation marks in a data.table. In a data.frame object named pooldf we would choose the column named “weight” by

pooldf[ , "weight"]

which is equivalent to

pooldf$weight

The data.table class will allow the following. Look, no quotes!:

poolballs[ , weight]
 [1] 45.36666 45.58336 44.51368 44.16628 44.91709 45.43478 45.88063
 [8] 44.46412 45.00104 46.36696 46.43844 45.92276 46.89205 46.53843
[15] 46.03637

There is also a different style to extract a colletion of columns from a data.table object. In the case of a data.frame, we choose two columns by name in this way

pooldf[ , c("weight", "diameter")]

while the data.table does not expect either a vector or quoted strings as arguments. The data.table expects a list.

poolballs[ , list(weight, diameter)]

or the shorthand .( for the list. This syntax is adopted from the package dplyr by Hadley Wickham.

poolballs[ , .(weight, diameter)]
      weight diameter
 1: 45.36666      3.0
 2: 45.58336      2.9
 3: 44.51368      3.1
 4: 44.16628      3.0
 5: 44.91709      2.9
 6: 45.43478      3.1
 7: 45.88063      3.0
 8: 44.46412      2.9
 9: 45.00104      3.1
10: 46.36696      3.0
11: 46.43844      2.9
12: 45.92276      3.1
13: 46.89205      3.0
14: 46.53843      2.9
15: 46.03637      3.1

Grabing columns by numbered index wors as it would with data.frame, but only in recent version of data.table.

poolballs[, 1]
    number
 1:      1
 2:      2
 3:      3
 4:      4
 5:      5
 6:      6
 7:      7
 8:      8
 9:      9
10:     10
11:     11
12:     12
13:     13
14:     14
15:     15

Data frames and data.tables allow access by column number, but there are good practical arguments against that kind of coding. Fewer unexpected results happen if we access the columns by their full name. Using variable names instead of column index numbers encourages flexible code development and reduces the number of user errors (i.e. when importing several datasets and one of the columns is not in the correct order a column name call will return the correct data while calling column 5 may return errors or include data from the incorrect variable).

The first column is named “number” in this example.

poolballs[, number]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

The columns in a data.table can be renamed with the setnames function from the data.table package. If DT is a data.table, we could change the names of the columns V1, V2, and V3 as follows.

setnames(DT,c("V1","V2","V3"), c("name1", "name2", "name3"))

Using data.frame DF changing columns names would required the following syntax

colnames(DT) <- c("name1", "name2", "name3")

This leaves the programmer or code reader blind to the previous colnames that are being overwritten. The data.frame approach rewrites the entire data.frame with the new colnames to memory. This highlights the major efficiency with data.table and it’s conservative use of memory. We’ll explore the processing speed differences between these two structures later in this guide.

Generating new variables within the data.table

In an ordinary R data.frame, adding a variable causes some seemingly unnecessary re-calculations. Adding a column causes a complete rewrite of all of the data.frame’s columns. It seems more efficient to leave the old columns where they were and simply add a new column. This seemingly obvious efficiency was one of the motivating ideas behind the data.table project in the first place.

New assignment symbol “:=”

The syntax of data.table allows the assignment of one new variable in DT with

DT[ , Xnew := any_function(x)]

The symbol := indicates we want to create a variable within DT. any_function represents any R function acting on a variable x, that already exists in DT. The new variable name Xnew must be a valid R column name. Here we put a comma inside the square brackets for clarity, [ , , to remind ourselves that we are creating this new variable for all rows. Then define the variable name and use := to define the variable’s value based on a calculation of other DT variables. We can manufacture several variables at once using a list syntax that will be illustrated below.

Example creating a new variable

Here we will define a new categorical variable in poolballs data.table called stripes, which indicates if a certain ball is a solid or stripe. We assign the values with the operator :=.

poolballs[ , stripes := ifelse(number%%2, 'solid', 'stripe')]
head(poolballs)

We used the R base ifelse function that lets us check a logical condition and assign one of two values if our first argument is true or false. The argument we use is modulo(%%) 2 of the number variable. So every other ball will be set as a solid or strip as is typical in billiard ball design.

It is possible to do any R computation We’ll use the head function to look at just the first 5 rows.

poolballs[ , density := weight/(1.333*pi*diameter^3)]
head(poolballs,5)
   number   weight diameter stripes   density
1:      1 45.36666      3.0   solid 0.4012296
2:      2 45.58336      2.9  stripe 0.4463056
3:      3 44.51368      3.1   solid 0.3568029
4:      4 44.16628      3.0  stripe 0.3906133
5:      5 44.91709      2.9   solid 0.4397821

You can generate multiple new variables in one line by using list() or the shorthand .(). You may noticed how often shorthand is used along with data.table as those building and using this package tend to write code as breif as possible. Below we use := to assign multiple variables in one line and add them to the dataset.

poolballs[ , `:=` (Id = sample(1:250, size=15), pickles = 'whatever')]
head(poolballs,5)
   number   weight diameter stripes   density  Id  pickles
1:      1 45.36666      3.0   solid 0.4012296 231 whatever
2:      2 45.58336      2.9  stripe 0.4463056  52 whatever
3:      3 44.51368      3.1   solid 0.3568029 136 whatever
4:      4 44.16628      3.0  stripe 0.3906133 215 whatever
5:      5 44.91709      2.9   solid 0.4397821 229 whatever

Assigning new variables in data.frames requires the $ and assignment object <-. i.e. DF$NewVarName <- DF$weight/(1.333*pi*DF$diameter^3) This isn’t much more work than data.table but you don’t have the convenience of the by = statement to create calculations based on a grouping variable. You may likely have to include a for() loop to get a similar result. See below for examples of the advantages of the by = statement.

Using “:=” to removie variables from a data.table

Unwanted or erroneous variables can be eliminated from the data by using NULL. We assign this value to the variable names in the j expression

poolballs[ , c("Id", "pickles") := NULL]

This line simply removes the data from data.table memory. Doing the equivalent with data.frame would re-write the entire object to memory, sans the deleted variables. Here again is an example of efficient operation that is going on behind the scenes with data.table.

Sorting data

setkey is a function that will arrange the data based on the values of your sorting variable.

setkey(poolballs, diameter)
poolballs
    number   weight diameter stripes   density
 1:      2 45.58336      2.9  stripe 0.4463056
 2:      5 44.91709      2.9   solid 0.4397821
 3:      8 44.46412      2.9  stripe 0.4353472
 4:     11 46.43844      2.9   solid 0.4546776
 5:     14 46.53843      2.9  stripe 0.4556566
 6:      1 45.36666      3.0   solid 0.4012296
 7:      4 44.16628      3.0  stripe 0.3906133
 8:      7 45.88063      3.0   solid 0.4057753
 9:     10 46.36696      3.0  stripe 0.4100764
10:     13 46.89205      3.0   solid 0.4147205
11:      3 44.51368      3.1   solid 0.3568029
12:      6 45.43478      3.1  stripe 0.3641860
13:      9 45.00104      3.1   solid 0.3607094
14:     12 45.92276      3.1  stripe 0.3680975
15:     15 46.03637      3.1   solid 0.3690082

The setkey also allows a lexicographic sort according to several varaibles. Here the data will first be sorted by stripes and then by weight. After using setkey the data.table printout reflects the sort.

setkey(poolballs, stripes, weight)
poolballs
    number   weight diameter stripes   density
 1:      3 44.51368      3.1   solid 0.3568029
 2:      5 44.91709      2.9   solid 0.4397821
 3:      9 45.00104      3.1   solid 0.3607094
 4:      1 45.36666      3.0   solid 0.4012296
 5:      7 45.88063      3.0   solid 0.4057753
 6:     15 46.03637      3.1   solid 0.3690082
 7:     11 46.43844      2.9   solid 0.4546776
 8:     13 46.89205      3.0   solid 0.4147205
 9:      4 44.16628      3.0  stripe 0.3906133
10:      8 44.46412      2.9  stripe 0.4353472
11:      6 45.43478      3.1  stripe 0.3641860
12:      2 45.58336      2.9  stripe 0.4463056
13:     12 45.92276      3.1  stripe 0.3680975
14:     10 46.36696      3.0  stripe 0.4100764
15:     14 46.53843      2.9  stripe 0.4556566

Aggregating variables with “by =”

We start with a simple benefit of the data.table design. It is very easy to calculate the average weight of pool balls within subgroups. This works best if we set the key as stripes

setkey(poolballs, stripes)

And then we will ask for the average weight of pool balls within the different values of stripes:

poolballs[ , mean(weight), by = stripes]
   stripes       V1
1:   solid 45.63075
2:  stripe 45.49667

Note that the newly created data.table object has one row for solids and one row for stripes, while the newly created average column is simply labeled V1. We are allowed to name that column if we wrap the calculation in a list, as in

poolballs[ , list(weight.mean = mean(weight)), by = stripes]
   stripes weight.mean
1:   solid    45.63075
2:  stripe    45.49667

The aggregation by stripes can be combined with a row-selection of poolballs. For example, we want the mean of balls that are numbered one through four, this will work:

poolballs[number < 5 , .(weight.mean = mean(weight)), by = stripes]
   stripes weight.mean
1:   solid    44.94017
2:  stripe    44.87482

Again we have used the shorthand .( instead of list(). Many who have worked with R data frames have used functions like tapply, by, and aggregate to achieve the same purpose. All of these will summarize information about subgroups in a data set. The output from these things is slightly different, but it is mostly understandable.

pooldf$stripes <- ifelse(pooldf$number %% 2, 'solid', 'stripe')
tapply(pooldf$weight, list(pooldf$stripes), mean)
   solid   stripe 
45.63075 45.49667 

or

aggregate(pooldf$weight, by = list(diameter = pooldf$stripes), mean)
  diameter        x
1    solid 45.63075
2   stripe 45.49667

The return objects are structured differently, but the numerical values are the same.

It may be a little tricky if one wants to put the group means back “onto” the original data frame. It is easy with data.table, however, if we use :=.

poolballs[ , weight.mean := mean(weight), by = stripes]

From there, it is easy to calculate “deviations about the mean within groups”.

poolballs[ , weight.dev := weight - weight.mean]
head(poolballs)
   number   weight diameter stripes   density weight.mean weight.dev
1:      3 44.51368      3.1   solid 0.3568029    45.63075 -1.1170630
2:      5 44.91709      2.9   solid 0.4397821    45.63075 -0.7136564
3:      9 45.00104      3.1   solid 0.3607094    45.63075 -0.6297099
4:      1 45.36666      3.0   solid 0.4012296    45.63075 -0.2640876
5:      7 45.88063      3.0   solid 0.4057753    45.63075  0.2498869
6:     15 46.03637      3.1   solid 0.3690082    45.63075  0.4056294

More on using i, j - expressions and by = k

Here is more detail on the data.table syntax. There are special symbolic variables that are used to simplify calculations.

  1. .N is a symbol referring to the number of rows within each value of the grouping variable

  2. .SD is a symbol used to refer to all of the columns.

We will explore these in this section.

Accessing blocks of cases by row index

You may want various statistics or calculations of your data but only of certain subsets. You can use an i expression to define rows you want. In addition, a grouping variable by can combine rows. If you have several different groups of rows you’d like to do calculations on, using a j expression is useful.

These useages generally create new data tables, including, for example, the average values of the variables for each subgroup. In these cases, we can include row selection in the i-expression, column selection with the .(j-expression), as well as a k-expression that represents a grouping variable.

Example with stipes and solids

Here is an example of finding the total weight of stripes and solid balls. We will define the total weight variable as tot_weight but notice this just defines the var in the temporary output and doesn’t define a new permanent variable in poolballs. For defining a permanent addition to poolballs we’d use :=.

poolballs[ , .(weight_tot = sum(weight)), by = stripes]
   stripes weight_tot
1:   solid   365.0460
2:  stripe   318.4767

What about the average weight of a solid or stripe ball? We can use a handy trick: the .N value, which returns the number of rows in each grouping.

poolballs[ , .(weight_avg = sum(weight)/.N), by = stripes]
   stripes weight_avg
1:   solid   45.63075
2:  stripe   45.49667

Verify that is the same as R’s built in mean function that was used above.

By putting multiple variables, stripes and diameter in the by statement, you create groups that match up in every combination of unique stripes and diameter values.

poolballs[ , .(weight.mean = mean(weight)), by = .(stripes,diameter)]
   stripes diameter weight.mean
1:   solid      3.1    45.18370
2:   solid      2.9    45.67776
3:   solid      3.0    46.04645
4:  stripe      3.0    45.26662
5:  stripe      2.9    45.52864
6:  stripe      3.1    45.67877

The setkey function makes some short hand possible for row extraction. Suppose we set the sorting key as stripes.

setkey(poolballs, stripes)

After that, then when we specify a value in the i argument, then that value is matched against the key variable. Note it is not necessary to specify that the value “solid” is going to be found in the variable named stripes in the following.

poolballs['solid', ]
   number   weight diameter stripes   density weight.mean weight.dev
1:      3 44.51368      3.1   solid 0.3568029    45.63075 -1.1170630
2:      5 44.91709      2.9   solid 0.4397821    45.63075 -0.7136564
3:      9 45.00104      3.1   solid 0.3607094    45.63075 -0.6297099
4:      1 45.36666      3.0   solid 0.4012296    45.63075 -0.2640876
5:      7 45.88063      3.0   solid 0.4057753    45.63075  0.2498869
6:     15 46.03637      3.1   solid 0.3690082    45.63075  0.4056294
7:     11 46.43844      2.9   solid 0.4546776    45.63075  0.8076923
8:     13 46.89205      3.0   solid 0.4147205    45.63075  1.2613083

The output includes all of the balls. The data.table package includes a keyword that can be used for the k argument, mult. That is set as “all” by default, but it can be set as “first” or “last” as well:

poolballs['solid', mult = 'first'] 
   number   weight diameter stripes   density weight.mean weight.dev
1:      3 44.51368      3.1   solid 0.3568029    45.63075  -1.117063
poolballs['solid', mult = 'last'] 
   number   weight diameter stripes   density weight.mean weight.dev
1:     13 46.89205        3   solid 0.4147205    45.63075   1.261308

Calculations for the entire data set: .SD

There may be instances where you’d like to compute something using an entire subset of the data.table instead of just one or two variables. When using by = the .SD reference will supply all variables besides the one in the by statement. Below we return the variance for all columns of poolballs by stripes grouping. This is handy for passing the data to a function.

poolballs[, lapply(.SD, var), by  = stripes]
   stripes   number    weight    diameter     density weight.mean
1:   solid 24.00000 0.6705223 0.006964286 0.001313305           0
2:  stripe 18.66667 0.8119409 0.008095238 0.001379280           0
   weight.dev
1:  0.6705223
2:  0.8119409

You may want this calculation for all but a few variables in a dataset. You can define the scope of .SD by using .SDcols after the by statement. Then you may have the option of using the whole dataset but could allow a user to define which columns to include.

poolballs[ , lapply(.SD, var), by  = stripes, .SDcols = c("weight", "diameter")]
   stripes    weight    diameter
1:   solid 0.6705223 0.006964286
2:  stripe 0.8119409 0.008095238

Speed DF vs. DT

This section is partially adapted from the data.table vignette that is available with the package. Run vignette("datatable-intro-vignette").

Setting up a Huge data example

First we will create some hyyuuuuge datasets by repeating the poolball dataset along with some new variables: manufacturers and price. The original data and manufacturers will be recycled until we’ve repeated enough poolballs sets to meet the number of prices we’ve just generated, 15e6

DF <- data.frame(rep(poolballs), 
           manufacturers=c('James', 'Smith', 'HardBalls', 'Billiard Buddies'),
           price = round(rnorm(15e6, 1.65, 2.06), 2))

DT <- data.table(DF)

After copying our DF into a data.table, DT, we can now compare the timing of some operations with these two identical sets of data. Defining your dataset as a data.table or data.frame should take equivalent time.

Selecting a subset

For a data.table we first need to use setkey() to sort or order the data. This takes a little time but will be well worth the speed increase in subsequent calculations. Below data will first be sorted by manufacturers and then sorted by price second. For these sorted columns, we can pass a list of values that to get the rows we want.

  system.time(setkey(DT, manufacturers, price)) #initial sorting needed for DT
   user  system elapsed 
  9.663   1.379   2.791 
  system.time(ans1 <- DT[.("Smith",1.69)])
   user  system elapsed 
  0.002   0.000   0.003 

This subset assignment happens almost instantly. See how much slower DF would be doing the same thing. A few subset selections and it would take almost as long as doing the multi variable sorting.

  system.time(ans2 <- DF[DF$manufacturers == c("Smith") & DF$price==1.69,])
   user  system elapsed 
   0.21    0.02    0.23 

Note the differences in elapsed time. Let’s check that we got the identical rows both ways (tail shows us the last rows, here I’ve specified the last 3)

  tail(ans1,3)
   number   weight diameter stripes   density weight.mean  weight.dev
1:      6 45.43478      3.1  stripe 0.3641860    45.49667 -0.06189089
2:     15 46.03637      3.1   solid 0.3690082    45.63075  0.40562938
3:     14 46.53843      2.9  stripe 0.4556566    45.49667  1.04175713
   manufacturers price
1:         Smith  1.69
2:         Smith  1.69
3:         Smith  1.69
  tail(ans2,3)
         number   weight diameter stripes   density weight.mean
14995586      6 45.43478      3.1  stripe 0.3641860    45.49667
14996466     15 46.03637      3.1   solid 0.3690082    45.63075
14998230     14 46.53843      2.9  stripe 0.4556566    45.49667
          weight.dev manufacturers price
14995586 -0.06189089         Smith  1.69
14996466  0.40562938         Smith  1.69
14998230  1.04175713         Smith  1.69

Computing a summary of data

Here we find the average price of a poolball by manufacturer. Using data.table we get ans3 pretty quickly

system.time(ans3 <- DT[,.(manu_mean = mean(price)), by = manufacturers])
   user  system elapsed 
  0.245   0.020   0.265 
ans3
      manufacturers manu_mean
1: Billiard Buddies  1.648792
2:        HardBalls  1.651874
3:            James  1.649076
4:            Smith  1.650318

Using the data.frame is much more cumbersome so we will just compute the mean for two manufacturers. Again, note the difference in elapsed time.

system.time(Smithmean <- mean(DF[DF$manufacturers == "Smith",]$price), 
            Jamesmean <- mean(DF[DF$manufacturers == "James",]$price))
   user  system elapsed 
  0.461   0.133   0.594 

Now we can just check they are the same as above.

Jamesmean
Smithmean

Using aggregate we can get the same results from one line with the DF but the results come at a big computing cost.

system.time(ans4 <- aggregate(DF$price, by = list(Manu_mean = DF$manufacturers), mean))
ans4
   user  system elapsed 
  4.426   0.640   5.066 
         Manu_mean        x
1 Billiard Buddies 1.648792
2        HardBalls 1.651874
3            James 1.649076
4            Smith 1.650318

Using tapply with Data.Frame’s is much more efficient.

system.time(ans5 <- tapply(DF$price, list(Manu_mean = DF$manufacturers), mean))
ans5
   user  system elapsed 
  0.147   0.083   0.230 
Manu_mean
Billiard Buddies        HardBalls            James            Smith 
        1.648792         1.651874         1.649076         1.650318 

Temporary values

You can execute multiple calculations in one j-expression using { }. Here we can even create a temporary variable, cost, and compute another value using the temporary variable. Here we find the profit by manufacturer and stripes, assuming the cost is 3 cents per gram weight. Notice that subsequent calculations should be separated with a ; not a ,. This approach will make more complex calculations easier to read and take much less time.

system.time(DT[ , {cost = weight*0.03; profit = mean(price-cost)}, 
               by = .(stripes, manufacturers)])
   user  system elapsed 
  0.479   0.080   0.538 
   stripes    manufacturers        V1
1:  stripe Billiard Buddies 0.2835304
2:   solid Billiard Buddies 0.2801864
3:   solid        HardBalls 0.2829155
4:  stripe        HardBalls 0.2870156
5:   solid            James 0.2809137
6:  stripe            James 0.2833062
7:   solid            Smith 0.2801392
8:  stripe            Smith 0.2868533

This specific example can be be done in one line with data.frame, using tapply. If you were making multiple processing steps, using multiple functions tapply may not work or you’d have to start imbedding them and it would get messy. But for this example, if you’re a fan of data.frames, $ and tapply then you can do what data.table does too.

system.time(
  tapply((DF$price-DF$weight*0.03), list(DF$stripes,DF$manufacturers),
    mean))
   user  system elapsed 
  0.294   0.128   0.422 

Chaining

Another way to accomplish a whole lot in one line is by ‘Chaining’. This method utilizes the fact that every DT[ ] call returns the data or subset of data defined by the i, j and k expressions. With this new subset you may again call new i, j and k expressions but adding another [i,j,k=by ] after the first DT[i,j,k=by].

The example below attempts to find the most expensive average cost poolball per manufacturer.

setkey(DT,number, stripes, manufacturers)
DT[,.(meancost = mean(price-weight*0.03)), by=.(stripes, manufacturers)][ ,
      .(number, stripes,expensive = max(meancost)),by = manufacturers]

This returns an error

We attempted to get the number of the ball that was most expensive but because we are looking for the number variable in the second [ ] it isn’t there. In the second [ ] we are working with a subset of the original data.table, generated by the first [ ], which has excluded number and only includes stripes, manufactureres and meancost. We get the most expensive ball for each manufacturer/stripes below.

DT[,.(meancost = mean(price-weight*0.03)), by=.(manufacturers,stripes)][ ,
                          .(expensive = max(meancost)), by = manufacturers]
      manufacturers expensive
1: Billiard Buddies 0.2835304
2:        HardBalls 0.2870156
3:            James 0.2833062
4:            Smith 0.2868533

Conclusions

Data.table allows for some very intuitive indexing and subset calculations that take far less time than data frame on very large datasets. The approaches shown here will aid in flexible, easier to read programming that can be accomplished in fewer lines of code. The same calculations using data.frame will often take much longer. Functions like sorting and summarizing data are done much faster. Complex calculations of datasets are easily acquired and can often be done in one line with chaining or calling functions in the j-expressions. It’s cool, try it.

Additional Resources

R Session Info

R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] data.table_1.11.4 stationery_0.85  

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17    knitr_1.20      magrittr_1.5    kutils_1.46    
 [5] mnormt_1.5-5    pbivnorm_0.6.0  xtable_1.8-2    stringr_1.3.1  
 [9] plyr_1.8.4      tools_3.5.0     htmltools_0.3.6 yaml_2.1.19    
[13] digest_0.6.15   rprojroot_1.3-2 lavaan_0.6-1    zip_1.0.0      
[17] evaluate_0.10.1 rmarkdown_1.10  openxlsx_4.1.0  stringi_1.2.3  
[21] compiler_3.5.0  backports_1.1.2 stats4_3.5.0    foreign_0.8-70 

Available under Created Commons license 3.0 CC BY