Guide No: 40
Jeremy Burnison, CRMDA, KU <jburnison@ku.edu>
Paul Johnson, CRMDA, KU <pauljohn@ku.edu>
Please visit http://crmda.ku.edu/guides
Keywords: guide, data.table, SQL
Jul. 30 2018
Abstract
This is an introduction to the data.table package in R. data.table is a high performance replacement for R’s built-in data.frame. There are significant differences in the syntax for usage. Proponents of data.table prefer its simpler notation and ease of use. While the style differences are noteworthy, for most who consider adopting data.table, the major advantage of data.table over the standard data.frame format is speed in processing. This makes data.table desirable for ‘big data’ analysis. While data.table is an alternative to data.frame, it is intended to be compatible with most functions that expect to receive a data.frame object. That is to say, data.frame-centric code will work with a data.table, but it will run faster! Included in this tutorial are examples of indexing and aggregating data variables using the data.table syntax. The first example dataset used is not large so that you can see what happens to the data with the syntax used.
The data.table package is available on CRAN, the default package repository. Installation can be accomplished with install.packages("data.table")
.
Load the data.table package into a session, just like any other package. The data.table package prints out a brief advertisement, as we will see.
library(data.table)
For replication purposes, we will set the random seed as
set.seed(2345321)
The R session information is included the end of this guide.
This section will introduce the syntax used to generate a new data.table, new variables, indexing the dataset, and creating summaries of subsets and the whole dataset.
Here we will create a dataset describing a set of billiard balls. The balls are numbered 1 through 15, the weight is generated by a random normal distribution, and the diameter is set by discrete values. Here we only define 3 values for the diameter but recycling will fill the data.table. By defining number and weight with 15 values we have set the length of the data.table. Defining another variable in instantiation of a data table with fewer than this 15 length will signal to data.table to recycle these numbers until 15 values are entered for that variable. You can see these three numbers repeated 5 times to fill up the diameter variable with 15 values. Try setting number to 1:16 or another range.
The syntax to create a data.table is identical to the syntax that would be required for a data.frame, except that we replace data.frame
with data.table
.
poolballs <- data.table(
number = 1:15,
weight = rnorm(15, 45.7, 0.8),
diameter = c(3, 2.9, 3.1) #shows recyling
)
poolballs
number weight diameter
1: 1 45.36666 3.0
2: 2 45.58336 2.9
3: 3 44.51368 3.1
4: 4 44.16628 3.0
5: 5 44.91709 2.9
6: 6 45.43478 3.1
7: 7 45.88063 3.0
8: 8 44.46412 2.9
9: 9 45.00104 3.1
10: 10 46.36696 3.0
11: 11 46.43844 2.9
12: 12 45.92276 3.1
13: 13 46.89205 3.0
14: 14 46.53843 2.9
15: 15 46.03637 3.1
For comparison with the data.table version of the billiard ball data, we’ll make a data.frame that has the same information. This is called pooldf
for clarity.
set.seed(2345321)
pooldf <- data.frame(
number = 1:15,
weight = rnorm(15, 45.7, 0.8),
diameter = c(3, 2.9, 3.1)
)
pooldf
number weight diameter
1 1 45.36666 3.0
2 2 45.58336 2.9
3 3 44.51368 3.1
4 4 44.16628 3.0
5 5 44.91709 2.9
6 6 45.43478 3.1
7 7 45.88063 3.0
8 8 44.46412 2.9
9 9 45.00104 3.1
10 10 46.36696 3.0
11 11 46.43844 2.9
12 12 45.92276 3.1
13 13 46.89205 3.0
14 14 46.53843 2.9
15 15 46.03637 3.1
We could achieve the same effect by using the function as.data.frame
supplied by the data.table package.
Note that the data.table object seems to have column names and row names in the same way that a data.frame would:
colnames(poolballs)
[1] "number" "weight" "diameter"
rownames(poolballs)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13"
[14] "14" "15"
The data.table seems to have column and row names as usual, but they cannot be used in the ways we expect. This is surely one of the most interesting (and frustrating) differences between data.tables and data.frames. This difference, which we will illustrate with examples, is a major source of confusion among new data.table users.
Our poolballs object is a data.table. Using the syntax poolballs[i, j, by = k]
we can define what rows – i
– and columns – j
– we want from our data. The third argument has many uses. Here, we have a third argument by
, which signals that we want to group the results according to a variable k
. There are many other keywords we can use in that third position, such as mult
and with
.
With the data.table, one can ask for rows by row number. We can retrieve rows 1 through 3 in the expected way
poolballs[1:3,]
number weight diameter
1: 1 45.36666 3.0
2: 2 45.58336 2.9
3: 3 44.51368 3.1
We can achieve the same effect by choosing rows by a logical condition. In this case, we want the rows for which the value of the variable number
is less than 4. The following is equivalent.
poolballs[number < 4,]
number weight diameter
1: 1 45.36666 3.0
2: 2 45.58336 2.9
3: 3 44.51368 3.1
Users who are familiar with data.frames will be surprised that the column name number
does not need to be quoted. The data.frame equivalent request would be rather more complicated in appearance:
pooldf[pooldf$number < 4, ]
number weight diameter
1 1 45.36666 3.0
2 2 45.58336 2.9
3 3 44.51368 3.1
These examples indicate that the data.table’s row-choosing square brackets allow either integer row indices or an expression to define a logical condition. The syntax for the data.frame alternative is either more clear or more tedious, depending on your taste. The fact the data.table poolballs
is able to find the column number
within itself is a significant syntactical difference.
Choosing elements by rowname causes an error in a data.table. Recall that the rownames were retrieved above, they are “1” through “15”, as characters.
> poolballs["1", ]
Error in `[.data.table`(poolballs, "1", ) :
When i is a data.table (or character vector),
x must be keyed (i.e. sorted, and, marked as sorted) so
data.table knows which columns to join to and take advantage
of x being sorted. Call setkey(x,...) first, see ?setkey.
One of the most notable differences between a data.table and a data.frame is that the column names must be used without quotation marks in a data.table. In a data.frame object named pooldf
we would choose the column named “weight” by
pooldf[ , "weight"]
which is equivalent to
pooldf$weight
The data.table class will allow the following. Look, no quotes!:
poolballs[ , weight]
[1] 45.36666 45.58336 44.51368 44.16628 44.91709 45.43478 45.88063
[8] 44.46412 45.00104 46.36696 46.43844 45.92276 46.89205 46.53843
[15] 46.03637
There is also a different style to extract a colletion of columns from a data.table object. In the case of a data.frame, we choose two columns by name in this way
pooldf[ , c("weight", "diameter")]
while the data.table does not expect either a vector or quoted strings as arguments. The data.table expects a list.
poolballs[ , list(weight, diameter)]
or the shorthand .(
for the list. This syntax is adopted from the package dplyr by Hadley Wickham.
poolballs[ , .(weight, diameter)]
weight diameter
1: 45.36666 3.0
2: 45.58336 2.9
3: 44.51368 3.1
4: 44.16628 3.0
5: 44.91709 2.9
6: 45.43478 3.1
7: 45.88063 3.0
8: 44.46412 2.9
9: 45.00104 3.1
10: 46.36696 3.0
11: 46.43844 2.9
12: 45.92276 3.1
13: 46.89205 3.0
14: 46.53843 2.9
15: 46.03637 3.1
Grabing columns by numbered index wors as it would with data.frame, but only in recent version of data.table.
poolballs[, 1]
number
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
11: 11
12: 12
13: 13
14: 14
15: 15
Data frames and data.tables allow access by column number, but there are good practical arguments against that kind of coding. Fewer unexpected results happen if we access the columns by their full name. Using variable names instead of column index numbers encourages flexible code development and reduces the number of user errors (i.e. when importing several datasets and one of the columns is not in the correct order a column name call will return the correct data while calling column 5 may return errors or include data from the incorrect variable).
The first column is named “number” in this example.
poolballs[, number]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
The columns in a data.table can be renamed with the setnames
function from the data.table package. If DT
is a data.table, we could change the names of the columns V1
, V2
, and V3
as follows.
setnames(DT,c("V1","V2","V3"), c("name1", "name2", "name3"))
Using data.frame DF changing columns names would required the following syntax
colnames(DT) <- c("name1", "name2", "name3")
This leaves the programmer or code reader blind to the previous colnames that are being overwritten. The data.frame approach rewrites the entire data.frame with the new colnames to memory. This highlights the major efficiency with data.table and it’s conservative use of memory. We’ll explore the processing speed differences between these two structures later in this guide.
In an ordinary R data.frame, adding a variable causes some seemingly unnecessary re-calculations. Adding a column causes a complete rewrite of all of the data.frame’s columns. It seems more efficient to leave the old columns where they were and simply add a new column. This seemingly obvious efficiency was one of the motivating ideas behind the data.table project in the first place.
The syntax of data.table allows the assignment of one new variable in DT
with
DT[ , Xnew := any_function(x)]
The symbol :=
indicates we want to create a variable within DT
. any_function
represents any R
function acting on a variable x
, that already exists in DT
. The new variable name Xnew
must be a valid R
column name. Here we put a comma inside the square brackets for clarity, [ ,
, to remind ourselves that we are creating this new variable for all rows. Then define the variable name and use :=
to define the variable’s value based on a calculation of other DT
variables. We can manufacture several variables at once using a list syntax that will be illustrated below.
Here we will define a new categorical variable in poolballs
data.table called stripes
, which indicates if a certain ball is a solid
or stripe
. We assign the values with the operator :=
.
poolballs[ , stripes := ifelse(number%%2, 'solid', 'stripe')]
head(poolballs)
We used the R
base ifelse
function that lets us check a logical condition and assign one of two values if our first argument is true or false. The argument we use is modulo(%%) 2 of the number
variable. So every other ball will be set as a solid or strip as is typical in billiard ball design.
It is possible to do any R computation We’ll use the head
function to look at just the first 5 rows.
poolballs[ , density := weight/(1.333*pi*diameter^3)]
head(poolballs,5)
number weight diameter stripes density
1: 1 45.36666 3.0 solid 0.4012296
2: 2 45.58336 2.9 stripe 0.4463056
3: 3 44.51368 3.1 solid 0.3568029
4: 4 44.16628 3.0 stripe 0.3906133
5: 5 44.91709 2.9 solid 0.4397821
You can generate multiple new variables in one line by using list()
or the shorthand .()
. You may noticed how often shorthand is used along with data.table as those building and using this package tend to write code as breif as possible. Below we use :=
to assign multiple variables in one line and add them to the dataset.
poolballs[ , `:=` (Id = sample(1:250, size=15), pickles = 'whatever')]
head(poolballs,5)
number weight diameter stripes density Id pickles
1: 1 45.36666 3.0 solid 0.4012296 231 whatever
2: 2 45.58336 2.9 stripe 0.4463056 52 whatever
3: 3 44.51368 3.1 solid 0.3568029 136 whatever
4: 4 44.16628 3.0 stripe 0.3906133 215 whatever
5: 5 44.91709 2.9 solid 0.4397821 229 whatever
Assigning new variables in data.frames requires the $
and assignment object <-
. i.e. DF$NewVarName <- DF$weight/(1.333*pi*DF$diameter^3)
This isn’t much more work than data.table but you don’t have the convenience of the by =
statement to create calculations based on a grouping variable. You may likely have to include a for()
loop to get a similar result. See below for examples of the advantages of the by =
statement.
Unwanted or erroneous variables can be eliminated from the data by using NULL
. We assign this value to the variable names in the j expression
poolballs[ , c("Id", "pickles") := NULL]
This line simply removes the data from data.table memory. Doing the equivalent with data.frame would re-write the entire object to memory, sans the deleted variables. Here again is an example of efficient operation that is going on behind the scenes with data.table.
setkey
is a function that will arrange the data based on the values of your sorting variable.
setkey(poolballs, diameter)
poolballs
number weight diameter stripes density
1: 2 45.58336 2.9 stripe 0.4463056
2: 5 44.91709 2.9 solid 0.4397821
3: 8 44.46412 2.9 stripe 0.4353472
4: 11 46.43844 2.9 solid 0.4546776
5: 14 46.53843 2.9 stripe 0.4556566
6: 1 45.36666 3.0 solid 0.4012296
7: 4 44.16628 3.0 stripe 0.3906133
8: 7 45.88063 3.0 solid 0.4057753
9: 10 46.36696 3.0 stripe 0.4100764
10: 13 46.89205 3.0 solid 0.4147205
11: 3 44.51368 3.1 solid 0.3568029
12: 6 45.43478 3.1 stripe 0.3641860
13: 9 45.00104 3.1 solid 0.3607094
14: 12 45.92276 3.1 stripe 0.3680975
15: 15 46.03637 3.1 solid 0.3690082
The setkey also allows a lexicographic sort according to several varaibles. Here the data will first be sorted by stripes
and then by weight
. After using setkey the data.table printout reflects the sort.
setkey(poolballs, stripes, weight)
poolballs
number weight diameter stripes density
1: 3 44.51368 3.1 solid 0.3568029
2: 5 44.91709 2.9 solid 0.4397821
3: 9 45.00104 3.1 solid 0.3607094
4: 1 45.36666 3.0 solid 0.4012296
5: 7 45.88063 3.0 solid 0.4057753
6: 15 46.03637 3.1 solid 0.3690082
7: 11 46.43844 2.9 solid 0.4546776
8: 13 46.89205 3.0 solid 0.4147205
9: 4 44.16628 3.0 stripe 0.3906133
10: 8 44.46412 2.9 stripe 0.4353472
11: 6 45.43478 3.1 stripe 0.3641860
12: 2 45.58336 2.9 stripe 0.4463056
13: 12 45.92276 3.1 stripe 0.3680975
14: 10 46.36696 3.0 stripe 0.4100764
15: 14 46.53843 2.9 stripe 0.4556566
We start with a simple benefit of the data.table design. It is very easy to calculate the average weight of pool balls within subgroups. This works best if we set the key as stripes
setkey(poolballs, stripes)
And then we will ask for the average weight of pool balls within the different values of stripes:
poolballs[ , mean(weight), by = stripes]
stripes V1
1: solid 45.63075
2: stripe 45.49667
Note that the newly created data.table object has one row for solids and one row for stripes, while the newly created average column is simply labeled V1
. We are allowed to name that column if we wrap the calculation in a list, as in
poolballs[ , list(weight.mean = mean(weight)), by = stripes]
stripes weight.mean
1: solid 45.63075
2: stripe 45.49667
The aggregation by
stripes can be combined with a row-selection of poolballs. For example, we want the mean of balls that are numbered one through four, this will work:
poolballs[number < 5 , .(weight.mean = mean(weight)), by = stripes]
stripes weight.mean
1: solid 44.94017
2: stripe 44.87482
Again we have used the shorthand .(
instead of list()
. Many who have worked with R data frames have used functions like tapply
, by
, and aggregate
to achieve the same purpose. All of these will summarize information about subgroups in a data set. The output from these things is slightly different, but it is mostly understandable.
pooldf$stripes <- ifelse(pooldf$number %% 2, 'solid', 'stripe')
tapply(pooldf$weight, list(pooldf$stripes), mean)
solid stripe
45.63075 45.49667
or
aggregate(pooldf$weight, by = list(diameter = pooldf$stripes), mean)
diameter x
1 solid 45.63075
2 stripe 45.49667
The return objects are structured differently, but the numerical values are the same.
It may be a little tricky if one wants to put the group means back “onto” the original data frame. It is easy with data.table, however, if we use :=
.
poolballs[ , weight.mean := mean(weight), by = stripes]
From there, it is easy to calculate “deviations about the mean within groups”.
poolballs[ , weight.dev := weight - weight.mean]
head(poolballs)
number weight diameter stripes density weight.mean weight.dev
1: 3 44.51368 3.1 solid 0.3568029 45.63075 -1.1170630
2: 5 44.91709 2.9 solid 0.4397821 45.63075 -0.7136564
3: 9 45.00104 3.1 solid 0.3607094 45.63075 -0.6297099
4: 1 45.36666 3.0 solid 0.4012296 45.63075 -0.2640876
5: 7 45.88063 3.0 solid 0.4057753 45.63075 0.2498869
6: 15 46.03637 3.1 solid 0.3690082 45.63075 0.4056294
i
, j
- expressions and by = k
Here is more detail on the data.table syntax. There are special symbolic variables that are used to simplify calculations.
.N
is a symbol referring to the number of rows within each value of the grouping variable
.SD
is a symbol used to refer to all of the columns.
We will explore these in this section.
You may want various statistics or calculations of your data but only of certain subsets. You can use an i
expression to define rows you want. In addition, a grouping variable by
can combine rows. If you have several different groups of rows you’d like to do calculations on, using a j
expression is useful.
These useages generally create new data tables, including, for example, the average values of the variables for each subgroup. In these cases, we can include row selection in the i-expression
, column selection with the .(j-expression)
, as well as a k-expression
that represents a grouping variable.
Here is an example of finding the total weight of stripes and solid balls. We will define the total weight variable as tot_weight
but notice this just defines the var in the temporary output and doesn’t define a new permanent variable in poolballs
. For defining a permanent addition to poolballs
we’d use :=
.
poolballs[ , .(weight_tot = sum(weight)), by = stripes]
stripes weight_tot
1: solid 365.0460
2: stripe 318.4767
What about the average weight of a solid or stripe ball? We can use a handy trick: the .N
value, which returns the number of rows in each grouping.
poolballs[ , .(weight_avg = sum(weight)/.N), by = stripes]
stripes weight_avg
1: solid 45.63075
2: stripe 45.49667
Verify that is the same as R’s built in mean function that was used above.
By putting multiple variables, stripes
and diameter
in the by
statement, you create groups that match up in every combination of unique stripes
and diameter
values.
poolballs[ , .(weight.mean = mean(weight)), by = .(stripes,diameter)]
stripes diameter weight.mean
1: solid 3.1 45.18370
2: solid 2.9 45.67776
3: solid 3.0 46.04645
4: stripe 3.0 45.26662
5: stripe 2.9 45.52864
6: stripe 3.1 45.67877
The setkey
function makes some short hand possible for row extraction. Suppose we set the sorting key as stripes
.
setkey(poolballs, stripes)
After that, then when we specify a value in the i
argument, then that value is matched against the key variable. Note it is not necessary to specify that the value “solid” is going to be found in the variable named stripes
in the following.
poolballs['solid', ]
number weight diameter stripes density weight.mean weight.dev
1: 3 44.51368 3.1 solid 0.3568029 45.63075 -1.1170630
2: 5 44.91709 2.9 solid 0.4397821 45.63075 -0.7136564
3: 9 45.00104 3.1 solid 0.3607094 45.63075 -0.6297099
4: 1 45.36666 3.0 solid 0.4012296 45.63075 -0.2640876
5: 7 45.88063 3.0 solid 0.4057753 45.63075 0.2498869
6: 15 46.03637 3.1 solid 0.3690082 45.63075 0.4056294
7: 11 46.43844 2.9 solid 0.4546776 45.63075 0.8076923
8: 13 46.89205 3.0 solid 0.4147205 45.63075 1.2613083
The output includes all of the balls. The data.table package includes a keyword that can be used for the k
argument, mult
. That is set as “all” by default, but it can be set as “first” or “last” as well:
poolballs['solid', mult = 'first']
number weight diameter stripes density weight.mean weight.dev
1: 3 44.51368 3.1 solid 0.3568029 45.63075 -1.117063
poolballs['solid', mult = 'last']
number weight diameter stripes density weight.mean weight.dev
1: 13 46.89205 3 solid 0.4147205 45.63075 1.261308
There may be instances where you’d like to compute something using an entire subset of the data.table instead of just one or two variables. When using by =
the .SD
reference will supply all variables besides the one in the by
statement. Below we return the variance for all columns of poolballs by stripes grouping. This is handy for passing the data to a function.
poolballs[, lapply(.SD, var), by = stripes]
stripes number weight diameter density weight.mean
1: solid 24.00000 0.6705223 0.006964286 0.001313305 0
2: stripe 18.66667 0.8119409 0.008095238 0.001379280 0
weight.dev
1: 0.6705223
2: 0.8119409
You may want this calculation for all but a few variables in a dataset. You can define the scope of .SD
by using .SDcols
after the by
statement. Then you may have the option of using the whole dataset but could allow a user to define which columns to include.
poolballs[ , lapply(.SD, var), by = stripes, .SDcols = c("weight", "diameter")]
stripes weight diameter
1: solid 0.6705223 0.006964286
2: stripe 0.8119409 0.008095238
This section is partially adapted from the data.table vignette that is available with the package. Run vignette("datatable-intro-vignette")
.
First we will create some hyyuuuuge datasets by repeating the poolball
dataset along with some new variables: manufacturers
and price
. The original data and manufacturers
will be recycled until we’ve repeated enough poolballs sets to meet the number of prices we’ve just generated, 15e6
DF <- data.frame(rep(poolballs),
manufacturers=c('James', 'Smith', 'HardBalls', 'Billiard Buddies'),
price = round(rnorm(15e6, 1.65, 2.06), 2))
DT <- data.table(DF)
After copying our DF
into a data.table, DT
, we can now compare the timing of some operations with these two identical sets of data. Defining your dataset as a data.table or data.frame should take equivalent time.
For a data.table we first need to use setkey()
to sort or order the data. This takes a little time but will be well worth the speed increase in subsequent calculations. Below data will first be sorted by manufacturers
and then sorted by price
second. For these sorted columns, we can pass a list of values that to get the rows we want.
system.time(setkey(DT, manufacturers, price)) #initial sorting needed for DT
user system elapsed
9.663 1.379 2.791
system.time(ans1 <- DT[.("Smith",1.69)])
user system elapsed
0.002 0.000 0.003
This subset assignment happens almost instantly. See how much slower DF
would be doing the same thing. A few subset selections and it would take almost as long as doing the multi variable sorting.
system.time(ans2 <- DF[DF$manufacturers == c("Smith") & DF$price==1.69,])
user system elapsed
0.21 0.02 0.23
Note the differences in elapsed time. Let’s check that we got the identical rows both ways (tail
shows us the last rows, here I’ve specified the last 3)
tail(ans1,3)
number weight diameter stripes density weight.mean weight.dev
1: 6 45.43478 3.1 stripe 0.3641860 45.49667 -0.06189089
2: 15 46.03637 3.1 solid 0.3690082 45.63075 0.40562938
3: 14 46.53843 2.9 stripe 0.4556566 45.49667 1.04175713
manufacturers price
1: Smith 1.69
2: Smith 1.69
3: Smith 1.69
tail(ans2,3)
number weight diameter stripes density weight.mean
14995586 6 45.43478 3.1 stripe 0.3641860 45.49667
14996466 15 46.03637 3.1 solid 0.3690082 45.63075
14998230 14 46.53843 2.9 stripe 0.4556566 45.49667
weight.dev manufacturers price
14995586 -0.06189089 Smith 1.69
14996466 0.40562938 Smith 1.69
14998230 1.04175713 Smith 1.69
Here we find the average price of a poolball by manufacturer. Using data.table we get ans3
pretty quickly
system.time(ans3 <- DT[,.(manu_mean = mean(price)), by = manufacturers])
user system elapsed
0.245 0.020 0.265
ans3
manufacturers manu_mean
1: Billiard Buddies 1.648792
2: HardBalls 1.651874
3: James 1.649076
4: Smith 1.650318
Using the data.frame is much more cumbersome so we will just compute the mean for two manufacturers. Again, note the difference in elapsed time.
system.time(Smithmean <- mean(DF[DF$manufacturers == "Smith",]$price),
Jamesmean <- mean(DF[DF$manufacturers == "James",]$price))
user system elapsed
0.461 0.133 0.594
Now we can just check they are the same as above.
Jamesmean
Smithmean
Using aggregate we can get the same results from one line with the DF but the results come at a big computing cost.
system.time(ans4 <- aggregate(DF$price, by = list(Manu_mean = DF$manufacturers), mean))
ans4
user system elapsed
4.426 0.640 5.066
Manu_mean x
1 Billiard Buddies 1.648792
2 HardBalls 1.651874
3 James 1.649076
4 Smith 1.650318
Using tapply
with Data.Frame’s is much more efficient.
system.time(ans5 <- tapply(DF$price, list(Manu_mean = DF$manufacturers), mean))
ans5
user system elapsed
0.147 0.083 0.230
Manu_mean
Billiard Buddies HardBalls James Smith
1.648792 1.651874 1.649076 1.650318
You can execute multiple calculations in one j-expression using { }
. Here we can even create a temporary variable, cost
, and compute another value using the temporary variable. Here we find the profit by manufacturer and stripes, assuming the cost is 3 cents per gram weight. Notice that subsequent calculations should be separated with a ;
not a ,
. This approach will make more complex calculations easier to read and take much less time.
system.time(DT[ , {cost = weight*0.03; profit = mean(price-cost)},
by = .(stripes, manufacturers)])
user system elapsed
0.479 0.080 0.538
stripes manufacturers V1
1: stripe Billiard Buddies 0.2835304
2: solid Billiard Buddies 0.2801864
3: solid HardBalls 0.2829155
4: stripe HardBalls 0.2870156
5: solid James 0.2809137
6: stripe James 0.2833062
7: solid Smith 0.2801392
8: stripe Smith 0.2868533
This specific example can be be done in one line with data.frame, using tapply
. If you were making multiple processing steps, using multiple functions tapply may not work or you’d have to start imbedding them and it would get messy. But for this example, if you’re a fan of data.frames, $
and tapply
then you can do what data.table does too.
system.time(
tapply((DF$price-DF$weight*0.03), list(DF$stripes,DF$manufacturers),
mean))
user system elapsed
0.294 0.128 0.422
Another way to accomplish a whole lot in one line is by ‘Chaining’. This method utilizes the fact that every DT[ ]
call returns the data or subset of data defined by the i
, j
and k
expressions. With this new subset you may again call new i
, j
and k
expressions but adding another [i,j,k=by ]
after the first DT[i,j,k=by]
.
The example below attempts to find the most expensive average cost poolball per manufacturer.
setkey(DT,number, stripes, manufacturers)
DT[,.(meancost = mean(price-weight*0.03)), by=.(stripes, manufacturers)][ ,
.(number, stripes,expensive = max(meancost)),by = manufacturers]
This returns an error
We attempted to get the number
of the ball that was most expensive but because we are looking for the number
variable in the second [ ]
it isn’t there. In the second [ ]
we are working with a subset of the original data.table, generated by the first [ ]
, which has excluded number
and only includes stripes
, manufactureres
and meancost
. We get the most expensive ball for each manufacturer/stripes below.
DT[,.(meancost = mean(price-weight*0.03)), by=.(manufacturers,stripes)][ ,
.(expensive = max(meancost)), by = manufacturers]
manufacturers expensive
1: Billiard Buddies 0.2835304
2: HardBalls 0.2870156
3: James 0.2833062
4: Smith 0.2868533
Data.table allows for some very intuitive indexing and subset calculations that take far less time than data frame on very large datasets. The approaches shown here will aid in flexible, easier to read programming that can be accomplished in fewer lines of code. The same calculations using data.frame will often take much longer. Functions like sorting and summarizing data are done much faster. Complex calculations of datasets are easily acquired and can often be done in one line with chaining or calling functions in the j-expressions. It’s cool, try it.
Vignettes included with the data.table package (see the list in output from help(package = "data.table")
StackOverflow has a very active discussion area for data.table: http://stackoverflow.com/questions/tagged/data.table
Data Analysis the Data.Table Way: The official Cheat Sheet for the DataCamp course cheat sheet
Advanced tips and tricks with data.table (by Andrew Brooks) advanced data.table
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] data.table_1.11.4 stationery_0.85
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 knitr_1.20 magrittr_1.5 kutils_1.46
[5] mnormt_1.5-5 pbivnorm_0.6.0 xtable_1.8-2 stringr_1.3.1
[9] plyr_1.8.4 tools_3.5.0 htmltools_0.3.6 yaml_2.1.19
[13] digest_0.6.15 rprojroot_1.3-2 lavaan_0.6-1 zip_1.0.0
[17] evaluate_0.10.1 rmarkdown_1.10 openxlsx_4.1.0 stringi_1.2.3
[21] compiler_3.5.0 backports_1.1.2 stats4_3.5.0 foreign_0.8-70
Available under Created Commons license 3.0