--- title: "Data Table Guide" subtitle: "Basic Usage and Advanced Topics" author: - name: Jeremy Burnison affiliation: CRMDA, KU email: jburnison@ku.edu - name: Paul Johnson affiliation: CRMDA, KU email: pauljohn@ku.edu advertise: > Please visit [http://crmda.ku.edu/guides](http://crmda.ku.edu/guides) keywords: guide, data.table, SQL abstract: > This is an introduction to the data.table package in R. data.table is a high performance replacement for R's built-in data.frame. There are significant differences in the syntax for usage. Proponents of data.table prefer its simpler notation and ease of use. While the style differences are noteworthy, for most who consider adopting data.table, the major advantage of data.table over the standard data.frame format is speed in processing. This makes data.table desirable for 'big data' analysis. While data.table is an alternative to data.frame, it is intended to be compatible with most functions that expect to receive a data.frame object. That is to say, data.frame-centric code will work with a data.table, but it will run faster! Included in this tutorial are examples of indexing and aggregating data variables using the data.table syntax. The first example dataset used is not large so that you can see what happens to the data with the syntax used. checked_by: "Paul Johnson" Note to Authors: please_dont_change_the_next 4 lines! date: "`r format(Sys.time(), '%b. %e %Y')`" output: stationery::crmda_html_document: toc: true toc_depth: 2 highlight: haddock theme: default css: theme/kutils.css template: theme/guide-boilerplate.html guide_number: 40 logoleft: theme/logoleft.png logoright: theme/logo-vert.png --- ```{r setup, include=FALSE} ##This Invisible Chunk is required in all CRMDA documents outdir <- paste0("tmpout") if (!file.exists(outdir)) dir.create(outdir, recursive = TRUE) knitr::opts_chunk$set(echo=TRUE, comment=NA, fig.path=paste0(outdir, "/p-")) opts.orig <- options() par.orig <- par(no.readonly=TRUE) options(width = 70) ``` ```{r themecopy, include = FALSE} library(stationery) ## If theme directory does not have required images or TeX files ## we need to retrieve them and put them in "theme" directory. logos <- c(logoleft = "logoleft.png", logoright = "logo-vert.png") getFiles(logos, pkg = "crmda") ``` ## Install and Load the Data.Table package in R The data.table package is available on CRAN, the default package repository. Installation can be accomplished with `install.packages("data.table")`. Load the data.table package into a session, just like any other package. The data.table package prints out a brief advertisement, as we will see. ```{r load_package} library(data.table) ``` For replication purposes, we will set the random seed as ```{r seed} set.seed(2345321) ``` The R session information is included the end of this guide. ## Syntax: Pooltable Example This section will introduce the syntax used to generate a new data.table, new variables, indexing the dataset, and creating summaries of subsets and the whole dataset. ### Generate a data.table Here we will create a dataset describing a set of billiard balls. The balls are numbered 1 through 15, the weight is generated by a random normal distribution, and the diameter is set by discrete values. Here we only define 3 values for the diameter but recycling will fill the data.table. By defining number and weight with 15 values we have set the length of the data.table. Defining another variable in instantiation of a data table with fewer than this 15 length will signal to data.table to recycle these numbers until 15 values are entered for that variable. You can see these three numbers repeated 5 times to fill up the diameter variable with 15 values. Try setting number to 1:16 or another range. The syntax to create a data.table is identical to the syntax that would be required for a data.frame, except that we replace `data.frame` with `data.table`. ```{r make_pb_dt} poolballs <- data.table( number = 1:15, weight = rnorm(15, 45.7, 0.8), diameter = c(3, 2.9, 3.1) #shows recyling ) poolballs ``` For comparison with the data.table version of the billiard ball data, we'll make a data.frame that has the same information. This is called `pooldf` for clarity. ```{r make_pb_df} set.seed(2345321) pooldf <- data.frame( number = 1:15, weight = rnorm(15, 45.7, 0.8), diameter = c(3, 2.9, 3.1) ) pooldf ``` We could achieve the same effect by using the function `as.data.frame` supplied by the data.table package. Note that the data.table object seems to have column names and row names in the same way that a data.frame would: ```{r colnam} colnames(poolballs) rownames(poolballs) ``` The data.table *seems to* have column and row names as usual, but they cannot be used in the ways we expect. This is surely one of the most interesting (and frustrating) differences between data.tables and data.frames. This difference, which we will illustrate with examples, is a major source of confusion among new data.table users. ### Data Table indexing Our poolballs object is a data.table. Using the syntax `poolballs[i, j, by = k]` we can define what rows -- `i` -- and columns -- `j` -- we want from our data. The third argument has many uses. Here, we have a third argument `by`, which signals that we want to group the results according to a variable `k`. There are many other keywords we can use in that third position, such as `mult` and `with`. #### Grabbing specific rows {.bs-callout .bs-callout-orange} With the data.table, one can ask for rows by row number. We can retrieve rows 1 through 3 in the expected way ```{r grab_rows1} poolballs[1:3,] ``` We can achieve the same effect by choosing rows by a logical condition. In this case, we want the rows for which the value of the variable `number` is less than 4. The following is equivalent. ```{r grab_rows2} poolballs[number < 4,] ``` Users who are familiar with data.frames will be surprised that the column name `number` does not need to be quoted. The data.frame equivalent request would be rather more complicated in appearance: ```{r} pooldf[pooldf$number < 4, ] ``` These examples indicate that the data.table's row-choosing square brackets allow either integer row indices or an expression to define a logical condition. The syntax for the data.frame alternative is either more clear or more tedious, depending on your taste. The fact the data.table `poolballs` is able to find the column `number` within itself is a significant syntactical difference. Choosing elements by rowname causes an error in a data.table. Recall that the rownames were retrieved above, they are "1" through "15", as characters. ```{r, eval=F} > poolballs["1", ] ``` ``` Error in `[.data.table`(poolballs, "1", ) : When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey. ``` #### Grabbing specific variables (columns) {.bs-callout .bs-callout-blue} One of the most notable differences between a data.table and a data.frame is that the column names must be used without quotation marks in a data.table. In a data.frame object named `pooldf` we would choose the column named "weight" by ```{r, eval=F} pooldf[ , "weight"] ``` which is equivalent to ```{r, eval=F} pooldf$weight ``` The data.table class will allow the following. Look, *no quotes*!: ```{r} poolballs[ , weight] ``` There is also a different style to extract a colletion of columns from a data.table object. In the case of a data.frame, we choose two columns by name in this way ```{r, eval=F} pooldf[ , c("weight", "diameter")] ``` while the data.table does not expect either a vector or quoted strings as arguments. The data.table expects a list. ```{r, eval=F} poolballs[ , list(weight, diameter)] ``` or the shorthand `.(` for the list. This syntax is adopted from the package dplyr by Hadley Wickham. ```{r} poolballs[ , .(weight, diameter)] ``` Grabing columns by numbered index wors as it would with data.frame, but only in recent version of data.table. ```{r grab_cols1} poolballs[, 1] ``` Data frames and data.tables allow access by column number, but there are good practical arguments against that kind of coding. Fewer unexpected results happen if we access the columns by their full name. Using variable names instead of column index numbers encourages flexible code development and reduces the number of user errors (i.e. when importing several datasets and one of the columns is not in the correct order a column name call will return the correct data while calling column 5 may return errors or include data from the incorrect variable). The first column is named "number" in this example. ```{r grab_cols2} poolballs[, number] ``` The columns in a data.table can be renamed with the `setnames` function from the data.table package. If `DT` is a data.table, we could change the names of the columns `V1`, `V2`, and `V3` as follows. ```{r, eval=F} setnames(DT,c("V1","V2","V3"), c("name1", "name2", "name3")) ``` Using data.frame DF changing columns names would required the following syntax ```{r, eval=F} colnames(DT) <- c("name1", "name2", "name3") ``` This leaves the programmer or code reader blind to the previous colnames that are being overwritten. The data.frame approach rewrites the entire data.frame with the new colnames to memory. This highlights the major efficiency with data.table and it's conservative use of memory. We'll explore the processing speed differences between these two structures later in this guide. ### Generating new variables within the data.table In an ordinary R data.frame, adding a variable causes some seemingly unnecessary re-calculations. Adding a column causes a complete rewrite of all of the data.frame's columns. It seems more efficient to leave the old columns where they were and simply add a new column. This seemingly obvious efficiency was one of the motivating ideas behind the data.table project in the first place. #### New assignment symbol ":=" {.bs-callout .bs-callout-red} The syntax of data.table allows the assignment of one new variable in `DT` with ```{r, eval=F} DT[ , Xnew := any_function(x)] ``` The symbol `:=` indicates we want to create a variable within `DT`. `any_function` represents any `R` function acting on a variable `x`, that already exists in `DT`. The new variable name `Xnew` must be a valid `R` column name. Here we put a comma inside the square brackets for clarity, `[ ,` , to remind ourselves that we are creating this new variable for all rows. Then define the variable name and use `:=` to define the variable's value based on a calculation of other `DT` variables. We can manufacture several variables at once using a list syntax that will be illustrated below. #### Example creating a new variable Here we will define a new categorical variable in `poolballs` data.table called `stripes`, which indicates if a certain ball is a `solid` or `stripe`. We assign the values with the operator `:=`. ```{r new_stripes, results='hide'} poolballs[ , stripes := ifelse(number%%2, 'solid', 'stripe')] head(poolballs) ``` We used the `R` base `ifelse` function that lets us check a logical condition and assign one of two values if our first argument is true or false. The argument we use is modulo(*%%*) 2 of the `number` variable. So every other ball will be set as a solid or strip as is typical in billiard ball design. It is possible to do any R computation We'll use the `head` function to look at just the first 5 rows. ```{r new_density, results ='hide' } poolballs[ , density := weight/(1.333*pi*diameter^3)] ``` ```{r} head(poolballs,5) ``` You can generate multiple new variables in one line by using `list()` or the shorthand `.()`. You may noticed how often shorthand is used along with data.table as those building and using this package tend to write code as breif as possible. Below we use `:=` to assign multiple variables in one line and add them to the dataset. ```{r, results = 'hide'} poolballs[ , `:=` (Id = sample(1:250, size=15), pickles = 'whatever')] ``` ```{r} head(poolballs,5) ``` Assigning new variables in data.frames requires the `$` and assignment object `<-`. i.e. `DF$NewVarName <- DF$weight/(1.333*pi*DF$diameter^3)` This isn't much more work than data.table but you don't have the convenience of the `by =` statement to create calculations based on a grouping variable. You may likely have to include a `for()` loop to get a similar result. See below for examples of the advantages of the `by =` statement. #### Using ":=" to removie variables from a data.table {.bs-callout .bs-callout-green} Unwanted or erroneous variables can be eliminated from the data by using `NULL`. We assign this value to the variable names in the **j** expression ```{r} poolballs[ , c("Id", "pickles") := NULL] ``` This line simply removes the data from data.table memory. Doing the equivalent with data.frame would re-write the entire object to memory, sans the deleted variables. Here again is an example of efficient operation that is going on behind the scenes with data.table. ### Sorting data `setkey` is a function that will arrange the data based on the values of your sorting variable. ```{r} setkey(poolballs, diameter) poolballs ``` The setkey also allows a *lexicographic* sort according to several varaibles. Here the data will first be sorted by `stripes` and then by `weight`. After using setkey the data.table printout reflects the sort. ```{r sort1} setkey(poolballs, stripes, weight) poolballs ``` #### Aggregating variables with "by =" {.bs-callout .bs-callout-gray} We start with a simple benefit of the data.table design. It is very easy to calculate the average weight of pool balls within subgroups. This works best if we set the key as `stripes` ```{r sort2} setkey(poolballs, stripes) ``` And then we will ask for the average weight of pool balls within the different values of stripes: ```{r rows_weight1} poolballs[ , mean(weight), by = stripes] ``` Note that the newly created data.table object has one row for solids and one row for stripes, while the newly created average column is simply labeled `V1`. We are allowed to name that column if we wrap the calculation in a list, as in ```{r rows_weight2} poolballs[ , list(weight.mean = mean(weight)), by = stripes] ``` The aggregation `by` stripes can be combined with a row-selection of poolballs. For example, we want the mean of balls that are numbered one through four, this will work: ```{r rows_weight4} poolballs[number < 5 , .(weight.mean = mean(weight)), by = stripes] ``` Again we have used the shorthand `.(` instead of `list()`. Many who have worked with R data frames have used functions like `tapply`, `by`, and `aggregate` to achieve the same purpose. All of these will summarize information about subgroups in a data set. The output from these things is slightly different, but it is mostly understandable. ```{r} pooldf$stripes <- ifelse(pooldf$number %% 2, 'solid', 'stripe') tapply(pooldf$weight, list(pooldf$stripes), mean) ``` or ```{r} aggregate(pooldf$weight, by = list(diameter = pooldf$stripes), mean) ``` The return objects are structured differently, but the numerical values are the same. It may be a little tricky if one wants to put the group means back "onto" the original data frame. It is easy with data.table, however, if we use `:=`. ```{r rows_weight10} poolballs[ , weight.mean := mean(weight), by = stripes] ``` From there, it is easy to calculate "deviations about the mean within groups". ```{r rowsweight11} poolballs[ , weight.dev := weight - weight.mean] ``` ```{r} head(poolballs) ``` ### More on using `i`, `j` - expressions and `by = k` Here is more detail on the data.table syntax. There are special symbolic variables that are used to simplify calculations. 1. `.N` is a symbol referring to the number of rows within each value of the grouping variable 2. `.SD` is a symbol used to refer to all of the columns. We will explore these in this section. #### Accessing blocks of cases by row index {.bs-callout .bs-callout-orange} You may want various statistics or calculations of your data but only of certain subsets. You can use an `i` expression to define rows you want. In addition, a grouping variable `by` can combine rows. If you have several different groups of rows you'd like to do calculations on, using a `j` expression is useful. These useages generally create new data tables, including, for example, the average values of the variables for each subgroup. In these cases, we can include row selection in the `i-expression`, column selection with the `.(j-expression)`, as well as a `k-expression` that represents a grouping variable. #### Example with stipes and solids {.bs-callout .bs-callout-blue} Here is an example of finding the total weight of stripes and solid balls. We will define the total weight variable as `tot_weight` but notice this just defines the var in the temporary output and doesn't define a new permanent variable in `poolballs`. For defining a permanent addition to `poolballs` we'd use `:=`. ```{r total_weight} poolballs[ , .(weight_tot = sum(weight)), by = stripes] ``` What about the average weight of a solid or stripe ball? We can use a handy trick: the `.N` value, which returns the number of rows in each grouping. ```{r mean_calc} poolballs[ , .(weight_avg = sum(weight)/.N), by = stripes] ``` Verify that is the same as R's built in mean function that was used above. By putting multiple variables, `stripes` and `diameter` in the `by` statement, you create groups that match up in every combination of unique `stripes` and `diameter` values. ```{r mean_weight} poolballs[ , .(weight.mean = mean(weight)), by = .(stripes,diameter)] ``` The `setkey` function makes some short hand possible for row extraction. Suppose we set the sorting key as `stripes`. ```{r value_index1} setkey(poolballs, stripes) ``` After that, then when we specify a value in the `i` argument, then that value is matched against the key variable. Note it is not necessary to specify that the value "solid" is going to be found in the variable named `stripes` in the following. ```{r value_index2} poolballs['solid', ] ``` The output includes all of the balls. The data.table package includes a keyword that can be used for the `k` argument, `mult`. That is set as "all" by default, but it can be set as "first" or "last" as well: ```{r value_index3} poolballs['solid', mult = 'first'] ``` ```{r value_index4} poolballs['solid', mult = 'last'] ``` ### Calculations for the entire data set: .SD There may be instances where you'd like to compute something using an entire subset of the data.table instead of just one or two variables. When using `by = ` the `.SD` reference will supply all variables besides the one in the `by` statement. Below we return the variance for all columns of poolballs by stripes grouping. This is handy for passing the data to a function. ```{r} poolballs[, lapply(.SD, var), by = stripes] ``` You may want this calculation for all but a few variables in a dataset. You can define the scope of `.SD` by using `.SDcols` after the `by` statement. Then you may have the option of using the whole dataset but could allow a user to define which columns to include. ```{r} poolballs[ , lapply(.SD, var), by = stripes, .SDcols = c("weight", "diameter")] ``` ## Speed DF vs. DT {.tabset .tabset-fade} This section is partially adapted from the data.table vignette that is available with the package. Run `vignette("datatable-intro-vignette")`. #### Setting up a Huge data example {.bs-callout .bs-callout-red} First we will create some *hyyuuuuge* datasets by repeating the `poolball` dataset along with some new variables: `manufacturers` and `price`. The original data and `manufacturers` will be recycled until we've repeated enough poolballs sets to meet the number of prices we've just generated, **15e6** ```{r, echo=TRUE, include = TRUE, results='hide', eval=TRUE} DF <- data.frame(rep(poolballs), manufacturers=c('James', 'Smith', 'HardBalls', 'Billiard Buddies'), price = round(rnorm(15e6, 1.65, 2.06), 2)) DT <- data.table(DF) ``` After copying our `DF` into a data.table, `DT`, we can now compare the timing of some operations with these two identical sets of data. Defining your dataset as a data.table or data.frame should take equivalent time. ### Selecting a subset For a data.table we first need to use `setkey()` to sort or order the data. This takes a little time but will be well worth the speed increase in subsequent calculations. Below data will first be sorted by `manufacturers` and then sorted by `price` second. For these sorted columns, we can pass a list of values that to get the rows we want. ```{r subset_DT} system.time(setkey(DT, manufacturers, price)) #initial sorting needed for DT ``` ```{r} system.time(ans1 <- DT[.("Smith",1.69)]) ``` This subset assignment happens almost instantly. See how much slower `DF` would be doing the same thing. A few subset selections and it would take almost as long as doing the multi variable sorting. ```{r subset_DF} system.time(ans2 <- DF[DF$manufacturers == c("Smith") & DF$price==1.69,]) ``` Note the differences in elapsed time. Let's check that we got the identical rows both ways (`tail` shows us the last rows, here I've specified the last *3*) ```{r} tail(ans1,3) tail(ans2,3) ``` ### Computing a summary of data Here we find the average price of a poolball by manufacturer. Using *data.table* we get `ans3` pretty quickly ```{r dt_ballprice} system.time(ans3 <- DT[,.(manu_mean = mean(price)), by = manufacturers]) ans3 ``` Using the *data.frame* is much more cumbersome so we will just compute the mean for two manufacturers. Again, note the difference in elapsed time. ```{r man_mean_DF} system.time(Smithmean <- mean(DF[DF$manufacturers == "Smith",]$price), Jamesmean <- mean(DF[DF$manufacturers == "James",]$price)) ``` Now we can just check they are the same as above. ```{r, results = 'hide'} Jamesmean Smithmean ``` Using aggregate we can get the same results from one line with the DF but the results come at a big computing cost. ```{r, results = 'hold'} system.time(ans4 <- aggregate(DF$price, by = list(Manu_mean = DF$manufacturers), mean)) ans4 ``` Using `tapply` with Data.Frame's is much more efficient. ```{r, results='hold'} system.time(ans5 <- tapply(DF$price, list(Manu_mean = DF$manufacturers), mean)) ans5 ``` ### Temporary values You can execute multiple calculations in one j-expression using `{ }`. Here we can even create a temporary variable, `cost`, and compute another value using the temporary variable. Here we find the profit by manufacturer and stripes, assuming the cost is 3 cents per gram weight. Notice that subsequent calculations should be separated with a `;` not a `,`. This approach will make more complex calculations easier to read and take much less time. ```{r} system.time(DT[ , {cost = weight*0.03; profit = mean(price-cost)}, by = .(stripes, manufacturers)]) ``` ```{r, echo = FALSE} DT[ , {cost = weight*0.03; mean(price-cost)}, by = .(stripes, manufacturers)] ``` This specific example can be be done in one line with *data.frame*, using `tapply`. If you were making multiple processing steps, using multiple functions tapply may not work or you'd have to start imbedding them and it would get messy. But for this example, if you're a fan of data.frames, `$` and `tapply` then you can do what data.table does too. ```{r} system.time( tapply((DF$price-DF$weight*0.03), list(DF$stripes,DF$manufacturers), mean)) ``` ## Chaining Another way to accomplish a whole lot in one line is by 'Chaining'. This method utilizes the fact that every `DT[ ]` call returns the data or subset of data defined by the `i`, `j` and `k` expressions. With this new subset you may again call new `i`, `j` and `k` expressions but adding another `[i,j,k=by ]` after the first `DT[i,j,k=by]`. The example below attempts to find the most expensive average cost poolball per manufacturer. ```{r chaining} setkey(DT,number, stripes, manufacturers) ``` ```{r chaining2, eval = FALSE} DT[,.(meancost = mean(price-weight*0.03)), by=.(stripes, manufacturers)][ , .(number, stripes,expensive = max(meancost)),by = manufacturers] ``` ***This returns an error*** We attempted to get the `number` of the ball that was most expensive but because we are looking for the `number` variable in the second `[ ]` it isn't there. In the second `[ ]` we are working with a subset of the original data.table, generated by the first `[ ]`, which has excluded `number` and only includes `stripes`, `manufactureres` and `meancost`. We get the most expensive ball for each manufacturer/stripes below. ```{r} DT[,.(meancost = mean(price-weight*0.03)), by=.(manufacturers,stripes)][ , .(expensive = max(meancost)), by = manufacturers] ``` ## Conclusions Data.table allows for some very intuitive indexing and subset calculations that take far less time than data frame on very large datasets. The approaches shown here will aid in flexible, easier to read programming that can be accomplished in fewer lines of code. The same calculations using data.frame will often take much longer. Functions like sorting and summarizing data are done much faster. Complex calculations of datasets are easily acquired and can often be done in one line with chaining or calling functions in the j-expressions. It's cool, try it. ## Additional Resources * Vignettes included with the data.table package (see the list in output from `help(package = "data.table")` * StackOverflow has a very active discussion area for data.table: http://stackoverflow.com/questions/tagged/data.table * Data Analysis the Data.Table Way: The official Cheat Sheet for the DataCamp course [cheat sheet](https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+sheet.pdf) * Advanced tips and tricks with data.table (by Andrew Brooks) [advanced data.table](http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/) [//]: (All guides must have this as the final stanza) ## R Session Info ```{r sessionInfo, echo = FALSE} sessionInfo() ``` ```{r warnings, echo = FALSE} if(!is.null(warnings())){ print("Warnings:") warnings() } ``` ```{r RoptionsRestore, echo=FALSE, include=FALSE} ## Don't delete this. It puts the interactive session options ## back the way they were. If this is compiled within a session ## it is vital to do this. options(opts.orig) par(par.orig) ``` Available under [Created Commons license 3.0 CC BY](http://creativecommons.org/licenses/by/3.0/)