Zack Roman, CRMDA <zroman@ku.edu>
Ben Kite, CRMDA, H&R Block <bakite@ku.edu>
Chong Xing, CRMDA <cxing@ku.edu>
Paul Johnson, Director, CRMDA <crmda@ku.edu>
For CRMDA Guide documents and updates,
please visit http://crmda.ku.edu/guides-index
Keywords:
Data Frames, Merging, SQL, Left/Right Join
2018 February 01
Abstract
This guide provides an overview of data merging procedures in R. It highlights the uses of three methods for combining data sets: the merge()
function in base R, the join()
function in plyr
package, and the merging commands in data.table
package. Further, it highlights some common problems associated with each method and demonstrates how to avoid them. Examples of data merges in the widely popular structured query language (SQL) framework, and the associated terminology, are provided first and then the equivalent R syntax is presented.
merge()
functionmerge()
from Sorting the Outputkutils
mergeCheck()
Functionmerge()
Summaryplyr
Package for Rdata.table
Package for RThe term merge means to combine two data sets by aligning information about individual rows in each one. For example, in a study of police brutality, one team had public opinion about the attitudes toward police in 33 American cities (Sharp and Johnson (2009)). They had a separate data set about police administration in those cities. It was necessary to merge the information about cities “onto” the information about individuals to find out, for example, if cities with African-American police chiefs experienced an improvement in police evaluations among non-white residents of those cities. This is a common requirement in “multilevel statistical analysis”. Another common analysis scenario that may require combining two, or more than two, data sets is longitudinal research - same individuals have been surveyed at different time waves and the survey responses were stored separately by waves. Usually, there is a unique ID variable for each survey respondent in all the data sets and it can be used as the key variable to link information collected at different time waves about a same person.
Statistical programs like Stata, SAS, SPSS, and R developed their own terminology for matching and aligning different data sets. The methods of combining 2 data sets developed in small, isolated, language-specific bubbles which, in retrospect, ignored the rapidly developing general purpose database language known now as SQL.
From the 1970s onward, the IBM corporation was developing its own general purpose language for data restructuring called SQL, Structured Query Language (see Codd, 1970). SQL has been widely used in database construction and is now considered to be a staple in the construction of data-backed Websites that absorb and redistribute information (e.g., see Oracle’s SQL implementation).
To the surprise of many database enthusiasts, R’s base function merge()
does not use SQL terminology. That surely creates frustration for them, since the terminology for aligning data sets must be completely re-learned. However, as mentioned (somewhat tersely) in the help page for R’s merge()
, it is possible to create a translation table between SQL terminology and the approach used in R.
In SQL database terminology, the default value of ‘all = FALSE’ gives a natural join, a special case of an inner join. Specifying ‘all.x = TRUE’ gives a left (outer) join, ‘all.y = TRUE’ a right (outer) join, and both (‘all = TRUE’ a (full) outer join. DBMSes do not match ‘NULL’ records, equivalent to ‘incomparables = NA’ in R. (R Core Team 2017).
We don’t expect that paragraph is understandable to most readers of this guide. And that is the problem we are trying to fix by preparing this document.
nocite: ‘R Core Team (2017)’
The merge()
function is in the base of R (R Core Team 2017), but there are also user-contributed packages for R that more explicitly use the terminology of SQL. The packages that we emphasize in this note are plyr
and data.table
. We believe it is safe to say that, without a basic understanding of SQL terminology, users will have a very difficult time making use of those packages.
Join
means to combine two data sets. In SQL, the custom is to refer to the two data sets as “left” and “right”. Think of the one on the left as the “target” and information from the “right” is extracted and paired together with the rows of the one on the left.
A key is a variable that is used to align the separate information of the two data sets being combined. The key variable is a common variable between the 2 data frames being joined/merged, typically it is an “ID” variable or “Name”.
SQL Join notation has set a standard language for merging in the data science community. The following is a general outline of the SQL syntax. In this form, the first table is the “left” data frame, and the second is the “right” data frame. When joining/merging, the user can specify qualities of the merge that treat the right and left data frame differently. The proceeding examples highlight this.
SELECT "column_name(s)"
FROM "table1 "
JOIN "table2 "
ON "table1.key1=table2.key2;"
1.) SELECT
names the variables to be retained
2.) table1
is known as the left table, table2
is the right table.
3.) JOIN
is a keyword that specifies the type of merge. The types that we discuss in this note are INNER JOIN
, OUTER JOIN
(AKA FULL JOIN
) and LEFT JOIN
.
4.) ON
specifies the key variables, the ones that are to be aligned. In this case, table1.key1=table2.key2;
means that in table1
, there will be a variable “key1” and that is used to align the cases as specified in table2’s variable “key2”. They might be the name of a city or a person’s identification number.
Note: The .
appended in SQL is equivalent to the $
operator in R.
The merge()
function in base R combines 2 data frames, which it refers to as x
and y
. If we were using SQL terminology, we would say x
is the left
set and y
is the right
set. R uses the name by
to refer to the key variables.
The merge()
function receives the following arguments.
merge(x, y, by, by.x, by.y, incomparables, sort, all.x, all.y)
x
would be called the left
data set in SQL
y
the right
data set
by, by.x, by.y
provide the key variables. Specify different key variables with by.x
and by.y
, or if the key variables have the same name, then simply by
should be used.
incomparables
specifies which values of the key should never be matched. It is a good habit to specify missing values and other values that should not be logically equated. Almost always, we need a laundry list to prevent accidental matches, such as incomparables = c(NA, NULL, Inf)
.
sort
: unless this is FALSE
, R will re-order the rows of the result. The sorting process sometimes seems surprising to users, who would rather have the rows remain in the ordering specified by the x
data set. We suggest most users should set sort = FALSE
, and will illustrate the problem below.
all, all.x, all.y
arguments ask for either, all entries to appear in the output set, or all X
or Y
entries. By default, all = FALSE
so only matching entries from both sides will appear in the output data set. If a variable has a missing entry it will be filled with NA’s
## Data set x
authors
surname nationality deceased
1 Tukey US yes
2 Venables Australia no
3 Tierney US no
4 Ripley UK no
5 McNeil Australia no
## Data set y
books
name title other_author
1 Tukey Exploratory Data Analysis <NA>
2 Venables Modern Applied Statistics Ripley
3 Tierney LISP-STAT <NA>
4 Ripley Spatial Statistics <NA>
5 Ripley Stochastic Simulation <NA>
6 McNeil Interactive Data Analysis <NA>
7 R Core An Introduction to R Venables & Smith
In SQL terminology, an “inner join” is a merge that only keeps information about rows that are matched in the keys of both data sets.
In R’s merge()
function, the parameters all, all.x, and all.y determine whether a merge will perform like an inner join, an outer join, or a left join. The merge()
equivalent of an inner join is obtained if all=FALSE
, which is the default in merge. all=FALSE
implies all.x=FALSE
and all.y=FALSE
.
The following merge will produce a data set in which only cases will appear that have matches from both the x (authors) and y (books) data sets. Notice that the ‘R core team’ is excluded because it does not appear in both data sets.
Note: the x and y data set had different names for the key variable. When this occurs the output data set will have the x variable name.
merg.df.1 <- merge(x = authors, y = books, by.x = "surname", by.y = "name",
all = FALSE)
merg.df.1
surname nationality deceased title other_author
1 McNeil Australia no Interactive Data Analysis <NA>
2 Ripley UK no Spatial Statistics <NA>
3 Ripley UK no Stochastic Simulation <NA>
4 Tierney US no LISP-STAT <NA>
5 Tukey US yes Exploratory Data Analysis <NA>
6 Venables Australia no Modern Applied Statistics Ripley
Note: A “table” in SQL is a “data frame” in R terminology.
SELECT authors.surname, authors.nationality, authors.deceased,
books.name, books.title, books.other_author
FROM authors
INNER JOIN books
ON authors.surname = books.name;
Note: Always end SQL statements with a semi-colon, ;
.
Answer: Not very much! Because an inner join only keeps the items that match exactly in each data set, it does not matter very much in R whether we think of one data set as x
or y
.
This is another inner join, however the x and y variables are switched, the output data set is the same. The most important difference in the result is that the rows and columns come out in different orders.
Note in the output that the R core team is not included because it did not appear in both data sets.
merg.df.2 <- merge(x = books, y = authors, by.x = "name", by.y = "surname")
merg.df.2
name title other_author nationality deceased
1 McNeil Interactive Data Analysis <NA> Australia no
2 Ripley Spatial Statistics <NA> UK no
3 Ripley Stochastic Simulation <NA> UK no
4 Tierney LISP-STAT <NA> US no
5 Tukey Exploratory Data Analysis <NA> US yes
6 Venables Modern Applied Statistics Ripley Australia no
An outer join keeps all of the records from either the data sets. Empty cells (missing values) are created for the unmatched cases.
The following code uses the all
argument with the value TRUE
. This instructs the merge to keep all of the rows, whether there are matches in the other data set.
Note: If a value does not have an associated entry missing values will appear as “NA”. This is exemplified by the R core team row which has “NA” for “nationality” and “deceased”
merg.df.3 <- merge(x = authors, y = books, by.x = "surname", by.y = "name",
all = TRUE)
merg.df.3
surname nationality deceased title other_author
1 McNeil Australia no Interactive Data Analysis <NA>
2 R Core <NA> <NA> An Introduction to R Venables & Smith
3 Ripley UK no Spatial Statistics <NA>
4 Ripley UK no Stochastic Simulation <NA>
5 Tierney US no LISP-STAT <NA>
6 Tukey US yes Exploratory Data Analysis <NA>
7 Venables Australia no Modern Applied Statistics Ripley
SELECT authors.surname, authors.nationality, authors.deceased,
books.name, books.title, books.other_author
FROM books
FULL JOIN authors
ON books.name = authors.surname ;
Left Join will include all entries in the left data set, which we are calling x
. Rows from y
are selected only if the key value matches the key for the x
data set. This drops unmatched entries from y
, in other words.
The left join in R’s merge()
function is obtained by setting all.x = TRUE
and all.y = FALSE
. Because all.y = FALSE
is the default, it is not necessary to insert that in the function call, as we show here.
The R core team is not included in the output data set because it is only found in the y
(right) data set.
merg.df.4 <- merge(x = authors, y = books, by.x = "surname", by.y = "name",
all.x = TRUE)
merg.df.4
surname nationality deceased title other_author
1 McNeil Australia no Interactive Data Analysis <NA>
2 Ripley UK no Spatial Statistics <NA>
3 Ripley UK no Stochastic Simulation <NA>
4 Tierney US no LISP-STAT <NA>
5 Tukey US yes Exploratory Data Analysis <NA>
6 Venables Australia no Modern Applied Statistics Ripley
SELECT authors.surname, authors.nationality, authors.deceased,
books.name, books.title, books.other_author
FROM authors
LEFT JOIN books
ON authors.surname = books.name;
In SQL, the right merge is not usually discussed because one can simply swap the left and right sets to obtain an equivalent merge. However, in R is possible to explicitly conduct a right merge, one which keeps all of the rows from y
and attaches them to the rows from x
which match on the key variable.
To achieve a right join with merge()
, we set all.y = TRUE
and leave the default all.x = FALSE
unchanged.
merg.df.5 <- merge(x = authors, y = books, by.x = "surname", by.y = "name",
all.y = TRUE)
merg.df.5
surname nationality deceased title other_author
1 McNeil Australia no Interactive Data Analysis <NA>
2 R Core <NA> <NA> An Introduction to R Venables & Smith
3 Ripley UK no Spatial Statistics <NA>
4 Ripley UK no Stochastic Simulation <NA>
5 Tierney US no LISP-STAT <NA>
6 Tukey US yes Exploratory Data Analysis <NA>
7 Venables Australia no Modern Applied Statistics Ripley
SELECT authors.surname, authors.nationality, authors.deceased,
books.name, books.title, books.other_author
FROM authors
RIGHT JOIN books
ON authors.surname = books.name;
merge()
functionmerge()
We create two small example data frames with missing values on the key variables, which are named k1
and k2
.
x
k1 k2 ID
1 NA 1 1
2 NA NA 2
3 3 NA 3
4 4 4 4
5 5 5 5
6 6 6 6
y
k1 k2 ID
1 NA NA 1
2 2 NA 2
3 NA 3 3
4 4 4 4
5 5 5 5
6 6 6 6
Notice that the merge treats the NAs in x as if they are matches for the NAs in y. Almost surely, this is an error for most real life data analysis projects.
merg.df.6 <- merge(x = x, y = y, by = c("k1", "k2"))
## Loose your job
merg.df.6
k1 k2 ID.x ID.y
1 4 4 4 4
2 5 5 5 5
3 6 6 6 6
4 NA NA 2 1
To resolve this problem, the ‘incomparables’ argument should specify the values which are not intended for comparison. For instance, we might simply insert NA as an incomparable value.
merg.df.7 <- merge(x = x, y = y, by = c("k2"), incomparables = NA)
The result is considerably more satisfactory
merg.df.7
k2 k1.x ID.x k1.y ID.y
1 4 4 4 4 4
2 5 5 5 5 5
3 6 6 6 6 6
In many projects, we would protect ourselves by including a laundry list of legal R values that should not be compared in the keys. We would almost certainly have incomparables = c(NA, NaN, Inf, "\\s*")
to prevent matches for missing values, the “not a number” indicator, infinity, and blank spaces.
merge()
from Sorting the OutputThe merge()
function will, by default, sort the output data according to the values of the key variables. Because R treats the keys as if they were factors, the sorting of the values may have unintended consequences of “shuffling” the rows. Many users would rather leave the rows of the left hand data set x
as they were, and then when new rows are inserted to accomodate y
, they should be placed at the end of the result.
Here we have an example of how the default sorting approach creates a row-ordering that is, to say the least, unexpected.
Recall that R’s data.frame
function by default converts character variables into R factor variables.
Notice the ordering of ID variable in the merged result. It counts up from \(1, 10, 11, \ldots, 19, 2, 20, 21, \ldots\). Again, almost surely, this is unwanted.
d_merged <- merge(x = d, y = d2, by = c("ID", "Company"),
sort = TRUE, all.x = TRUE)
d_merged
ID Company profit Region Cost
1 1 Company_C 94.60664 Midwest 1107.8273
2 10 Company_B 84.36037 Southeast 729.8151
3 11 Company_D 109.48867 Southeast 1081.8502
4 12 Company_D 94.09264 Deep_South 1047.0861
5 13 Company_D 111.10180 East_Coast 1037.2872
6 14 Company_C 102.15098 Deep_South 904.5234
7 15 Company_B 74.23635 Deep_South 1054.9895
8 16 Company_A 85.56089 Southeast 1021.9197
9 17 Company_A 124.56038 Southeast 995.9387
10 18 Company_B 101.64704 Midwest 1027.2783
11 19 Company_D 97.03944 Southeast 1025.3005
12 2 Company_A 105.93583 East_Coast 1076.0270
13 20 Company_D 111.42333 Southeast 556.2863
14 21 Company_B 94.07712 Midwest 1329.6121
15 22 Company_D 99.93876 West_Coast 961.9473
16 23 Company_D 98.13989 West_Coast 1190.9509
17 24 Company_B 84.79221 Deep_South 1165.2390
18 25 Company_C 78.36247 Southeast 839.2178
19 26 Company_D 104.79393 West_Coast 920.7129
20 27 Company_B 99.20409 Deep_South 1025.8085
21 28 Company_B 69.42068 West_Coast 710.7899
22 29 Company_D 82.57393 West_Coast 1128.0650
23 3 Company_A 112.69700 Midwest 863.1037
24 30 Company_B 103.57594 East_Coast 1099.6825
25 4 Company_C 94.65275 East_Coast 1024.3458
26 5 Company_B 106.85077 East_Coast 973.2696
27 6 Company_B 124.31956 West_Coast 980.5545
28 7 Company_C 108.34467 East_Coast 980.2809
29 8 Company_D 98.15311 West_Coast 1033.9503
30 9 Company_A 95.82280 East_Coast 1086.2214
The solution for this is quite simple. Don’t allow merge()
to sort the rows of the merged set.
d_merged <- merge(x = d, y = d2, by = c("ID", "Company"),
sort = FALSE, all.x = TRUE)
d_merged
ID Company profit Region Cost
1 1 Company_C 94.60664 Midwest 1107.8273
2 2 Company_A 105.93583 East_Coast 1076.0270
3 3 Company_A 112.69700 Midwest 863.1037
4 4 Company_C 94.65275 East_Coast 1024.3458
5 5 Company_B 106.85077 East_Coast 973.2696
6 6 Company_B 124.31956 West_Coast 980.5545
7 7 Company_C 108.34467 East_Coast 980.2809
8 8 Company_D 98.15311 West_Coast 1033.9503
9 9 Company_A 95.82280 East_Coast 1086.2214
10 10 Company_B 84.36037 Southeast 729.8151
11 11 Company_D 109.48867 Southeast 1081.8502
12 12 Company_D 94.09264 Deep_South 1047.0861
13 13 Company_D 111.10180 East_Coast 1037.2872
14 14 Company_C 102.15098 Deep_South 904.5234
15 15 Company_B 74.23635 Deep_South 1054.9895
16 16 Company_A 85.56089 Southeast 1021.9197
17 17 Company_A 124.56038 Southeast 995.9387
18 18 Company_B 101.64704 Midwest 1027.2783
19 19 Company_D 97.03944 Southeast 1025.3005
20 20 Company_D 111.42333 Southeast 556.2863
21 21 Company_B 94.07712 Midwest 1329.6121
22 22 Company_D 99.93876 West_Coast 961.9473
23 23 Company_D 98.13989 West_Coast 1190.9509
24 24 Company_B 84.79221 Deep_South 1165.2390
25 25 Company_C 78.36247 Southeast 839.2178
26 26 Company_D 104.79393 West_Coast 920.7129
27 27 Company_B 99.20409 Deep_South 1025.8085
28 28 Company_B 69.42068 West_Coast 710.7899
29 29 Company_D 82.57393 West_Coast 1128.0650
30 30 Company_B 103.57594 East_Coast 1099.6825
Or, alternatively, more carefully police the values of the key so that when the values are sorted, they will be placed into a correct order. (There is a function in the rockchalk
package named padw0()
that would change the values of ID to \(01, 02, \ldots, 30\) so that the factor variables would have the corrected sort.
d$ID <- rockchalk::padW0(d$ID)
d2$ID <- rockchalk::padW0(d$ID)
d_merged_pad <- merge(x = d, y = d2, by = c("ID", "Company"),
sort = TRUE, all.x = TRUE)
d_merged_pad
ID Company profit Region Cost
1 01 Company_C 94.60664 Midwest 1107.8273
2 02 Company_A 105.93583 East_Coast 1076.0270
3 03 Company_A 112.69700 Midwest 863.1037
4 04 Company_C 94.65275 East_Coast 1024.3458
5 05 Company_B 106.85077 East_Coast 973.2696
6 06 Company_B 124.31956 West_Coast 980.5545
7 07 Company_C 108.34467 East_Coast 980.2809
8 08 Company_D 98.15311 West_Coast 1033.9503
9 09 Company_A 95.82280 East_Coast 1086.2214
10 10 Company_B 84.36037 Southeast 729.8151
11 11 Company_D 109.48867 Southeast 1081.8502
12 12 Company_D 94.09264 Deep_South 1047.0861
13 13 Company_D 111.10180 East_Coast 1037.2872
14 14 Company_C 102.15098 Deep_South 904.5234
15 15 Company_B 74.23635 Deep_South 1054.9895
16 16 Company_A 85.56089 Southeast 1021.9197
17 17 Company_A 124.56038 Southeast 995.9387
18 18 Company_B 101.64704 Midwest 1027.2783
19 19 Company_D 97.03944 Southeast 1025.3005
20 20 Company_D 111.42333 Southeast 556.2863
21 21 Company_B 94.07712 Midwest 1329.6121
22 22 Company_D 99.93876 West_Coast 961.9473
23 23 Company_D 98.13989 West_Coast 1190.9509
24 24 Company_B 84.79221 Deep_South 1165.2390
25 25 Company_C 78.36247 Southeast 839.2178
26 26 Company_D 104.79393 West_Coast 920.7129
27 27 Company_B 99.20409 Deep_South 1025.8085
28 28 Company_B 69.42068 West_Coast 710.7899
29 29 Company_D 82.57393 West_Coast 1128.0650
30 30 Company_B 103.57594 East_Coast 1099.6825
This is sometimes referred to as an “m:1” merge. This kind of case arises when x
is, for example, surveys for children, some of whom are from the same family, and y
is data about the families themselves. There are several observations for each unit in the left data set and these appear as “repeated” key values. In the right data set we want to find one that can be associated with the “m” values of the key variable The information about the family from the y
data set must be copied into each of the rows in x
where the family appears.
For example:
d_long
ID Region
1 1 Southeast
2 1 East_Coast
3 1 Deep_South
4 1 Southeast
5 2 West_Coast
6 2 West_Coast
7 2 East_Coast
8 2 West_Coast
9 3 Deep_South
10 3 East_Coast
11 3 Southeast
12 3 West_Coast
13 4 West_Coast
14 4 Midwest
15 4 Deep_South
16 4 Deep_South
d_wide
ID Industry
1 1 Hot_Dogs
2 2 Candy
3 3 YoYo's
4 4 Toilet_Paper
The merge result accurately fills columns, putting together the repeated values of “Region” with identical matching values of “Industry”.
d_merged <- merge(d_long, d_wide, by.x = "ID", by.y = "ID", all.y = TRUE)
d_merged
ID Region Industry
1 1 Southeast Hot_Dogs
2 1 East_Coast Hot_Dogs
3 1 Deep_South Hot_Dogs
4 1 Southeast Hot_Dogs
5 2 West_Coast Candy
6 2 West_Coast Candy
7 2 East_Coast Candy
8 2 West_Coast Candy
9 3 Deep_South YoYo's
10 3 East_Coast YoYo's
11 3 Southeast YoYo's
12 3 West_Coast YoYo's
13 4 West_Coast Toilet_Paper
14 4 Midwest Toilet_Paper
15 4 Deep_South Toilet_Paper
16 4 Deep_South Toilet_Paper
kutils
mergeCheck()
FunctionSometimes it is tedious to check data frames for illegal or unmatched key values. In the newest version of the kutils
package for R (1.34), we have introduced a new function to provide quick diagnostics on the keys in a proposed merge. The diagnostic is called mergeCheck()
. Consider the following examples taken directly from the mergeCheck()
help documentation (example(mergeCheck)
):
df1
id x
1 1 0.1183406
2 2 -0.1933020
3 3 -1.3085219
4 4 -0.2021461
5 5 0.8784685
6 6 -1.8701397
7 7 0.4722367
df2
id x
1 2 1.649934072
2 3 -0.213928867
3 4 0.263915672
4 5 0.721672513
5 6 0.614073547
6 9 0.825083943
7 10 0.000701853
Notice our two data frames have non matching key values.
mergeCheck(x = df1, y = df2, by = "id")
Merge difficulties detected
Unmatched cases from df1 and df2 :
df1
id x
1 1 0.1183406
7 7 0.4722367
df2
id x
6 9 0.825083943
7 10 0.000701853
mergeCheck()
alerted us to the specific non matching ID numbers. Let’s consider a case with duplicate key values:
df1
id x
1 1 -0.2013404
2 2 1.3976885
3 3 -0.6415373
4 <NA> -0.3933596
5 NaN -0.5498042
6 0.8819304
7 0.8919925
df2
id x
1 2 -0.77630334
2 3 -0.91381163
3 4 2.81056380
4 5 0.80810218
5 6 -0.03209555
6 5 -0.03849050
7 6 1.11230932
Notice the duplicate key values in a single dataframe.
mergeCheck(x = df1, y = df2, by = "id")
Merge difficulties detected
Unacceptable key values
df1
id x
4 <NA> -0.3933596
5 NaN -0.5498042
6 0.8819304
Duplicated key values
df2
id x
4 5 0.80810218
5 6 -0.03209555
6 5 -0.03849050
7 6 1.11230932
Unmatched cases from df1 and df2 :
df1
id x
1 1 -0.2013404
4 <NA> -0.3933596
5 NaN -0.5498042
6 0.8819304
7 0.8919925
df2
id x
3 4 2.81056380
4 5 0.80810218
5 6 -0.03209555
6 5 -0.03849050
7 6 1.11230932
mergeCheck
alerts us to three potential problems here: A missing key value, duplicate key values, and unmatched cases.
The most convenient element of mergeCheck()
is that it maintains the same arguments as merge()
once you pass the mergeCheck()
test, you can simply erase the “Check” portion of the function and safely perform the merge. Let’s take a look at a succsesful merge check:
df1 <- data.frame(id = c(1:10), x = rnorm(10))
df2 <- data.frame(id = c(1:10), x = rnorm(10))
mergeCheck(x = df1, y = df2, by = "id")
Merge difficulties detected
Unmatched cases from df1 and df2 :
df1
[1] id x
<0 rows> (or 0-length row.names)
df2
[1] id x
<0 rows> (or 0-length row.names)
Notice mergeCheck()
has nothing to point out as problematic.
It is important to note that in many situations mergecheck()
may alert us to potential non-matching cases, but don’t fear! If we are conducting a left or right join, we should know well in advance one of our data frames will contain many unmatched IDs. Similarly in an inner join, it is expected we will encounter non-matching ID’s. We should be most concerned with missing key values, as well as duplicate key values.
merge()
SummaryINNER JOIN is equivalent to basic merge(all.x=FALSE, all.y = FALSE)
. Rows without matches are dropped from the resulting data set
LEFT JOIN is equivalent to merge()
with all.x = T
. Rows in the Right (y) data table are dropped if not paired
RIGHT JOIN is equivalent to merge()
with all.y = T
. Rows in the left (x) data table are dropped if not paired
Full JOIN is equivalent to merge()
with all = T
. All rows are incorporated and empty cells are replaced with NULL (NA)
plyr
Package for RThe plyr
package (Wickham (2011)), offers a handful of data manipulation tools that are part of a sequence of offerings that is now known as the “tidyverse”.
The join()
function in plyr
handles left, right, and full merges with results similar to the R base merge()
, but with some subtle conveniences.
The join
function in the plyr
package provides a middle ground between merge()
and SQL’s joining by directly using SQL terminology with the join()
function.
join accepts the following arguments.
join(x, y, by, type, match)
x
x data set (left)
y
y data set (right)
by
key variable
type
type of join. Can be “left”, “right”, “inner”, or “full”.
match
: how should duplicate ids be matched? Either match just the “first” matching row, or match “all” matching rows. Defaults to “all” for compatibility with merge, but “first” is significantly faster.
The plyr
package’s join()
function has several attractive features
It does not, by default, sort the rows. This, by itself, avoids one of the chronic problems for novices who use merge()
join
uses SQL terminology, so its features are more immediately understandable to afficionados of SQL.
join
has seemingly-more sensible defaults for handling of keys. Importantly, two key values that are missing are never matched together in the joining process. In comparison, with merge()
, users must be careful to specify the incomparables
parameter.
Cases have been found in which join()
is faster than merge()
.
The cons of the join
function are that some functionality of merge is omitted. Most importantly:
plyr::join()
Because join()
requires us to have a key variable that has the same name in the two data sets, we need to do some data preparation before merging. Here, we choose to rename the variable surname
as name
.
colnames(authors) <- gsub("surname", "name", colnames(authors))
After that, we can use the variable “name” as the key.
plyr::join
join(x = books, y = authors, by = "name", type = "left")
name title other_author nationality deceased
1 Tukey Exploratory Data Analysis <NA> US yes
2 Venables Modern Applied Statistics Ripley Australia no
3 Tierney LISP-STAT <NA> US no
4 Ripley Spatial Statistics <NA> UK no
5 Ripley Stochastic Simulation <NA> UK no
6 McNeil Interactive Data Analysis <NA> Australia no
7 R Core An Introduction to R Venables & Smith <NA> <NA>
plyr::join
join(x = books , y= authors , by="name",type = "right")
name title other_author nationality deceased
1 Tukey Exploratory Data Analysis <NA> US yes
2 Venables Modern Applied Statistics Ripley Australia no
3 Tierney LISP-STAT <NA> US no
4 Ripley Spatial Statistics <NA> UK no
5 Ripley Stochastic Simulation <NA> UK no
6 McNeil Interactive Data Analysis <NA> Australia no
plyr::join
join(x = books , y = authors , by="name",type = "full")
name title other_author nationality deceased
1 Tukey Exploratory Data Analysis <NA> US yes
2 Venables Modern Applied Statistics Ripley Australia no
3 Tierney LISP-STAT <NA> US no
4 Ripley Spatial Statistics <NA> UK no
5 Ripley Stochastic Simulation <NA> UK no
6 McNeil Interactive Data Analysis <NA> Australia no
7 R Core An Introduction to R Venables & Smith <NA> <NA>
plyr::join
join(x = books, y= authors, by="name",type = "inner")
name title other_author nationality deceased
1 Tukey Exploratory Data Analysis <NA> US yes
2 Venables Modern Applied Statistics Ripley Australia no
3 Tierney LISP-STAT <NA> US no
4 Ripley Spatial Statistics <NA> UK no
5 Ripley Stochastic Simulation <NA> UK no
6 McNeil Interactive Data Analysis <NA> Australia no
This demonstrates that the “m:1” merge works as well with join()
as it does with merge()
.
join(x = d_long , y = d_wide , by = "ID", type = "full")
ID Region Industry
1 1 Southeast Hot_Dogs
2 1 East_Coast Hot_Dogs
3 1 Deep_South Hot_Dogs
4 1 Southeast Hot_Dogs
5 2 West_Coast Candy
6 2 West_Coast Candy
7 2 East_Coast Candy
8 2 West_Coast Candy
9 3 Deep_South YoYo's
10 3 East_Coast YoYo's
11 3 Southeast YoYo's
12 3 West_Coast YoYo's
13 4 West_Coast Toilet_Paper
14 4 Midwest Toilet_Paper
15 4 Deep_South Toilet_Paper
16 4 Deep_South Toilet_Paper
merge(x = d_long, y = d_wide, by = "ID", all.x = TRUE, all.y = TRUE)
ID Region Industry
1 1 Southeast Hot_Dogs
2 1 East_Coast Hot_Dogs
3 1 Deep_South Hot_Dogs
4 1 Southeast Hot_Dogs
5 2 West_Coast Candy
6 2 West_Coast Candy
7 2 East_Coast Candy
8 2 West_Coast Candy
9 3 Deep_South YoYo's
10 3 East_Coast YoYo's
11 3 Southeast YoYo's
12 3 West_Coast YoYo's
13 4 West_Coast Toilet_Paper
14 4 Midwest Toilet_Paper
15 4 Deep_South Toilet_Paper
16 4 Deep_South Toilet_Paper
join(x = x , y = y , by = "ID", type = "inner")
k1 k2 ID k1 k2
1 NA 1 1 NA NA
2 NA NA 2 2 NA
3 3 NA 3 NA 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6
It does not have sorting issues by default
join(x = d, y = d2, by = "ID", type = "inner")
ID Company Profit Company Region Cost
1 1 Company_A 89.35583 Company_A Deep_South 797.7308
2 2 Company_B 102.83809 Company_B East_Coast 1101.8534
3 3 Company_A 98.06931 Company_A East_Coast 896.0806
4 4 Company_D 110.61756 Company_D Deep_South 967.5251
5 5 Company_D 96.34582 Company_D West_Coast 1158.7691
6 6 Company_B 107.56797 Company_B Deep_South 995.3901
7 7 Company_D 74.31186 Company_D Southeast 874.8078
8 8 Company_D 95.65268 Company_D East_Coast 1164.3170
9 9 Company_D 101.90891 Company_D Midwest 901.2424
10 10 Company_A 109.27751 Company_A East_Coast 1060.5660
11 11 Company_B 102.53551 Company_B East_Coast 804.7798
12 12 Company_C 98.26921 Company_C Southeast 976.1280
13 13 Company_B 86.09252 Company_B Midwest 1025.3395
14 14 Company_C 88.54110 Company_C Southeast 1023.7806
15 15 Company_A 95.08599 Company_A Southeast 765.8460
16 16 Company_D 102.08956 Company_D Southeast 1138.5084
17 17 Company_A 73.76429 Company_A Deep_South 919.6904
18 18 Company_D 97.95117 Company_D Deep_South 1001.0430
19 19 Company_A 94.66364 Company_A Deep_South 1004.8157
20 20 Company_C 93.46937 Company_C Southeast 918.1178
21 21 Company_C 112.19280 Company_C East_Coast 1125.9425
22 22 Company_A 116.77361 Company_A East_Coast 1027.5019
23 23 Company_B 99.19894 Company_B West_Coast 1020.3370
24 24 Company_C 97.37098 Company_C Southeast 879.4884
25 25 Company_C 105.48620 Company_C East_Coast 1269.7293
26 26 Company_D 101.03255 Company_D Midwest 1020.5146
27 27 Company_D 101.25741 Company_D Deep_South 750.3661
28 28 Company_C 91.17284 Company_C Deep_South 921.7680
29 29 Company_C 102.61665 Company_C Midwest 841.4575
30 30 Company_A 89.43695 Company_A Southeast 1197.9956
data.table
Package for RThe data.table
package (Dowle and Srinivasan (2017)) for R offers another method for merging data. The CRMDA has prepared a separate guide, with working examples, for data.table
Data Table Guide: Basic Usage and Advanced Topics. The data.table
package has been especially helpful in dealing with very large data structures, reducing runtime from hours to minutes in several cases. For that reason, we typically prefer to use the data.table
package when working with large datasets (several million rows). The ‘data.table’ package merges data frames (which are first converted to data tables) faster than the aforementioned methods. We suggest using this link to learn more about ‘data.table’: https://www.datacamp.com/courses/data-analysis-the-data-table-way.
Here we replicate some of the previous examples with ‘data.table’ to demonstrate its use.
library(data.table)
We can take the authors and books data frames from before and convert them to data table objects. We must also set the keys for these data tables. The key(s) indicates which variables to sort the data table by and what to use as a key when merging.
authors
surname nationality deceased
1 Tukey US yes
2 Venables Australia no
3 Tierney US no
4 Ripley UK no
5 McNeil Australia no
books
name title other_author
1 Tukey Exploratory Data Analysis <NA>
2 Venables Modern Applied Statistics Ripley
3 Tierney LISP-STAT <NA>
4 Ripley Spatial Statistics <NA>
5 Ripley Stochastic Simulation <NA>
6 McNeil Interactive Data Analysis <NA>
7 R Core An Introduction to R Venables & Smith
authors <- as.data.table(authors)
books <- as.data.table(books)
setkey(authors, surname)
setkey(books, name)
data.table
It appears backwards, but here is a left join of books
(y) onto authors
(x
).
books[authors]
name title other_author nationality deceased
1: McNeil Interactive Data Analysis NA Australia no
2: Ripley Spatial Statistics NA UK no
3: Ripley Stochastic Simulation NA UK no
4: Tierney LISP-STAT NA US no
5: Tukey Exploratory Data Analysis NA US yes
6: Venables Modern Applied Statistics Ripley Australia no
We can reverse it for a right join.
authors[books]
surname nationality deceased title
1: McNeil Australia no Interactive Data Analysis
2: R Core NA NA An Introduction to R
3: Ripley UK no Spatial Statistics
4: Ripley UK no Stochastic Simulation
5: Tierney US no LISP-STAT
6: Tukey US yes Exploratory Data Analysis
7: Venables Australia no Modern Applied Statistics
other_author
1: NA
2: Venables & Smith
3: NA
4: NA
5: NA
6: NA
7: Ripley
And specifying ‘nomatch = 0’ indicates an inner join.
authors[books, nomatch=0]
surname nationality deceased title other_author
1: McNeil Australia no Interactive Data Analysis NA
2: Ripley UK no Spatial Statistics NA
3: Ripley UK no Stochastic Simulation NA
4: Tierney US no LISP-STAT NA
5: Tukey US yes Exploratory Data Analysis NA
6: Venables Australia no Modern Applied Statistics Ripley
Here we have two data frames with information about NFL quarterbacks and how many Super Bowls and league MVP awards they have won.
x <- data.frame("Name" = c("Johnson", NA, "Brady", "Ryan", NA, "Goff"),
"Titles" = c(1, 0, 5, 0, 1, 0))
y <- data.frame("LastName" = c(NA, "Marino", "Brady", "Ryan", "Wilson",
"Bryant", NA),
"MVPs" = c(0, NA, 3, 0, 0, 0, 0))
x
Name Titles
1 Johnson 1
2 <NA> 0
3 Brady 5
4 Ryan 0
5 <NA> 1
6 Goff 0
y
LastName MVPs
1 <NA> 0
2 Marino NA
3 Brady 3
4 Ryan 0
5 Wilson 0
6 Bryant 0
7 <NA> 0
Unfortunately, matches occur on with missing key values.
x <- as.data.table(x)
y <- as.data.table(y)
setkey(x, Name)
setkey(y, LastName)
## Left join
y[x]
LastName MVPs Titles
1: NA 0 0
2: NA 0 0
3: NA 0 1
4: NA 0 1
5: Brady 3 5
6: Goff NA 0
7: Johnson NA 1
8: Ryan 0 0
## Right join
x[y]
Name Titles MVPs
1: NA 0 0
2: NA 1 0
3: NA 0 0
4: NA 1 0
5: Brady 5 3
6: Bryant NA 0
7: Marino NA NA
8: Ryan 0 0
9: Wilson NA 0
## Inner join
y[x, nomatch = 0]
LastName MVPs Titles
1: NA 0 0
2: NA 0 0
3: NA 0 1
4: NA 0 1
5: Brady 3 5
6: Ryan 0 0
Clearly, missing values in keys is going to be a problem. Removing cases with missing prior to merging might be the best option.
The data.table package allows multiple key values to be specified. This can be useful when a single variable does not uniquely identify each row. In the data created here there are two cases with the last name “Manning.”
x <- data.frame("First" = c("Peyton", "Eli", "Tom", "Matt"),
"Last" = c("Manning", "Manning", "Brady", "Ryan"),
"Titles" = c(2, 2, 5, 0))
y <- data.frame("FirstName" = c("Peyton", "Tom", "Matt", "Russell",
"Dez", "Von"),
"LastName" = c("Manning", "Brady", "Ryan", "Wilson",
"Bryant", "Miller"),
"MVPs" = c(3, 3, 1, 0, 0, 0))
x
First Last Titles
1 Peyton Manning 2
2 Eli Manning 2
3 Tom Brady 5
4 Matt Ryan 0
y
FirstName LastName MVPs
1 Peyton Manning 3
2 Tom Brady 3
3 Matt Ryan 1
4 Russell Wilson 0
5 Dez Bryant 0
6 Von Miller 0
If we were to merge the data tables using only the last name variables as the key, the results would be strange.
x <- data.table(x)
y <- data.table(y)
setkey(x, Last)
setkey(y, LastName)
## Left join
y[x]
FirstName LastName MVPs First Titles
1: Tom Brady 3 Tom 5
2: Peyton Manning 3 Peyton 2
3: Peyton Manning 3 Eli 2
4: Matt Ryan 1 Matt 0
## Right join
x[y]
First Last Titles FirstName MVPs
1: Tom Brady 5 Tom 3
2: NA Bryant NA Dez 0
3: Peyton Manning 2 Peyton 3
4: Eli Manning 2 Peyton 3
5: NA Miller NA Von 0
6: Matt Ryan 0 Matt 1
7: NA Wilson NA Russell 0
## Inner join
y[x, nomatch = 0]
FirstName LastName MVPs First Titles
1: Tom Brady 3 Tom 5
2: Peyton Manning 3 Peyton 2
3: Peyton Manning 3 Eli 2
4: Matt Ryan 1 Matt 0
Check to see if another variable can be used as a key!
Are the values in ‘x[ , Last]’ unique?
nrow(x) == length(unique(x[ , Last]))
[1] FALSE
How about ‘x[ , Last, First]’?
nrow(x) == nrow(unique(x[ , Last, First]))
[1] TRUE
These two variables uniquely identify each row! We can use both of them as keys.
setkey(x, Last, First)
setkey(y, LastName, FirstName)
## Left join
y[x]
FirstName LastName MVPs Titles
1: Tom Brady 3 5
2: Eli Manning NA 2
3: Peyton Manning 3 2
4: Matt Ryan 1 0
## Right join
x[y]
First Last Titles MVPs
1: Tom Brady 5 3
2: Dez Bryant NA 0
3: Peyton Manning 2 3
4: Von Miller NA 0
5: Matt Ryan 0 1
6: Russell Wilson NA 0
The ‘data.table’ package provides the same output from merging as the previously shown methods (with the exception ‘data.table’ producing ‘data.table’ objects, not ‘data.frame’ objects). The major benefit of using ‘data.table’ is the increase in speed when working with large data frames.
Dowle, Matt, and Arun Srinivasan. 2017. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.
R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Sharp, Elaine B., and Paul E. Johnson. 2009. “Accounting for Variation in Distrust of Local Police.” Justice Quarterly 26 (1): 157–82. doi:10.1080/07418820802290496.
Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1): 1–29. http://www.jstatsoft.org/v40/i01/.