Terminology

The term merge means to combine two data sets by aligning information about individual rows in each one. For example, in a study of police brutality, one team had public opinion about the attitudes toward police in 33 American cities (Sharp and Johnson (2009)). They had a separate data set about police administration in those cities. It was necessary to merge the information about cities “onto” the information about individuals to find out, for example, if cities with African-American police chiefs experienced an improvement in police evaluations among non-white residents of those cities. This is a common requirement in “multilevel statistical analysis”. Another common analysis scenario that may require combining two, or more than two, data sets is longitudinal research - same individuals have been surveyed at different time waves and the survey responses were stored separately by waves. Usually, there is a unique ID variable for each survey respondent in all the data sets and it can be used as the key variable to link information collected at different time waves about a same person.

What is SQL and Why Do We Care?

Statistical programs like Stata, SAS, SPSS, and R developed their own terminology for matching and aligning different data sets. The methods of combining 2 data sets developed in small, isolated, language-specific bubbles which, in retrospect, ignored the rapidly developing general purpose database language known now as SQL.

From the 1970s onward, the IBM corporation was developing its own general purpose language for data restructuring called SQL, Structured Query Language (see Codd, 1970). SQL has been widely used in database construction and is now considered to be a staple in the construction of data-backed Websites that absorb and redistribute information (e.g., see Oracle’s SQL implementation).

To the surprise of many database enthusiasts, R’s base function merge() does not use SQL terminology. That surely creates frustration for them, since the terminology for aligning data sets must be completely re-learned. However, as mentioned (somewhat tersely) in the help page for R’s merge(), it is possible to create a translation table between SQL terminology and the approach used in R.

In SQL database terminology, the default value of ‘all = FALSE’ gives a natural join, a special case of an inner join. Specifying ‘all.x = TRUE’ gives a left (outer) join, ‘all.y = TRUE’ a right (outer) join, and both (‘all = TRUE’ a (full) outer join. DBMSes do not match ‘NULL’ records, equivalent to ‘incomparables = NA’ in R. (R Core Team 2017).

We don’t expect that paragraph is understandable to most readers of this guide. And that is the problem we are trying to fix by preparing this document.

nocite: ‘R Core Team (2017)

The merge() function is in the base of R (R Core Team 2017), but there are also user-contributed packages for R that more explicitly use the terminology of SQL. The packages that we emphasize in this note are plyr and data.table. We believe it is safe to say that, without a basic understanding of SQL terminology, users will have a very difficult time making use of those packages.

SQL Terminology

Join means to combine two data sets. In SQL, the custom is to refer to the two data sets as “left” and “right”. Think of the one on the left as the “target” and information from the “right” is extracted and paired together with the rows of the one on the left.

A key is a variable that is used to align the separate information of the two data sets being combined. The key variable is a common variable between the 2 data frames being joined/merged, typically it is an “ID” variable or “Name”.

SQL Join notation has set a standard language for merging in the data science community. The following is a general outline of the SQL syntax. In this form, the first table is the “left” data frame, and the second is the “right” data frame. When joining/merging, the user can specify qualities of the merge that treat the right and left data frame differently. The proceeding examples highlight this.

 SELECT "column_name(s)"
 FROM "table1 "
 JOIN "table2 "
 ON "table1.key1=table2.key2;"

Arguments

1.) SELECT names the variables to be retained

2.) table1 is known as the left table, table2 is the right table.

3.) JOIN is a keyword that specifies the type of merge. The types that we discuss in this note are INNER JOIN, OUTER JOIN (AKA FULL JOIN) and LEFT JOIN.

4.) ON specifies the key variables, the ones that are to be aligned. In this case, table1.key1=table2.key2; means that in table1, there will be a variable “key1” and that is used to align the cases as specified in table2’s variable “key2”. They might be the name of a city or a person’s identification number.

Note: The . appended in SQL is equivalent to the $ operator in R.

Merge in R

The merge() function in base R combines 2 data frames, which it refers to as x and y. If we were using SQL terminology, we would say x is the left set and y is the right set. R uses the name by to refer to the key variables.

Syntax of R merge()

The merge() function receives the following arguments.

Arguments

  1. x would be called the left data set in SQL

  2. y the right data set

  3. by, by.x, by.y provide the key variables. Specify different key variables with by.x and by.y, or if the key variables have the same name, then simply by should be used.

  4. incomparables specifies which values of the key should never be matched. It is a good habit to specify missing values and other values that should not be logically equated. Almost always, we need a laundry list to prevent accidental matches, such as incomparables = c(NA, NULL, Inf).

  5. sort: unless this is FALSE, R will re-order the rows of the result. The sorting process sometimes seems surprising to users, who would rather have the rows remain in the ordering specified by the x data set. We suggest most users should set sort = FALSE, and will illustrate the problem below.

  6. all, all.x, all.y arguments ask for either, all entries to appear in the output set, or all X or Y entries. By default, all = FALSE so only matching entries from both sides will appear in the output data set. If a variable has a missing entry it will be filled with NA’s

Our first example data sets:

## Data set x
authors
   surname nationality deceased
1    Tukey          US      yes
2 Venables   Australia       no
3  Tierney          US       no
4   Ripley          UK       no
5   McNeil   Australia       no
## Data set y
books
      name                     title     other_author
1    Tukey Exploratory Data Analysis             <NA>
2 Venables Modern Applied Statistics           Ripley
3  Tierney                 LISP-STAT             <NA>
4   Ripley        Spatial Statistics             <NA>
5   Ripley     Stochastic Simulation             <NA>
6   McNeil Interactive Data Analysis             <NA>
7   R Core      An Introduction to R Venables & Smith

Inner Join

In SQL terminology, an “inner join” is a merge that only keeps information about rows that are matched in the keys of both data sets.

In R’s merge() function, the parameters all, all.x, and all.y determine whether a merge will perform like an inner join, an outer join, or a left join. The merge() equivalent of an inner join is obtained if all=FALSE, which is the default in merge. all=FALSE implies all.x=FALSE and all.y=FALSE.

R merge with all=FALSE

The following merge will produce a data set in which only cases will appear that have matches from both the x (authors) and y (books) data sets. Notice that the ‘R core team’ is excluded because it does not appear in both data sets.

Note: the x and y data set had different names for the key variable. When this occurs the output data set will have the x variable name.

merg.df.1 <- merge(x = authors, y = books, by.x = "surname", by.y = "name",
                   all = FALSE)
merg.df.1
   surname nationality deceased                     title other_author
1   McNeil   Australia       no Interactive Data Analysis         <NA>
2   Ripley          UK       no        Spatial Statistics         <NA>
3   Ripley          UK       no     Stochastic Simulation         <NA>
4  Tierney          US       no                 LISP-STAT         <NA>
5    Tukey          US      yes Exploratory Data Analysis         <NA>
6 Venables   Australia       no Modern Applied Statistics       Ripley

SQL Inner Join

Note: A “table” in SQL is a “data frame” in R terminology.

 SELECT authors.surname, authors.nationality, authors.deceased,
        books.name, books.title, books.other_author
 FROM authors
 INNER JOIN books
 ON authors.surname = books.name;

Note: Always end SQL statements with a semi-colon, ;.

Does it matter if the left and right data sets are swapped?

Answer: Not very much! Because an inner join only keeps the items that match exactly in each data set, it does not matter very much in R whether we think of one data set as x or y.

This is another inner join, however the x and y variables are switched, the output data set is the same. The most important difference in the result is that the rows and columns come out in different orders.

Note in the output that the R core team is not included because it did not appear in both data sets.

merg.df.2 <- merge(x = books, y = authors, by.x = "name", by.y = "surname")
merg.df.2
      name                     title other_author nationality deceased
1   McNeil Interactive Data Analysis         <NA>   Australia       no
2   Ripley        Spatial Statistics         <NA>          UK       no
3   Ripley     Stochastic Simulation         <NA>          UK       no
4  Tierney                 LISP-STAT         <NA>          US       no
5    Tukey Exploratory Data Analysis         <NA>          US      yes
6 Venables Modern Applied Statistics       Ripley   Australia       no

Outer Join (Full Join)

An outer join keeps all of the records from either the data sets. Empty cells (missing values) are created for the unmatched cases.

merge: Outer join implies all = TRUE

The following code uses the all argument with the value TRUE. This instructs the merge to keep all of the rows, whether there are matches in the other data set.

Note: If a value does not have an associated entry missing values will appear as “NA”. This is exemplified by the R core team row which has “NA” for “nationality” and “deceased”

merg.df.3 <- merge(x = authors, y = books, by.x = "surname", by.y = "name",
                   all = TRUE)
merg.df.3
   surname nationality deceased                     title     other_author
1   McNeil   Australia       no Interactive Data Analysis             <NA>
2   R Core        <NA>     <NA>      An Introduction to R Venables & Smith
3   Ripley          UK       no        Spatial Statistics             <NA>
4   Ripley          UK       no     Stochastic Simulation             <NA>
5  Tierney          US       no                 LISP-STAT             <NA>
6    Tukey          US      yes Exploratory Data Analysis             <NA>
7 Venables   Australia       no Modern Applied Statistics           Ripley

Equivalent SQL code

 SELECT authors.surname, authors.nationality, authors.deceased,
        books.name, books.title, books.other_author
 FROM books
 FULL JOIN authors
 ON books.name = authors.surname ;

Left Join

Left Join will include all entries in the left data set, which we are calling x. Rows from y are selected only if the key value matches the key for the x data set. This drops unmatched entries from y, in other words.

merge: Left join implies all.x = TRUE

The left join in R’s merge() function is obtained by setting all.x = TRUE and all.y = FALSE. Because all.y = FALSE is the default, it is not necessary to insert that in the function call, as we show here.

The R core team is not included in the output data set because it is only found in the y (right) data set.

merg.df.4 <- merge(x = authors, y = books, by.x = "surname", by.y = "name",
            all.x = TRUE)
merg.df.4
   surname nationality deceased                     title other_author
1   McNeil   Australia       no Interactive Data Analysis         <NA>
2   Ripley          UK       no        Spatial Statistics         <NA>
3   Ripley          UK       no     Stochastic Simulation         <NA>
4  Tierney          US       no                 LISP-STAT         <NA>
5    Tukey          US      yes Exploratory Data Analysis         <NA>
6 Venables   Australia       no Modern Applied Statistics       Ripley

Equivalent SQL code

 SELECT authors.surname, authors.nationality, authors.deceased,
        books.name, books.title, books.other_author
 FROM authors
 LEFT JOIN books
 ON authors.surname = books.name;

Right Join

In SQL, the right merge is not usually discussed because one can simply swap the left and right sets to obtain an equivalent merge. However, in R is possible to explicitly conduct a right merge, one which keeps all of the rows from y and attaches them to the rows from x which match on the key variable.

merge: Right join implies all.y = TRUE

To achieve a right join with merge(), we set all.y = TRUE and leave the default all.x = FALSE unchanged.

merg.df.5 <- merge(x = authors, y = books, by.x = "surname", by.y = "name",
                   all.y = TRUE)
merg.df.5
   surname nationality deceased                     title     other_author
1   McNeil   Australia       no Interactive Data Analysis             <NA>
2   R Core        <NA>     <NA>      An Introduction to R Venables & Smith
3   Ripley          UK       no        Spatial Statistics             <NA>
4   Ripley          UK       no     Stochastic Simulation             <NA>
5  Tierney          US       no                 LISP-STAT             <NA>
6    Tukey          US      yes Exploratory Data Analysis             <NA>
7 Venables   Australia       no Modern Applied Statistics           Ripley

Equivalent SQL code

 SELECT authors.surname, authors.nationality, authors.deceased,
        books.name, books.title, books.other_author
 FROM authors
 RIGHT JOIN books
 ON authors.surname = books.name;

Troubleshooting R’s merge() function

Avoiding unintended matches of missing values in the key

Use the incomparables parameter in merge()

We create two small example data frames with missing values on the key variables, which are named k1 and k2.

x
  k1 k2 ID
1 NA  1  1
2 NA NA  2
3  3 NA  3
4  4  4  4
5  5  5  5
6  6  6  6
y
  k1 k2 ID
1 NA NA  1
2  2 NA  2
3 NA  3  3
4  4  4  4
5  5  5  5
6  6  6  6

Notice that the merge treats the NAs in x as if they are matches for the NAs in y. Almost surely, this is an error for most real life data analysis projects.

merg.df.6 <- merge(x = x, y = y, by = c("k1", "k2"))
## Loose your job
merg.df.6
  k1 k2 ID.x ID.y
1  4  4    4    4
2  5  5    5    5
3  6  6    6    6
4 NA NA    2    1

To resolve this problem, the ‘incomparables’ argument should specify the values which are not intended for comparison. For instance, we might simply insert NA as an incomparable value.

merg.df.7 <- merge(x = x, y = y, by = c("k2"), incomparables = NA)

The result is considerably more satisfactory

merg.df.7
  k2 k1.x ID.x k1.y ID.y
1  4    4    4    4    4
2  5    5    5    5    5
3  6    6    6    6    6

In many projects, we would protect ourselves by including a laundry list of legal R values that should not be compared in the keys. We would almost certainly have incomparables = c(NA, NaN, Inf, "\\s*") to prevent matches for missing values, the “not a number” indicator, infinity, and blank spaces.

Prevent merge() from Sorting the Output

The merge() function will, by default, sort the output data according to the values of the key variables. Because R treats the keys as if they were factors, the sorting of the values may have unintended consequences of “shuffling” the rows. Many users would rather leave the rows of the left hand data set x as they were, and then when new rows are inserted to accomodate y, they should be placed at the end of the result.

Here we have an example of how the default sorting approach creates a row-ordering that is, to say the least, unexpected.

Recall that R’s data.frame function by default converts character variables into R factor variables.

Notice the ordering of ID variable in the merged result. It counts up from \(1, 10, 11, \ldots, 19, 2, 20, 21, \ldots\). Again, almost surely, this is unwanted.

d_merged <- merge(x = d, y = d2, by = c("ID", "Company"),
                  sort = TRUE, all.x = TRUE)
d_merged
   ID   Company    profit     Region      Cost
1   1 Company_C  94.60664    Midwest 1107.8273
2  10 Company_B  84.36037  Southeast  729.8151
3  11 Company_D 109.48867  Southeast 1081.8502
4  12 Company_D  94.09264 Deep_South 1047.0861
5  13 Company_D 111.10180 East_Coast 1037.2872
6  14 Company_C 102.15098 Deep_South  904.5234
7  15 Company_B  74.23635 Deep_South 1054.9895
8  16 Company_A  85.56089  Southeast 1021.9197
9  17 Company_A 124.56038  Southeast  995.9387
10 18 Company_B 101.64704    Midwest 1027.2783
11 19 Company_D  97.03944  Southeast 1025.3005
12  2 Company_A 105.93583 East_Coast 1076.0270
13 20 Company_D 111.42333  Southeast  556.2863
14 21 Company_B  94.07712    Midwest 1329.6121
15 22 Company_D  99.93876 West_Coast  961.9473
16 23 Company_D  98.13989 West_Coast 1190.9509
17 24 Company_B  84.79221 Deep_South 1165.2390
18 25 Company_C  78.36247  Southeast  839.2178
19 26 Company_D 104.79393 West_Coast  920.7129
20 27 Company_B  99.20409 Deep_South 1025.8085
21 28 Company_B  69.42068 West_Coast  710.7899
22 29 Company_D  82.57393 West_Coast 1128.0650
23  3 Company_A 112.69700    Midwest  863.1037
24 30 Company_B 103.57594 East_Coast 1099.6825
25  4 Company_C  94.65275 East_Coast 1024.3458
26  5 Company_B 106.85077 East_Coast  973.2696
27  6 Company_B 124.31956 West_Coast  980.5545
28  7 Company_C 108.34467 East_Coast  980.2809
29  8 Company_D  98.15311 West_Coast 1033.9503
30  9 Company_A  95.82280 East_Coast 1086.2214

The solution for this is quite simple. Don’t allow merge() to sort the rows of the merged set.

d_merged <- merge(x = d, y = d2, by = c("ID", "Company"),
                  sort = FALSE, all.x = TRUE)
d_merged
   ID   Company    profit     Region      Cost
1   1 Company_C  94.60664    Midwest 1107.8273
2   2 Company_A 105.93583 East_Coast 1076.0270
3   3 Company_A 112.69700    Midwest  863.1037
4   4 Company_C  94.65275 East_Coast 1024.3458
5   5 Company_B 106.85077 East_Coast  973.2696
6   6 Company_B 124.31956 West_Coast  980.5545
7   7 Company_C 108.34467 East_Coast  980.2809
8   8 Company_D  98.15311 West_Coast 1033.9503
9   9 Company_A  95.82280 East_Coast 1086.2214
10 10 Company_B  84.36037  Southeast  729.8151
11 11 Company_D 109.48867  Southeast 1081.8502
12 12 Company_D  94.09264 Deep_South 1047.0861
13 13 Company_D 111.10180 East_Coast 1037.2872
14 14 Company_C 102.15098 Deep_South  904.5234
15 15 Company_B  74.23635 Deep_South 1054.9895
16 16 Company_A  85.56089  Southeast 1021.9197
17 17 Company_A 124.56038  Southeast  995.9387
18 18 Company_B 101.64704    Midwest 1027.2783
19 19 Company_D  97.03944  Southeast 1025.3005
20 20 Company_D 111.42333  Southeast  556.2863
21 21 Company_B  94.07712    Midwest 1329.6121
22 22 Company_D  99.93876 West_Coast  961.9473
23 23 Company_D  98.13989 West_Coast 1190.9509
24 24 Company_B  84.79221 Deep_South 1165.2390
25 25 Company_C  78.36247  Southeast  839.2178
26 26 Company_D 104.79393 West_Coast  920.7129
27 27 Company_B  99.20409 Deep_South 1025.8085
28 28 Company_B  69.42068 West_Coast  710.7899
29 29 Company_D  82.57393 West_Coast 1128.0650
30 30 Company_B 103.57594 East_Coast 1099.6825

Or, alternatively, more carefully police the values of the key so that when the values are sorted, they will be placed into a correct order. (There is a function in the rockchalk package named padw0() that would change the values of ID to \(01, 02, \ldots, 30\) so that the factor variables would have the corrected sort.

d$ID <- rockchalk::padW0(d$ID)
d2$ID <- rockchalk::padW0(d$ID)

d_merged_pad <- merge(x = d, y = d2, by = c("ID", "Company"),
                      sort = TRUE, all.x = TRUE)
d_merged_pad
   ID   Company    profit     Region      Cost
1  01 Company_C  94.60664    Midwest 1107.8273
2  02 Company_A 105.93583 East_Coast 1076.0270
3  03 Company_A 112.69700    Midwest  863.1037
4  04 Company_C  94.65275 East_Coast 1024.3458
5  05 Company_B 106.85077 East_Coast  973.2696
6  06 Company_B 124.31956 West_Coast  980.5545
7  07 Company_C 108.34467 East_Coast  980.2809
8  08 Company_D  98.15311 West_Coast 1033.9503
9  09 Company_A  95.82280 East_Coast 1086.2214
10 10 Company_B  84.36037  Southeast  729.8151
11 11 Company_D 109.48867  Southeast 1081.8502
12 12 Company_D  94.09264 Deep_South 1047.0861
13 13 Company_D 111.10180 East_Coast 1037.2872
14 14 Company_C 102.15098 Deep_South  904.5234
15 15 Company_B  74.23635 Deep_South 1054.9895
16 16 Company_A  85.56089  Southeast 1021.9197
17 17 Company_A 124.56038  Southeast  995.9387
18 18 Company_B 101.64704    Midwest 1027.2783
19 19 Company_D  97.03944  Southeast 1025.3005
20 20 Company_D 111.42333  Southeast  556.2863
21 21 Company_B  94.07712    Midwest 1329.6121
22 22 Company_D  99.93876 West_Coast  961.9473
23 23 Company_D  98.13989 West_Coast 1190.9509
24 24 Company_B  84.79221 Deep_South 1165.2390
25 25 Company_C  78.36247  Southeast  839.2178
26 26 Company_D 104.79393 West_Coast  920.7129
27 27 Company_B  99.20409 Deep_South 1025.8085
28 28 Company_B  69.42068 West_Coast  710.7899
29 29 Company_D  82.57393 West_Coast 1128.0650
30 30 Company_B 103.57594 East_Coast 1099.6825

Merging Long Data with Wide Data

This is sometimes referred to as an “m:1” merge. This kind of case arises when x is, for example, surveys for children, some of whom are from the same family, and y is data about the families themselves. There are several observations for each unit in the left data set and these appear as “repeated” key values. In the right data set we want to find one that can be associated with the “m” values of the key variable The information about the family from the y data set must be copied into each of the rows in x where the family appears.

For example:

d_long
   ID     Region
1   1  Southeast
2   1 East_Coast
3   1 Deep_South
4   1  Southeast
5   2 West_Coast
6   2 West_Coast
7   2 East_Coast
8   2 West_Coast
9   3 Deep_South
10  3 East_Coast
11  3  Southeast
12  3 West_Coast
13  4 West_Coast
14  4    Midwest
15  4 Deep_South
16  4 Deep_South
d_wide
  ID     Industry
1  1     Hot_Dogs
2  2        Candy
3  3       YoYo's
4  4 Toilet_Paper

The merge result accurately fills columns, putting together the repeated values of “Region” with identical matching values of “Industry”.

d_merged <- merge(d_long, d_wide, by.x = "ID", by.y = "ID", all.y = TRUE)
d_merged
   ID     Region     Industry
1   1  Southeast     Hot_Dogs
2   1 East_Coast     Hot_Dogs
3   1 Deep_South     Hot_Dogs
4   1  Southeast     Hot_Dogs
5   2 West_Coast        Candy
6   2 West_Coast        Candy
7   2 East_Coast        Candy
8   2 West_Coast        Candy
9   3 Deep_South       YoYo's
10  3 East_Coast       YoYo's
11  3  Southeast       YoYo's
12  3 West_Coast       YoYo's
13  4 West_Coast Toilet_Paper
14  4    Midwest Toilet_Paper
15  4 Deep_South Toilet_Paper
16  4 Deep_South Toilet_Paper

kutils mergeCheck() Function

Sometimes it is tedious to check data frames for illegal or unmatched key values. In the newest version of the kutils package for R (1.34), we have introduced a new function to provide quick diagnostics on the keys in a proposed merge. The diagnostic is called mergeCheck(). Consider the following examples taken directly from the mergeCheck() help documentation (example(mergeCheck)):

df1
  id          x
1  1  0.1183406
2  2 -0.1933020
3  3 -1.3085219
4  4 -0.2021461
5  5  0.8784685
6  6 -1.8701397
7  7  0.4722367
df2
  id            x
1  2  1.649934072
2  3 -0.213928867
3  4  0.263915672
4  5  0.721672513
5  6  0.614073547
6  9  0.825083943
7 10  0.000701853

Notice our two data frames have non matching key values.

mergeCheck(x = df1, y = df2, by = "id")
Merge difficulties detected

Unmatched cases from df1 and df2 :
df1 
  id         x
1  1 0.1183406
7  7 0.4722367
df2 
  id           x
6  9 0.825083943
7 10 0.000701853

mergeCheck() alerted us to the specific non matching ID numbers. Let’s consider a case with duplicate key values:

df1
    id          x
1    1 -0.2013404
2    2  1.3976885
3    3 -0.6415373
4 <NA> -0.3933596
5  NaN -0.5498042
6       0.8819304
7       0.8919925
df2
  id           x
1  2 -0.77630334
2  3 -0.91381163
3  4  2.81056380
4  5  0.80810218
5  6 -0.03209555
6  5 -0.03849050
7  6  1.11230932

Notice the duplicate key values in a single dataframe.

mergeCheck(x = df1, y = df2, by = "id")
Merge difficulties detected

Unacceptable key values
df1 
    id          x
4 <NA> -0.3933596
5  NaN -0.5498042
6       0.8819304
Duplicated key values
df2 
  id           x
4  5  0.80810218
5  6 -0.03209555
6  5 -0.03849050
7  6  1.11230932
Unmatched cases from df1 and df2 :
df1 
    id          x
1    1 -0.2013404
4 <NA> -0.3933596
5  NaN -0.5498042
6       0.8819304
7       0.8919925
df2 
  id           x
3  4  2.81056380
4  5  0.80810218
5  6 -0.03209555
6  5 -0.03849050
7  6  1.11230932

mergeCheck alerts us to three potential problems here: A missing key value, duplicate key values, and unmatched cases.

The most convenient element of mergeCheck() is that it maintains the same arguments as merge() once you pass the mergeCheck() test, you can simply erase the “Check” portion of the function and safely perform the merge. Let’s take a look at a succsesful merge check:

df1 <- data.frame(id = c(1:10), x = rnorm(10))
df2 <- data.frame(id = c(1:10), x = rnorm(10))

mergeCheck(x = df1, y = df2, by = "id")
Merge difficulties detected

Unmatched cases from df1 and df2 :
df1 
[1] id x 
<0 rows> (or 0-length row.names)
df2 
[1] id x 
<0 rows> (or 0-length row.names)

Notice mergeCheck() has nothing to point out as problematic.

It is important to note that in many situations mergecheck() may alert us to potential non-matching cases, but don’t fear! If we are conducting a left or right join, we should know well in advance one of our data frames will contain many unmatched IDs. Similarly in an inner join, it is expected we will encounter non-matching ID’s. We should be most concerned with missing key values, as well as duplicate key values.

merge() Summary

  1. INNER JOIN is equivalent to basic merge(all.x=FALSE, all.y = FALSE). Rows without matches are dropped from the resulting data set

  2. LEFT JOIN is equivalent to merge() with all.x = T. Rows in the Right (y) data table are dropped if not paired

  3. RIGHT JOIN is equivalent to merge() with all.y = T. Rows in the left (x) data table are dropped if not paired

  4. Full JOIN is equivalent to merge() with all = T. All rows are incorporated and empty cells are replaced with NULL (NA)

The plyr Package for R

The plyr package (Wickham (2011)), offers a handful of data manipulation tools that are part of a sequence of offerings that is now known as the “tidyverse”.

The join() function in plyr handles left, right, and full merges with results similar to the R base merge(), but with some subtle conveniences.

Join function

The join function in the plyr package provides a middle ground between merge() and SQL’s joining by directly using SQL terminology with the join() function.

Arguments

join accepts the following arguments.

  • join(x, y, by, type, match)
  1. x x data set (left)

  2. y y data set (right)

  3. by key variable

  4. type type of join. Can be “left”, “right”, “inner”, or “full”.

  5. match: how should duplicate ids be matched? Either match just the “first” matching row, or match “all” matching rows. Defaults to “all” for compatibility with merge, but “first” is significantly faster.

Pros and Cons

The plyr package’s join() function has several attractive features

  1. It does not, by default, sort the rows. This, by itself, avoids one of the chronic problems for novices who use merge()

  2. join uses SQL terminology, so its features are more immediately understandable to afficionados of SQL.

  3. join has seemingly-more sensible defaults for handling of keys. Importantly, two key values that are missing are never matched together in the joining process. In comparison, with merge(), users must be careful to specify the incomparables parameter.

  4. Cases have been found in which join() is faster than merge().

The cons of the join function are that some functionality of merge is omitted. Most importantly:

  1. The key variables must have exactly the same names in the two data sets.

Left, Right, Full and Outer joins with plyr::join()

Because join() requires us to have a key variable that has the same name in the two data sets, we need to do some data preparation before merging. Here, we choose to rename the variable surname as name.

colnames(authors) <- gsub("surname", "name", colnames(authors))

After that, we can use the variable “name” as the key.

Left Join with plyr::join

join(x = books, y = authors, by = "name", type = "left")
      name                     title     other_author nationality deceased
1    Tukey Exploratory Data Analysis             <NA>          US      yes
2 Venables Modern Applied Statistics           Ripley   Australia       no
3  Tierney                 LISP-STAT             <NA>          US       no
4   Ripley        Spatial Statistics             <NA>          UK       no
5   Ripley     Stochastic Simulation             <NA>          UK       no
6   McNeil Interactive Data Analysis             <NA>   Australia       no
7   R Core      An Introduction to R Venables & Smith        <NA>     <NA>

Right Join with plyr::join

join(x = books , y= authors , by="name",type = "right")
      name                     title other_author nationality deceased
1    Tukey Exploratory Data Analysis         <NA>          US      yes
2 Venables Modern Applied Statistics       Ripley   Australia       no
3  Tierney                 LISP-STAT         <NA>          US       no
4   Ripley        Spatial Statistics         <NA>          UK       no
5   Ripley     Stochastic Simulation         <NA>          UK       no
6   McNeil Interactive Data Analysis         <NA>   Australia       no

Full Join with plyr::join

join(x = books , y = authors , by="name",type = "full")
      name                     title     other_author nationality deceased
1    Tukey Exploratory Data Analysis             <NA>          US      yes
2 Venables Modern Applied Statistics           Ripley   Australia       no
3  Tierney                 LISP-STAT             <NA>          US       no
4   Ripley        Spatial Statistics             <NA>          UK       no
5   Ripley     Stochastic Simulation             <NA>          UK       no
6   McNeil Interactive Data Analysis             <NA>   Australia       no
7   R Core      An Introduction to R Venables & Smith        <NA>     <NA>

Inner Join with plyr::join

join(x = books, y= authors, by="name",type = "inner")
      name                     title other_author nationality deceased
1    Tukey Exploratory Data Analysis         <NA>          US      yes
2 Venables Modern Applied Statistics       Ripley   Australia       no
3  Tierney                 LISP-STAT         <NA>          US       no
4   Ripley        Spatial Statistics         <NA>          UK       no
5   Ripley     Stochastic Simulation         <NA>          UK       no
6   McNeil Interactive Data Analysis         <NA>   Australia       no

Creating a “long” data set

This demonstrates that the “m:1” merge works as well with join() as it does with merge().

join(x = d_long , y = d_wide , by = "ID", type = "full")
   ID     Region     Industry
1   1  Southeast     Hot_Dogs
2   1 East_Coast     Hot_Dogs
3   1 Deep_South     Hot_Dogs
4   1  Southeast     Hot_Dogs
5   2 West_Coast        Candy
6   2 West_Coast        Candy
7   2 East_Coast        Candy
8   2 West_Coast        Candy
9   3 Deep_South       YoYo's
10  3 East_Coast       YoYo's
11  3  Southeast       YoYo's
12  3 West_Coast       YoYo's
13  4 West_Coast Toilet_Paper
14  4    Midwest Toilet_Paper
15  4 Deep_South Toilet_Paper
16  4 Deep_South Toilet_Paper
merge(x = d_long, y = d_wide, by = "ID", all.x = TRUE, all.y = TRUE)
   ID     Region     Industry
1   1  Southeast     Hot_Dogs
2   1 East_Coast     Hot_Dogs
3   1 Deep_South     Hot_Dogs
4   1  Southeast     Hot_Dogs
5   2 West_Coast        Candy
6   2 West_Coast        Candy
7   2 East_Coast        Candy
8   2 West_Coast        Candy
9   3 Deep_South       YoYo's
10  3 East_Coast       YoYo's
11  3  Southeast       YoYo's
12  3 West_Coast       YoYo's
13  4 West_Coast Toilet_Paper
14  4    Midwest Toilet_Paper
15  4 Deep_South Toilet_Paper
16  4 Deep_South Toilet_Paper

No NA Matching Problems

join(x = x , y = y , by = "ID", type = "inner")
  k1 k2 ID k1 k2
1 NA  1  1 NA NA
2 NA NA  2  2 NA
3  3 NA  3 NA  3
4  4  4  4  4  4
5  5  5  5  5  5
6  6  6  6  6  6

No Sorting Problems

It does not have sorting issues by default

join(x = d, y = d2, by = "ID", type = "inner")
   ID   Company    Profit   Company     Region      Cost
1   1 Company_A  89.35583 Company_A Deep_South  797.7308
2   2 Company_B 102.83809 Company_B East_Coast 1101.8534
3   3 Company_A  98.06931 Company_A East_Coast  896.0806
4   4 Company_D 110.61756 Company_D Deep_South  967.5251
5   5 Company_D  96.34582 Company_D West_Coast 1158.7691
6   6 Company_B 107.56797 Company_B Deep_South  995.3901
7   7 Company_D  74.31186 Company_D  Southeast  874.8078
8   8 Company_D  95.65268 Company_D East_Coast 1164.3170
9   9 Company_D 101.90891 Company_D    Midwest  901.2424
10 10 Company_A 109.27751 Company_A East_Coast 1060.5660
11 11 Company_B 102.53551 Company_B East_Coast  804.7798
12 12 Company_C  98.26921 Company_C  Southeast  976.1280
13 13 Company_B  86.09252 Company_B    Midwest 1025.3395
14 14 Company_C  88.54110 Company_C  Southeast 1023.7806
15 15 Company_A  95.08599 Company_A  Southeast  765.8460
16 16 Company_D 102.08956 Company_D  Southeast 1138.5084
17 17 Company_A  73.76429 Company_A Deep_South  919.6904
18 18 Company_D  97.95117 Company_D Deep_South 1001.0430
19 19 Company_A  94.66364 Company_A Deep_South 1004.8157
20 20 Company_C  93.46937 Company_C  Southeast  918.1178
21 21 Company_C 112.19280 Company_C East_Coast 1125.9425
22 22 Company_A 116.77361 Company_A East_Coast 1027.5019
23 23 Company_B  99.19894 Company_B West_Coast 1020.3370
24 24 Company_C  97.37098 Company_C  Southeast  879.4884
25 25 Company_C 105.48620 Company_C East_Coast 1269.7293
26 26 Company_D 101.03255 Company_D    Midwest 1020.5146
27 27 Company_D 101.25741 Company_D Deep_South  750.3661
28 28 Company_C  91.17284 Company_C Deep_South  921.7680
29 29 Company_C 102.61665 Company_C    Midwest  841.4575
30 30 Company_A  89.43695 Company_A  Southeast 1197.9956

The data.table Package for R

The data.table package (Dowle and Srinivasan (2017)) for R offers another method for merging data. The CRMDA has prepared a separate guide, with working examples, for data.table Data Table Guide: Basic Usage and Advanced Topics. The data.table package has been especially helpful in dealing with very large data structures, reducing runtime from hours to minutes in several cases. For that reason, we typically prefer to use the data.table package when working with large datasets (several million rows). The ‘data.table’ package merges data frames (which are first converted to data tables) faster than the aforementioned methods. We suggest using this link to learn more about ‘data.table’: https://www.datacamp.com/courses/data-analysis-the-data-table-way.

Example Data

Here we replicate some of the previous examples with ‘data.table’ to demonstrate its use.

library(data.table)

We can take the authors and books data frames from before and convert them to data table objects. We must also set the keys for these data tables. The key(s) indicates which variables to sort the data table by and what to use as a key when merging.

authors
   surname nationality deceased
1    Tukey          US      yes
2 Venables   Australia       no
3  Tierney          US       no
4   Ripley          UK       no
5   McNeil   Australia       no
books
      name                     title     other_author
1    Tukey Exploratory Data Analysis             <NA>
2 Venables Modern Applied Statistics           Ripley
3  Tierney                 LISP-STAT             <NA>
4   Ripley        Spatial Statistics             <NA>
5   Ripley     Stochastic Simulation             <NA>
6   McNeil Interactive Data Analysis             <NA>
7   R Core      An Introduction to R Venables & Smith
authors <- as.data.table(authors)
books <- as.data.table(books)
setkey(authors, surname)
setkey(books, name)

Left, Inner, and Full Joins with data.table

Left Join

It appears backwards, but here is a left join of books (y) onto authors (x).

books[authors]
       name                     title other_author nationality deceased
1:   McNeil Interactive Data Analysis           NA   Australia       no
2:   Ripley        Spatial Statistics           NA          UK       no
3:   Ripley     Stochastic Simulation           NA          UK       no
4:  Tierney                 LISP-STAT           NA          US       no
5:    Tukey Exploratory Data Analysis           NA          US      yes
6: Venables Modern Applied Statistics       Ripley   Australia       no

Right Join

We can reverse it for a right join.

authors[books]
    surname nationality deceased                     title
1:   McNeil   Australia       no Interactive Data Analysis
2:   R Core          NA       NA      An Introduction to R
3:   Ripley          UK       no        Spatial Statistics
4:   Ripley          UK       no     Stochastic Simulation
5:  Tierney          US       no                 LISP-STAT
6:    Tukey          US      yes Exploratory Data Analysis
7: Venables   Australia       no Modern Applied Statistics
       other_author
1:               NA
2: Venables & Smith
3:               NA
4:               NA
5:               NA
6:               NA
7:           Ripley

Inner Join

And specifying ‘nomatch = 0’ indicates an inner join.

authors[books, nomatch=0]
    surname nationality deceased                     title other_author
1:   McNeil   Australia       no Interactive Data Analysis           NA
2:   Ripley          UK       no        Spatial Statistics           NA
3:   Ripley          UK       no     Stochastic Simulation           NA
4:  Tierney          US       no                 LISP-STAT           NA
5:    Tukey          US      yes Exploratory Data Analysis           NA
6: Venables   Australia       no Modern Applied Statistics       Ripley

What about keys with missing values?

Here we have two data frames with information about NFL quarterbacks and how many Super Bowls and league MVP awards they have won.

x <- data.frame("Name" = c("Johnson", NA, "Brady", "Ryan", NA, "Goff"),
                "Titles" = c(1, 0, 5, 0, 1, 0))
                
y <- data.frame("LastName" = c(NA, "Marino", "Brady", "Ryan", "Wilson",
                               "Bryant", NA),
                "MVPs" = c(0, NA, 3, 0, 0, 0, 0))
x
     Name Titles
1 Johnson      1
2    <NA>      0
3   Brady      5
4    Ryan      0
5    <NA>      1
6    Goff      0
y
  LastName MVPs
1     <NA>    0
2   Marino   NA
3    Brady    3
4     Ryan    0
5   Wilson    0
6   Bryant    0
7     <NA>    0

NA Matching Problems

Unfortunately, matches occur on with missing key values.

x <- as.data.table(x)
y <- as.data.table(y)
setkey(x, Name)
setkey(y, LastName)

## Left join
y[x]
   LastName MVPs Titles
1:       NA    0      0
2:       NA    0      0
3:       NA    0      1
4:       NA    0      1
5:    Brady    3      5
6:     Goff   NA      0
7:  Johnson   NA      1
8:     Ryan    0      0
## Right join
x[y]
     Name Titles MVPs
1:     NA      0    0
2:     NA      1    0
3:     NA      0    0
4:     NA      1    0
5:  Brady      5    3
6: Bryant     NA    0
7: Marino     NA   NA
8:   Ryan      0    0
9: Wilson     NA    0
## Inner join
y[x, nomatch = 0]
   LastName MVPs Titles
1:       NA    0      0
2:       NA    0      0
3:       NA    0      1
4:       NA    0      1
5:    Brady    3      5
6:     Ryan    0      0

Clearly, missing values in keys is going to be a problem. Removing cases with missing prior to merging might be the best option.

Multiple rows with same key value

The data.table package allows multiple key values to be specified. This can be useful when a single variable does not uniquely identify each row. In the data created here there are two cases with the last name “Manning.”

x <- data.frame("First" = c("Peyton", "Eli", "Tom", "Matt"),
                "Last" = c("Manning", "Manning", "Brady", "Ryan"),
                "Titles" = c(2, 2, 5, 0))

y <- data.frame("FirstName" = c("Peyton", "Tom", "Matt", "Russell",
                                "Dez", "Von"),
                "LastName" = c("Manning", "Brady", "Ryan", "Wilson",
                               "Bryant", "Miller"),
                "MVPs" = c(3, 3, 1, 0, 0, 0))
x
   First    Last Titles
1 Peyton Manning      2
2    Eli Manning      2
3    Tom   Brady      5
4   Matt    Ryan      0
y
  FirstName LastName MVPs
1    Peyton  Manning    3
2       Tom    Brady    3
3      Matt     Ryan    1
4   Russell   Wilson    0
5       Dez   Bryant    0
6       Von   Miller    0

If we were to merge the data tables using only the last name variables as the key, the results would be strange.

x <- data.table(x)
y <- data.table(y)
setkey(x, Last)
setkey(y, LastName)

## Left join
y[x]
   FirstName LastName MVPs  First Titles
1:       Tom    Brady    3    Tom      5
2:    Peyton  Manning    3 Peyton      2
3:    Peyton  Manning    3    Eli      2
4:      Matt     Ryan    1   Matt      0
## Right join
x[y]
    First    Last Titles FirstName MVPs
1:    Tom   Brady      5       Tom    3
2:     NA  Bryant     NA       Dez    0
3: Peyton Manning      2    Peyton    3
4:    Eli Manning      2    Peyton    3
5:     NA  Miller     NA       Von    0
6:   Matt    Ryan      0      Matt    1
7:     NA  Wilson     NA   Russell    0
## Inner join
y[x, nomatch = 0]
   FirstName LastName MVPs  First Titles
1:       Tom    Brady    3    Tom      5
2:    Peyton  Manning    3 Peyton      2
3:    Peyton  Manning    3    Eli      2
4:      Matt     Ryan    1   Matt      0

Check to see if another variable can be used as a key!

Are the values in ‘x[ , Last]’ unique?

nrow(x) == length(unique(x[ , Last]))
[1] FALSE

How about ‘x[ , Last, First]’?

nrow(x) == nrow(unique(x[ , Last, First]))
[1] TRUE

These two variables uniquely identify each row! We can use both of them as keys.

setkey(x, Last, First)
setkey(y, LastName, FirstName)
## Left join
y[x]
   FirstName LastName MVPs Titles
1:       Tom    Brady    3      5
2:       Eli  Manning   NA      2
3:    Peyton  Manning    3      2
4:      Matt     Ryan    1      0
## Right join
x[y]
     First    Last Titles MVPs
1:     Tom   Brady      5    3
2:     Dez  Bryant     NA    0
3:  Peyton Manning      2    3
4:     Von  Miller     NA    0
5:    Matt    Ryan      0    1
6: Russell  Wilson     NA    0

data.table Summary

The ‘data.table’ package provides the same output from merging as the previously shown methods (with the exception ‘data.table’ producing ‘data.table’ objects, not ‘data.frame’ objects). The major benefit of using ‘data.table’ is the increase in speed when working with large data frames.

References

Dowle, Matt, and Arun Srinivasan. 2017. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.

R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Sharp, Elaine B., and Paul E. Johnson. 2009. “Accounting for Variation in Distrust of Local Police.” Justice Quarterly 26 (1): 157–82. doi:10.1080/07418820802290496.

Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1): 1–29. http://www.jstatsoft.org/v40/i01/.