Merge two data tables but avoid duplicate columns

Question

I have two data tables, dat and dat2. There are a few columns that appear in both tables, though the values are not necessarily the same in each.

When I merge the two tables using dat[dat2] everything works as expected, except that I have some duplicate column names. For instance, a column named Status appears in both tables and, when merged, the column from dat2 appears as i.Status. Rather than renaming these columns, I simply want to drop them from the table altogether. What is the simplest way to do this?

You can drop them after the merge or you can subset during the merge.
– lmo, Commented Jun 29, 2016 at 16:27
The simplest way is to pass the columns you want to keep from the join to the j argument.
– jangorecki, Commented Jun 29, 2016 at 16:33
I'm trying to avoid manually typing out all of the column names. I can drop the columns that start with i. after the merge, but that's a hack.
– Jeff, Commented Jun 29, 2016 at 16:59

lmo · Accepted Answer · 2016-06-29 17:06:05Z

The code below illustrates the two approaches I mentioned (tested with data.table version 1.9.6), though there may be fancier, more efficient data.table methods.

Both methods will dynamically adapt to the variable overlap, so you don't have to worry about manually typing out the names.

# load the package and get some data
library(data.table)
set.seed(1234)
dt <- data.table(id=1:10, a=letters[1:10], b=rnorm(10), d=rnorm(10))
dt2 <- data.table(id=1:10, a=letters[5:14], c=rnorm(10), d=rnorm(10))

Here's the data without dropping:

dt[dt2, on="id"]

    id a          b           d i.a          c        i.d
 1:  1 a -1.2070657 -0.47719270   e  0.1340882  1.1022975
 2:  2 b  0.2774292 -0.99838644   f -0.4906859 -0.4755931
 3:  3 c  1.0844412 -0.77625389   g -0.4405479 -0.7094400
 4:  4 d -2.3456977  0.06445882   h  0.4595894 -0.5012581
 5:  5 e  0.4291247  0.95949406   i -0.6937202 -1.6290935
 6:  6 f  0.5060559 -0.11028549   j -1.4482049 -1.1676193
 7:  7 g -0.5747400 -0.51100951   k  0.5747557 -2.1800396
 8:  8 h -0.5466319 -0.91119542   l -1.0236557 -1.3409932
 9:  9 i -0.5644520 -0.83717168   m -0.0151383 -0.2942939
10: 10 j -0.8900378  2.41583518   n -0.9359486 -0.4658975

Method 1: subset during the merge / join, using the intersect and mget functions to keep only the columns of dt2 that do not also appear in dt.

# assuming your id variable is the first column in both sets:
dropVars <- intersect(names(dt), names(dt2))[-1]

dt[dt2[, mget(names(dt2)[-which(names(dt2) %in% dropVars)])], on="id"]
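One caveat with the line above: if the two tables share no columns besides the key, which() returns an empty vector, and negative indexing with an empty vector selects nothing rather than everything. A minimal sketch of a variant that sidesteps this with setdiff() follows. The names key_cols, drop_vars, keep_i and res are illustrative and not part of the original answer, and the sketch assumes the join key is known by name rather than by position.

```r
library(data.table)

# same example data as above
set.seed(1234)
dt  <- data.table(id = 1:10, a = letters[1:10], b = rnorm(10), d = rnorm(10))
dt2 <- data.table(id = 1:10, a = letters[5:14], c = rnorm(10), d = rnorm(10))

key_cols <- "id"  # assumption: the join key is identified by name

# columns present in both tables, excluding the join key
drop_vars <- setdiff(intersect(names(dt), names(dt2)), key_cols)

# columns of dt2 to carry into the join; equals names(dt2) when nothing overlaps
keep_i <- setdiff(names(dt2), drop_vars)

res <- dt[dt2[, keep_i, with = FALSE], on = key_cols]
names(res)
```

Because setdiff() returns its first argument unchanged when there is nothing to remove, the join still receives all of dt2 in the no-overlap case.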

Method 2: drop the i.-prefixed columns after the merge, using grep.

dt3 <- dt[dt2, on="id"]
dt3[, grep("^i\\.", names(dt3), value=TRUE) := NULL]
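Note that := NULL removes the columns from dt3 by reference. If the full join result is still needed, a non-destructive variant might look like the sketch below, which selects the columns to keep instead of deleting the others. The names keep and dt4 are illustrative. Either form also assumes that no original column of dt has a name that itself starts with "i.", since such a column would be dropped as well.

```r
library(data.table)

# same example data as above
set.seed(1234)
dt  <- data.table(id = 1:10, a = letters[1:10], b = rnorm(10), d = rnorm(10))
dt2 <- data.table(id = 1:10, a = letters[5:14], c = rnorm(10), d = rnorm(10))

dt3 <- dt[dt2, on = "id"]

# names that do NOT start with "i." (invert = TRUE returns the non-matches)
keep <- grep("^i\\.", names(dt3), value = TRUE, invert = TRUE)

# build a new table; dt3 itself is left untouched
dt4 <- dt3[, keep, with = FALSE]
names(dt4)
```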

Both of these methods return:

    id a          b           d          c
 1:  1 a -1.2070657 -0.47719270  0.1340882
 2:  2 b  0.2774292 -0.99838644 -0.4906859
 3:  3 c  1.0844412 -0.77625389 -0.4405479
 4:  4 d -2.3456977  0.06445882  0.4595894
 5:  5 e  0.4291247  0.95949406 -0.6937202
 6:  6 f  0.5060559 -0.11028549 -1.4482049
 7:  7 g -0.5747400 -0.51100951  0.5747557
 8:  8 h -0.5466319 -0.91119542 -1.0236557
 9:  9 i -0.5644520 -0.83717168 -0.0151383
10: 10 j -0.8900378  2.41583518 -0.9359486


2 Comments

David Arenburg Over a year ago

or keep <- union(names(dt), names(dt2)) ; dt[dt2, mget(keep), on = "id"]

lmo Over a year ago

@DavidArenburg That's better, it avoids the irritating -which() syntax.
