Merge two dataframes containing duplicate elements

Question

Given two data frames, foo and bar, whose row names and column names partially overlap:

foo <- iris[1:10,-c(4,5)]
#   Sepal.Length Sepal.Width Petal.Length
# 1           5.1         3.5          1.4
# 2           4.9         3.0          1.4
# 3           4.7         3.2          1.3
# 4           4.6         3.1          1.5
# 5           5.0         3.6          1.4
# 6           5.4         3.9          1.7
# 7           4.6         3.4          1.4
# 8           5.0         3.4          1.5
# 9           4.4         2.9          1.4
# 10          4.9         3.1          1.5

bar <- iris[3:13,-c(3,5)]
bar[1:8, ] <- bar[1:8, ] * 2
#    Sepal.Length Sepal.Width Petal.Width
# 3           9.4         6.4         0.4
# 4           9.2         6.2         0.4
# 5          10.0         7.2         0.4
# 6          10.8         7.8         0.8
# 7           9.2         6.8         0.6
# 8          10.0         6.8         0.4
# 9           8.8         5.8         0.4
# 10          9.8         6.2         0.2
# 11          5.4         3.7         0.2
# 12          4.8         3.4         0.2
# 13          4.8         3.0         0.1

How can I merge the dataframes such that both rows and columns are padded for missing cases, while prioritising the results of one dataframe for overlapping elements? In this example, it is the overlapping results in bar that I wish to prioritise.

merge(..., by = "row.names", all = TRUE) is close, in that it retains all 13 rows, and returns missing values as NA:

foobar <- merge(foo, bar, by = "row.names", all = TRUE)
#    Row.names Sepal.Length.x Sepal.Width.x Petal.Length Sepal.Length.y Sepal.Width.y Petal.Width
# 1          1            5.1           3.5          1.4             NA            NA          NA
# 2         10            4.9           3.1          1.5            9.8           6.2         0.2
# 3         11             NA            NA           NA            5.4           3.7         0.2
# 4         12             NA            NA           NA            4.8           3.4         0.2
# 5         13             NA            NA           NA            4.8           3.0         0.1
# 6          2            4.9           3.0          1.4             NA            NA          NA
# 7          3            4.7           3.2          1.3            9.4           6.4         0.4
# 8          4            4.6           3.1          1.5            9.2           6.2         0.4
# 9          5            5.0           3.6          1.4           10.0           7.2         0.4
# 10         6            5.4           3.9          1.7           10.8           7.8         0.8
# 11         7            4.6           3.4          1.4            9.2           6.8         0.6
# 12         8            5.0           3.4          1.5           10.0           6.8         0.4
# 13         9            4.4           2.9          1.4            8.8           5.8         0.4

However, it creates separate .x and .y columns for every column name the two dataframes share, rather than combining them into one.

The desired output would be as such:

#    Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1           5.1         3.5          1.4          NA # unique to foo
# 2           4.9         3.0          1.4          NA # unique to foo
# 3           9.4         6.4          1.3         0.4 # overlap, retained from bar
# 4           9.2         6.2          1.5         0.4 #
# 5          10.0         7.2          1.4         0.4 # .
# 6          10.8         7.8          1.7         0.8 # .
# 7           9.2         6.8          1.4         0.6 # .
# 8          10.0         6.8          1.5         0.4 #
# 9           8.8         5.8          1.4         0.4 #
# 10          9.8         6.2          1.5         0.2 # overlap, retained from bar
# 11          5.4         3.7           NA         0.2 # unique to bar
# 12          4.8         3.4           NA         0.2 # unique to bar
# 13          4.8         3.0           NA         0.1 # unique to bar

My intuition is to subset the data into two disjoint sets, and the set of intersecting elements in bar, then merge these, but I'm sure there is a more elegant solution!
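
A minimal base R sketch of a close variant of that idea, under some assumptions: instead of splitting into disjoint sets, it builds an empty frame covering the union of row and column names, fills it from foo, then overwrites from bar so that bar wins on overlaps. The names all_rows and all_cols are illustrative only. It assumes all columns are numeric and that bar has no NA in the overlapping cells (an NA in bar would overwrite foo's value).

```r
# Rebuild the example data from the question.
foo <- iris[1:10, -c(4, 5)]
bar <- iris[3:13, -c(3, 5)]
bar[1:8, ] <- bar[1:8, ] * 2

# Union of row names and column names, in order of first appearance.
all_rows <- union(rownames(foo), rownames(bar))
all_cols <- union(names(foo), names(bar))

# Start from an all-NA frame with the full shape (assumes numeric columns).
foobar <- as.data.frame(
  matrix(NA_real_, nrow = length(all_rows), ncol = length(all_cols),
         dimnames = list(all_rows, all_cols))
)

# Fill from foo first, then from bar, so bar wins wherever both have a cell.
foobar[rownames(foo), names(foo)] <- foo
foobar[rownames(bar), names(bar)] <- bar

foobar
```

Because the assignment is done by row and column name, the order of the two fill steps is what decides which dataframe takes priority; swapping them would prioritise foo instead.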

shirewoman2 · Accepted Answer · 2014-09-11 23:11:37Z

(Edited) The package plyr is awesome for this sort of thing. Just do:

 library(plyr)
 foo$ID <- row.names(foo)
 bar$ID <- row.names(bar)
 foobar <- join(foo, bar, type = "full", by = "ID")

Joining by row.names didn't work, as flodel noted in the comments, which is why I created a new column, "ID".

edited Sep 11, 2014 at 23:11

answered Sep 11, 2014 at 22:45


6 Comments

flodel Over a year ago

Error in `[.data.frame`(x, by) : undefined columns selected

IRTFM Over a year ago

Furthermore, the help page suggests that we should expect the result to be the same as from merge.

flodel Over a year ago

Now this is not doing the overwriting like the OP wants. Please test and compare with his expected output.

shirewoman2 Over a year ago

Ah, I see... Yes, I think any solution I would have wouldn't be any better than the one voidHead is thinking of.

flodel Over a year ago

join(bar, foo, type = "full", by = "ID", match = "first") seems more like it, if the OP does not care about the order of the rows and columns.


IRTFM · Accepted Answer · 2014-09-11 23:12:56Z

I see the glowing recommendation for plyr::join but do not see how it is much different than what the base merge offers:

 merge(foo, bar, by=c("Sepal.Length", "Sepal.Width"), all=TRUE)
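
For comparison, here is a rough sketch of how base merge might be taken the rest of the way to the output the question asks for. It is not part of this answer: it starts from the question's merge(..., by = "row.names", all = TRUE) call and then collapses each .x/.y pair, preferring bar's value. The variable names m and shared are illustrative, and the reordering step assumes the row names are integer-like, as they are in the example.

```r
# Rebuild the example data from the question.
foo <- iris[1:10, -c(4, 5)]
bar <- iris[3:13, -c(3, 5)]
bar[1:8, ] <- bar[1:8, ] * 2

# Full outer join on row names; shared columns come back as
# <name>.x (from foo) and <name>.y (from bar).
m <- merge(foo, bar, by = "row.names", all = TRUE)

# For each shared column, prefer bar's value (.y) and fall back to foo's (.x).
shared <- intersect(names(foo), names(bar))
for (col in shared) {
  x <- m[[paste0(col, ".x")]]
  y <- m[[paste0(col, ".y")]]
  m[[col]] <- ifelse(is.na(y), x, y)
}

# merge() sorts Row.names as text ("1", "10", "11", ...); restore numeric
# order (assumes integer-like row names), then put the row names back.
m <- m[order(as.integer(as.character(m$Row.names))), ]
rownames(m) <- as.character(m$Row.names)

# Keep only the combined columns, in the order of the desired output.
foobar <- m[, union(names(foo), names(bar))]
foobar
```

Note that this fallback rule treats an NA in bar as "missing", so foo's value would be kept in that case.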

answered Sep 11, 2014 at 23:12


5 Comments

flodel Over a year ago

Well, it is clearly not what the OP wants. Just compare your output with the OP's.

IRTFM Over a year ago

Agreed not clear. I assumed that the difference in the Petal.Width values were explained by laziness on the part of the OP. The missing calculated text values are explained by laziness on my part.

x4nd3r Over a year ago

@BondedDust Which Petal.Width values are you referring to? I constructed the expected output by hand, but I believe it's consistent with the example data.

IRTFM Over a year ago

All those Petal.Length values less than 1.0. There are none such in the original.

x4nd3r Over a year ago

Right you are. Corrected.
