2

Let's say I have the following dfs

df1:
a b c d
1 2 3 4
4 3 3 4
9 7 3 4

df2:
a b c d
1 2 3 4
2 2 3 4
3 2 3 4

Now I want to merge both dfs conditional of column "a" to give me the following df

a b c d
1 2 3 4
4 3 3 4
9 7 3 4
2 2 3 4
3 2 3 4

In my dataset i tried using

merge <- merge(x = df1, y = df2, by = "a", all = TRUE)

However, while df1 has 50,000 entries and df2 has 100,000 entries and there are definately matching values in column a the merged df has over one million entries. I do not understand this. As I understand there should be max. 150,000 entries in the merged df and this is the case when no values in column a are equal between the two dfs.

1
  • Using the example datasets you provided above, the example works. This means it is likely something to do with your datasets' format/structure. What is the output of str(df1) and str(df2)? One thing to be conscious of is that if the column names are the same in the two datasets it will make new ones by adding .y to the columns in the second dataset. I'm assuming this was just for the example though. Commented Mar 16, 2020 at 17:24

4 Answers 4

3

I think what you want to do is not mergebut rather rbind the two dataframes and remove the duplicated rows:

DATA:

df1 <- data.frame(a = c(1,4,9),
                  b = c(2,3,7),
                  c = c(3,3,3),
                  d = c(4,4,4))
df2 <- data.frame(a = c(1,2,3),
                  b = c(2,2,2),
                  c = c(3,3,3),
                  d = c(4,4,4))

SOLUTION:

Row-bind df1and df2:

df3 <- rbind(df1, df2)

Remove the duplicate rows:

df3 <- df3[!duplicated(df3), ]

RESULT:

df3
  a b c d
1 1 2 3 4
2 4 3 3 4
3 9 7 3 4
5 2 2 3 4
6 3 2 3 4
Sign up to request clarification or add additional context in comments.

Comments

1

With tidyverse, we can do bind_rows and distinct

library(dplyr)
bind_rows(df1, df2) %>%
     distinct

data

df1 <- structure(list(a = c(1, 4, 9), b = c(2, 3, 7), c = c(3, 3, 3), 
    d = c(4, 4, 4)), class = "data.frame", row.names = c(NA, 
-3L))

df2 <- structure(list(a = c(1, 2, 3), b = c(2, 2, 2), c = c(3, 3, 3), 
    d = c(4, 4, 4)), class = "data.frame", row.names = c(NA, 
-3L))

Comments

0

it is possible so

dplyr::union(df1, df2)

Comments

0

here is another base R solution using rbind + %in%

dfout <- rbind(df1,subset(df2,!a %in% df1$a))

such that

> rbind(df1,subset(df2,!a %in% df1$a))
   a b c d
1  1 2 3 4
2  4 3 3 4
3  9 7 3 4
21 2 2 3 4
31 3 2 3 4

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.