7

I would like to merge two dataframes. There are some shared variables and some different variables and there are different numbers of rows in each dataframe. The dataframes share some rows, but not all. And both dataframes have missing data that the other my have.

DF1:

name age weight height
Tim 7 54 112
Dave 5 50 NA
Larry NA 42 73
Rob 1 30 43

DF2:

name age weight height grade
Tim 7 NA 112 2
Dave NA 50 103 1
Larry 3 NA 73 NA
Rob 1 30 NA NA
John 6 60 NA 1
Tom 8 61 112 2

I want to merge these two dataframes together by the shared columns (name, age, weight, and height). However, I want NAs to be overridden, such that if one of the two dataframes has a value where the other has NA, I want the value to be carried through into the third dataframe. Ideally, the last dataframe should only have NAs when both DF1 and DF2 had NAs in that same location.

Ideal Data Frame

name age weight height grade
Tim 7 54 112 2
Dave 5 50 103 1
Larry 3 42 73 NA
Rob 1 30 43 NA
John 6 60 NA 1
Tom 8 61 112 2

I've been using full_join and left_join, but I don't know how to merge these in such a way that NAs are replaced with actual data (if it is present in one of the dataframes). Is there a way to do this?

4
  • 2
    Does this answer your question? merge two uneven dataframes by ID and fill in missing values Commented Apr 27, 2022 at 13:22
  • 1
    You could do a "coalescing join" alistaire.rbind.io/blog/coalescing-joins Commented Apr 27, 2022 at 13:24
  • What should happen if df1 and df2 both contain non-NA values, but different? Also, I am assuming the "name" column contains unique values? Commented Apr 27, 2022 at 14:14
  • Has this issue been resolved? You can choose the answer that best meets your requirements and click its checkmark button. Commented Dec 25, 2024 at 3:56

3 Answers 3

6

This is a typical case that rows_patch() from dplyr can treat.

library(dplyr)

rows_patch(df2, df1, by = "name")

   name age weight height grade
1   Tim   7     54    112     2
2  Dave   5     50    103     1
3 Larry   3     42     73    NA
4   Rob   1     30     43    NA
5  John   6     60     NA     1
6   Tom   8     61    112     2

Data
df1 <- structure(list(name = c("Tim", "Dave", "Larry", "Rob"), age = c(7L, 
5L, NA, 1L), weight = c(54L, 50L, 42L, 30L), height = c(112L, 
NA, 73L, 43L)), class = "data.frame", row.names = c(NA, -4L))

df2 <- structure(list(name = c("Tim", "Dave", "Larry", "Rob", "John", 
"Tom"), age = c(7L, NA, 3L, 1L, 6L, 8L), weight = c(NA, 50L, 
NA, 30L, 60L, 61L), height = c(112L, 103L, 73L, NA, NA, 112L), 
grade = c(2L, 1L, NA, NA, 1L, 2L)), class = "data.frame", row.names = c(NA, -6L))
Sign up to request clarification or add additional context in comments.

Comments

1

Another possible solution:

library(tidyverse)

df2 %>% 
  bind_rows(df1) %>% 
  group_by(name) %>% 
  fill(age:grade, .direction = "updown") %>% 
  ungroup %>% 
  distinct

#> # A tibble: 6 x 5
#>   name    age weight height grade
#>   <chr> <int>  <int>  <int> <int>
#> 1 Tim       7     54    112     2
#> 2 Dave      5     50    103     1
#> 3 Larry     3     42     73    NA
#> 4 Rob       1     30     43    NA
#> 5 John      6     60     NA     1
#> 6 Tom       8     61    112     2

Comments

0

I like the powerjoin package suggested as an answer to the question in the first comment, which I had never heard of before.

However, if you want to avoid using extra packages, you can do it in base R. This approach also avoids having to explicitly name each column - the dplyr approaches suggested in the comments do not do that, although perhaps could be modified.

# Load data

df1  <- read.table(text = "name age weight  height
Tim 7   54  112
Dave    5   50  NA
Larry   NA  42  73
Rob 1   30  43", header=TRUE)
df2  <- read.table(text = "name age weight  height  grade
Tim 7   NA  112 2
Dave    NA  50  103 1
Larry   3   NA  73  NA
Rob 1   30  NA  NA
John    6   60  NA  1
Tom 8   61  112 2", header=TRUE)


df3  <- merge(df1, df2, by = "name", all = TRUE, sort=FALSE)

# Coalesce the common columns
common_cols  <- names(df1)[names(df1)!="name"]
df3[common_cols]  <- lapply(common_cols, function(col) {
    coalesce(df3[[paste0(col, ".x")]], df3[[paste0(col, ".y")]])
}) 

# Select desired columns
df3[names(df2)]

#    name age weight height grade
# 1   Tim   7     54    112     2
# 2  Dave   5     50    103     1
# 3 Larry   3     42     73    NA
# 4   Rob   1     30     43    NA
# 5  John   6     60     NA     1
# 6   Tom   8     61    112     2

There are advantages to using base R, but powerjoin looks like an interesting package too.

2 Comments

Note that coalesce is from the dplyr package
Good point! That's what happens when you have dplyr loaded all the time!

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.