Replacing NAs with existing data when merging two dataframes in R [duplicate]

Question

I would like to merge two dataframes. They share some variables but not others, and they have different numbers of rows. The dataframes share some rows, but not all. Both dataframes also have missing data that the other may have.

DF1:

name	age	weight	height
Tim	7	54	112
Dave	5	50	NA
Larry	NA	42	73
Rob	1	30	43

DF2:

name	age	weight	height	grade
Tim	7	NA	112	2
Dave	NA	50	103	1
Larry	3	NA	73	NA
Rob	1	30	NA	NA
John	6	60	NA	1
Tom	8	61	112	2

I want to merge these two dataframes together by the shared columns (name, age, weight, and height). However, I want NAs to be overridden, such that if one of the two dataframes has a value where the other has NA, I want the value to be carried through into the third dataframe. Ideally, the last dataframe should only have NAs when both DF1 and DF2 had NAs in that same location.

Ideal Data Frame

name	age	weight	height	grade
Tim	7	54	112	2
Dave	5	50	103	1
Larry	3	42	73	NA
Rob	1	30	43	NA
John	6	60	NA	1
Tom	8	61	112	2

I've been using full_join and left_join, but I don't know how to merge these in such a way that NAs are replaced with actual data (if it is present in one of the dataframes). Is there a way to do this?

Does this answer your question? "merge two uneven dataframes by ID and fill in missing values"
– benson23, Commented Apr 27, 2022 at 13:22
You could do a "coalescing join": alistaire.rbind.io/blog/coalescing-joins
– Skaqqs, Commented Apr 27, 2022 at 13:24
What should happen if df1 and df2 both contain non-NA values that differ? Also, I am assuming the "name" column contains unique values?
– Ottie, Commented Apr 27, 2022 at 14:14
Has this issue been resolved? You can choose the answer that best meets your requirements and click its checkmark button.
– Darren Tsai, Commented Dec 25, 2024 at 3:56
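
The "coalescing join" mentioned in the comments can be sketched roughly as follows. This is only an illustration, assuming dplyr is installed and that "name" is unique in each data frame; the variable names shared, joined and coalesced are made up for the example. It does a full join on name and then, for every shared column, takes the df1 value and falls back to the df2 value when the df1 value is NA.

```r
library(dplyr)

# Example data from the question
df1 <- data.frame(
  name   = c("Tim", "Dave", "Larry", "Rob"),
  age    = c(7L, 5L, NA, 1L),
  weight = c(54L, 50L, 42L, 30L),
  height = c(112L, NA, 73L, 43L)
)
df2 <- data.frame(
  name   = c("Tim", "Dave", "Larry", "Rob", "John", "Tom"),
  age    = c(7L, NA, 3L, 1L, 6L, 8L),
  weight = c(NA, 50L, NA, 30L, 60L, 61L),
  height = c(112L, 103L, 73L, NA, NA, 112L),
  grade  = c(2L, 1L, NA, NA, 1L, 2L)
)

# Columns present in both data frames, excluding the join key
shared <- setdiff(intersect(names(df1), names(df2)), "name")

# Full join: shared columns come back twice, suffixed .x (df1) and .y (df2)
joined <- full_join(df1, df2, by = "name", suffix = c(".x", ".y"))

# For each shared column, keep the df1 value and fall back to df2 when it is NA
for (col in shared) {
  joined[[col]] <- coalesce(joined[[paste0(col, ".x")]],
                            joined[[paste0(col, ".y")]])
}

# Drop the suffixed helper columns by selecting the original column layout
coalesced <- joined[names(df2)]
coalesced
```

If both data frames hold different non-NA values for the same cell, this sketch silently keeps the df1 value.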

Darren Tsai · Accepted Answer · 2022-04-27 13:34:02Z

6

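This is a typical use case for rows_patch() from dplyr, which fills the NA values of its first argument with the matching values from its second argument. df2 goes first because it contains every name and every column that appears in df1.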

library(dplyr)

rows_patch(df2, df1, by = "name")

   name age weight height grade
1   Tim   7     54    112     2
2  Dave   5     50    103     1
3 Larry   3     42     73    NA
4   Rob   1     30     43    NA
5  John   6     60     NA     1
6   Tom   8     61    112     2
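
To address the question raised in the comments about conflicting values, here is a small sketch on hypothetical data (x, y and y_extra are invented for the illustration). It suggests that rows_patch() only touches NA cells, so a non-NA value in the first data frame wins over a different value in the second. The unmatched = "ignore" argument used at the end is assumed to be available, which requires dplyr 1.1.0 or later; with the default, a key in the second data frame that is missing from the first raises an error.

```r
library(dplyr)

# Hypothetical data: both frames hold a non-NA age for Tim, but they disagree
x <- data.frame(name = c("Tim", "Dave"), age = c(7L, NA))
y <- data.frame(name = c("Tim", "Dave"), age = c(9L, 5L))

# Only the NA in x is filled; Tim keeps the value from x
patched <- rows_patch(x, y, by = "name")
patched

# y_extra contains a key ("Zed") that does not exist in x.
# Assumes dplyr >= 1.1.0, where unmatched = "ignore" skips such rows
# instead of raising an error.
y_extra <- data.frame(name = c("Dave", "Zed"), age = c(5L, 4L))
patched_ignore <- rows_patch(x, y_extra, by = "name", unmatched = "ignore")
patched_ignore
```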

Data

df1 <- structure(list(name = c("Tim", "Dave", "Larry", "Rob"), age = c(7L, 
5L, NA, 1L), weight = c(54L, 50L, 42L, 30L), height = c(112L, 
NA, 73L, 43L)), class = "data.frame", row.names = c(NA, -4L))

df2 <- structure(list(name = c("Tim", "Dave", "Larry", "Rob", "John", 
"Tom"), age = c(7L, NA, 3L, 1L, 6L, 8L), weight = c(NA, 50L, 
NA, 30L, 60L, 61L), height = c(112L, 103L, 73L, NA, NA, 112L), 
grade = c(2L, 1L, NA, NA, 1L, 2L)), class = "data.frame", row.names = c(NA, -6L))

answered Apr 27, 2022 at 13:34

Darren Tsai


PaulS · Accepted Answer · 2022-04-27 14:17:20Z

1

Another possible solution:

library(tidyverse)

df2 %>% 
  bind_rows(df1) %>% 
  group_by(name) %>% 
  fill(age:grade, .direction = "updown") %>% 
  ungroup %>% 
  distinct

#> # A tibble: 6 x 5
#>   name    age weight height grade
#>   <chr> <int>  <int>  <int> <int>
#> 1 Tim       7     54    112     2
#> 2 Dave      5     50    103     1
#> 3 Larry     3     42     73    NA
#> 4 Rob       1     30     43    NA
#> 5 John      6     60     NA     1
#> 6 Tom       8     61    112     2
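
One caveat worth checking: the pipeline above collapses to one row per name only when the two data frames never disagree on a non-NA value. The sketch below uses hypothetical data (a and b are invented for the illustration) in which both frames give a different age for Tim. fill() has nothing to fill for Tim, so distinct() keeps both rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: the two frames disagree on Tim's age
a <- data.frame(name = c("Tim", "Dave"), age = c(7L, NA))
b <- data.frame(name = c("Tim", "Dave"), age = c(9L, 5L))

res <- b %>%
  bind_rows(a) %>%                        # stack the rows of both frames
  group_by(name) %>%
  fill(age, .direction = "updown") %>%    # fill NAs within each name
  ungroup() %>%
  distinct()                              # drop rows that are now identical

# Dave collapses to a single row, but Tim appears twice (age 9 and age 7)
res
```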

answered Apr 27, 2022 at 14:17

PaulS


SamR · Accepted Answer · 2022-04-27 13:33:18Z

0

I like the powerjoin package suggested as an answer to the question in the first comment, which I had never heard of before.

However, if you want to avoid an extra join package, you can do almost all of it in base R; the only non-base function used below is coalesce(), which comes from dplyr. This approach also avoids having to name each column explicitly. The dplyr approaches suggested in the comments do not do that, although perhaps they could be modified.

# Load data

df1  <- read.table(text = "name age weight  height
Tim 7   54  112
Dave    5   50  NA
Larry   NA  42  73
Rob 1   30  43", header=TRUE)
df2  <- read.table(text = "name age weight  height  grade
Tim 7   NA  112 2
Dave    NA  50  103 1
Larry   3   NA  73  NA
Rob 1   30  NA  NA
John    6   60  NA  1
Tom 8   61  112 2", header=TRUE)


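# Full outer join on name; shared columns get the suffixes .x (df1) and .y (df2)
df3  <- merge(df1, df2, by = "name", all = TRUE, sort=FALSE)

# Coalesce the common columns: take the df1 value, falling back to df2 when it is NA
# Note: coalesce() is from dplyr, so it is called with the dplyr:: prefix here
common_cols  <- names(df1)[names(df1)!="name"]
df3[common_cols]  <- lapply(common_cols, function(col) {
    dplyr::coalesce(df3[[paste0(col, ".x")]], df3[[paste0(col, ".y")]])
}) 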

# Select desired columns
df3[names(df2)]

#    name age weight height grade
# 1   Tim   7     54    112     2
# 2  Dave   5     50    103     1
# 3 Larry   3     42     73    NA
# 4   Rob   1     30     43    NA
# 5  John   6     60     NA     1
# 6   Tom   8     61    112     2

There are advantages to using base R, but powerjoin looks like an interesting package too.
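
For a version with no package dependency at all, coalesce() can be swapped for ifelse(is.na(x), y, x). The following is a minimal sketch of that idea rather than the answer's original code; the names shared, merged and result are made up for the illustration, and it assumes "name" is unique in each data frame.

```r
# Example data from the question
df1 <- data.frame(
  name   = c("Tim", "Dave", "Larry", "Rob"),
  age    = c(7L, 5L, NA, 1L),
  weight = c(54L, 50L, 42L, 30L),
  height = c(112L, NA, 73L, 43L)
)
df2 <- data.frame(
  name   = c("Tim", "Dave", "Larry", "Rob", "John", "Tom"),
  age    = c(7L, NA, 3L, 1L, 6L, 8L),
  weight = c(NA, 50L, NA, 30L, 60L, 61L),
  height = c(112L, 103L, 73L, NA, NA, 112L),
  grade  = c(2L, 1L, NA, NA, 1L, 2L)
)

# Full outer join; shared columns are suffixed .x (df1) and .y (df2)
merged <- merge(df1, df2, by = "name", all = TRUE, sort = FALSE)

# Columns present in both data frames, excluding the join key
shared <- setdiff(intersect(names(df1), names(df2)), "name")

# Base R stand-in for coalesce(): use x unless it is NA, then use y
for (col in shared) {
  x <- merged[[paste0(col, ".x")]]
  y <- merged[[paste0(col, ".y")]]
  merged[[col]] <- ifelse(is.na(x), y, x)
}

# Keep only the original column layout of df2
result <- merged[names(df2)]
result
```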

answered Apr 27, 2022 at 13:33

SamR


2 Comments

benson23 Over a year ago

Note that coalesce is from the dplyr package

SamR Over a year ago

Good point! That's what happens when you have dplyr loaded all the time!
