Update existing data.frame with values from another one if missing

Question

I'm looking for (1) the name of the following operation and (2) a cleaner way to do it in R (base and data.table preferred).

Input

> d1
  id  x  y
1  1  1 NA
2  2 NA  3
3  3  4 NA
> d2
  id  x  y z
1  4 NA 30 a
2  3 20  2 b
3  2 14 NA c
4  1 15 97 d

(note that the actual data.frames have hundreds of columns)

Expected output:

> d1
  id  x  y z
1  1  1 97 d
2  2 14  3 c
3  3  4  2 b

Data and current solution:

d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97), z = letters[1:4])

for (col in setdiff(names(d1), "id")) {
  # If missing look in d2
  missing <- is.na(d1[[col]])
  d1[missing, col] <- d2[match(d1$id[missing], d2$id), col]
} 
for (col in setdiff(names(d2), names(d1))) {
  # If column missing then add
  d1[[col]] <- d2[match(d1$id, d2$id), col]
}

PS:

This question has likely been asked before, but I lack the vocabulary to search for it.

Maybe the word you are looking for is coalesce?

– chinsoon12, Commented May 31, 2018 at 8:40
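
For readers unfamiliar with the term, "coalesce" here means taking, element by element, the first non-missing value. A minimal sketch on the `x` column of the example, assuming the values from d2 have already been reordered to match the ids in d1 (`coalesce2` is a hypothetical helper name, not a base R function):

```r
# Hypothetical helper: keep `a` where it is present, otherwise fall back to `b`.
coalesce2 <- function(a, b) {
  stopifnot(length(a) == length(b))
  a[is.na(a)] <- b[is.na(a)]
  a
}

x_d1 <- c(1, NA, 4)     # d1$x for ids 1, 2, 3
x_d2 <- c(15, 14, 20)   # d2$x reordered to ids 1, 2, 3
coalesce2(x_d1, x_d2)
# [1]  1 14  4
```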

chinsoon12 · Accepted Answer · 2018-06-01 01:44:54Z

2

Assuming you are working with 2 data.frames, here is a base solution:

#expand d1 to have the same columns as d2
d <- merge(d1, d2[, c("id", setdiff(names(d2), names(d1))), drop=FALSE], 
    by="id", all.x=TRUE, all.y=FALSE)

#make sure that d2 also has the same columns as d1
d2 <- merge(d2, d1[, c("id", setdiff(names(d1), names(d2))), drop=FALSE], 
    by="id", all.x=TRUE, all.y=FALSE)

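#align the rows and columns of d2 with those of d
#(merge sorts d by id, so match against d$id rather than d1$id)
mask <- d2[match(d$id, d2$id), names(d)]

#replace the NAs in d with the corresponding values from mask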
replace(d, is.na(d), mask[is.na(d)])

If you don't mind, we can rewrite your question into a general matrix-coalesce question (i.e. any number of matrices, columns, rows), which does not seem to have been asked before.

edit:

Another base R solution is a hack of coalesce1a from "How to implement coalesce efficiently in R":

coalesce.mat <- function(...) {
    ans <- ..1  
    for (elt in list(...)[-1]) {
        rn <- match(ans$id, elt$id)
        ans[is.na(ans)] <- elt[rn, names(ans)][is.na(ans)]
    }
    ans         
}

allcols <- Reduce(union, lapply(list(d1, d2), names))
do.call(coalesce.mat, 
    lapply(list(d1, d2), function(x) {
        x[, setdiff(allcols, names(x))] <- NA
        x 
    }))

edit:

A possible data.table solution using coalesce1a by Martin Morgan, from "How to implement coalesce efficiently in R":

coalesce1a <- function(...) {
    ans <- ..1
    for (elt in list(...)[-1]) {
        i <- which(is.na(ans))
        ans[i] <- elt[i]
    }
    ans
}

library(data.table)

setDT(d1)
setDT(d2)

#melt into long format and full outer join the two
mdt <- merge(melt(d1, id.vars="id"), melt(d2, id.vars="id"), by=c("id","variable"), all=TRUE)

#perform a coalesce on the value columns
mdt[, value := do.call(coalesce1a, .SD), .SDcols=grep("value", names(mdt), value=TRUE)]

#pivot back into the original format and subset to the ids in d1
dcast(mdt, id ~ variable, value.var="value")[
    d1, .SD, on=.(id)]

edited Jun 1, 2018 at 1:44

answered May 31, 2018 at 8:52

chinsoon12


2 Comments

s_baldur Over a year ago

Maybe best you write it from scratch and just mention you were inspired by this question?

s_baldur Over a year ago

d2[match(d1$id, d2$id), names(d)] fails when d1 contains columns that are absent from d2, so maybe amend to: intersect(names(d1), names(d2))

Maurits Evers · Accepted Answer · 2018-05-31 08:34:02Z

2

Here is a possibility using dplyr::left_join:

library(dplyr)

left_join(d1, d2, by = "id") %>%
    mutate(
        x = ifelse(!is.na(x.x), x.x, x.y),
        y = ifelse(!is.na(y.x), y.x, y.y)) %>%
    select(id, x, y, z)
#  id  x  y z
#1  1  1 97 d
#2  2 14  3 c
#3  3  4  2 b

answered May 31, 2018 at 8:34

Maurits Evers


1 Comment

s_baldur Over a year ago

I failed to mention this in my original post, but the actual data.frames have hundreds of columns, so I would need to add more automation to this.
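
One way the join-then-fill idea might be automated for many columns is to loop over the shared column names instead of writing one `ifelse` per column. This is only a sketch under the assumptions that `id` is the sole key and that shared columns have matching types; the names `joined`, `common` and `result` are illustrative:

```r
library(dplyr)

d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97), z = letters[1:4])

# Shared non-key columns get the suffixes .x (from d1) and .y (from d2) in the join.
joined <- left_join(d1, d2, by = "id", suffix = c(".x", ".y"))
common <- setdiff(intersect(names(d1), names(d2)), "id")

# Fill each shared column from d1 first, falling back to d2 where d1 is NA.
for (col in common) {
  joined[[col]] <- coalesce(joined[[paste0(col, ".x")]], joined[[paste0(col, ".y")]])
}

# Keep the original column names only, dropping the suffixed helpers.
result <- joined[, union(names(d1), names(d2))]
result
```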

akrun · Accepted Answer · 2018-05-31 08:59:48Z

1

We can use data.table with coalesce from dplyr. Create vectors of the column names that are common to both datasets ('nm1') and of those that appear only in the second ('nm2'). Convert the first dataset to a 'data.table' (setDT(d1)), join on the 'id' column, and assign (:=) the coalesced values of the common columns (the copies from the second dataset carry the prefix i.) together with the new columns. This updates the values in the first dataset in place:

library(data.table)
nm1 <- setdiff(intersect(names(d1), names(d2)), 'id')
nm2 <- setdiff(names(d2), names(d1))
setDT(d1)[d2, c(nm1, nm2) := c(Map(dplyr::coalesce, mget(nm1), 
              mget(paste0("i.", nm1))), mget(nm2)), on = .(id)]
d1
#   id  x  y z
#1:  1  1 97 d
#2:  2 14  3 c
#3:  3  4  2 b

edited May 31, 2018 at 8:59

answered May 31, 2018 at 8:34

akrun


2 Comments

s_baldur Over a year ago

I get an error on the actual data (though not on the example data): Error: Argument 2 must be type character, not double

akrun Over a year ago

@snoram Thanks for notifying me. We are assigning the values to the original dataset. Please check if you have any mismatch in type. If that is the case, change those mismatches first and then do the update
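
A quick way to look for such mismatches before the update might be to compare the class of each shared column. The sketch below uses small hypothetical frames `a` and `b` rather than the real data, and the direction of the conversion at the end is an assumption that depends on what the column should actually hold:

```r
# Hypothetical data: `y` is numeric in `a` but character in `b`.
a <- data.frame(id = 1:2, y = c(NA, 3))
b <- data.frame(id = 1:2, y = c("97", "5"), stringsAsFactors = FALSE)

common <- intersect(names(a), names(b))

# First class of each shared column in both datasets.
class_a <- vapply(common, function(nm) class(a[[nm]])[1], character(1))
class_b <- vapply(common, function(nm) class(b[[nm]])[1], character(1))

# Shared columns whose types disagree.
mismatched <- common[class_a != class_b]
mismatched
# [1] "y"

# Assumed fix for this toy case: convert the character column to numeric.
b$y <- as.numeric(b$y)
```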
