2

I'm looking for the (1) name and (2) a (cleaner) method in R (base and data.table preferred) of the following.

Input

> d1
  id  x  y
1  1  1 NA
2  2 NA  3
3  3  4 NA
> d2
  id  x  y z
1  4 NA 30 a
2  3 20  2 b
3  2 14 NA c
4  1 15 97 d

(note that the actual data.frames have hundreds of columns)

Expected output:

> d1
  id  x  y z
1  1  1 97 d
2  2 14  3 c
3  3  4  2 b

Data and current solution:

d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97), z = letters[1:4])

for (col in setdiff(names(d1), "id")) {
  # If missing look in d2
  missing <- is.na(d1[[col]])
  d1[missing, col] <- d2[match(d1$id[missing], d2$id), col]
} 
for (col in setdiff(names(d2), names(d1))) {
  # If column missing then add
  d1[[col]] <- d2[match(d1$id, d2$id), col]
}

PS:

Likely this questions has been asked before but I'm lacking in vocabulary to search it.

1
  • 1
    maybe the word u are looking for is coalesce? Commented May 31, 2018 at 8:40

3 Answers 3

2

Assuming you are working with 2 data.frames, here is a base solution

#expand d1 to have the same columns as d2
d <- merge(d1, d2[, c("id", setdiff(names(d2), names(d1))), drop=FALSE], 
    by="id", all.x=TRUE, all.y=FALSE)

#make sure that d2 also have same number of columns as d1
d2 <- merge(d2, d1[, c("id", setdiff(names(d1), names(d2))), drop=FALSE], 
    by="id", all.x=TRUE, all.y=FALSE)

#align rows and columns to match those in d1
mask <- d2[match(d1$id, d2$id), names(d)]

#replace NAs with those mask
replace(d, is.na(d), mask[is.na(d)])

If you dont mind, we can rewrite your question into a general matrix-coalesce question (i.e. any number of matrices, columns, rows) which seems like it has not been asked before.


edit:

Another base R solution is a hack of coalesce1a from How to implement coalesce efficiently in R

coalesce.mat <- function(...) {
    ans <- ..1  
    for (elt in list(...)[-1]) {
        rn <- match(ans$id, elt$id)
        ans[is.na(ans)] <- elt[rn, names(ans)][is.na(ans)]
    }
    ans         
}

allcols <- Reduce(union, lapply(list(d1, d2), names))
do.call(coalesce.mat, 
    lapply(list(d1, d2), function(x) {
        x[, setdiff(allcols, names(x))] <- NA
        x 
    }))

edit:

a possible data.table solution using coalesce1a from How to implement coalesce efficiently in R by Martin Morgan.

coalesce1a <- function(...) {
    ans <- ..1
    for (elt in list(...)[-1]) {
        i <- which(is.na(ans))
        ans[i] <- elt[i]
    }
    ans
}

setDT(d1)
setDT(d2)

#melt into long formats and full outer join the 2
mdt <- merge(melt(d1, id.vars="id"), melt(d2, id.vars="id"), by=c("id","variable"), all=TRUE)

#perform a coalesce on vectors
mdt[, value := do.call(coalesce1a, .SD), .SDcols=grep("value", names(mdt), value=TRUE)]

#pivot into original format and subset to those in d1
dcast.data.table(mdt, id ~ variable, value.var="value")[
    d1, .SD, on=.(id)]
Sign up to request clarification or add additional context in comments.

2 Comments

Maybe best you write it from scratch and just mention you were inspired by this question?
d2[match(d1$id, d2$id), names(d)] fails when d1 contains columns that are not missing, so maybe amend to: intersect(names(d1), names(d2))
2

Here is a possibility using dplyr::left_join:

left_join(d1, d2, by = "id") %>%
    mutate(
        x = ifelse(!is.na(x.x), x.x, x.y),
        y = ifelse(!is.na(y.x), y.x, y.y)) %>%
    select(id, x, y, z)
#  id  x  y z
#1  1  1 97 d
#2  2 14  3 c
#3  3  4  2 b

1 Comment

I failed to mention this in my op but the actual data.frames have hundreds of columns, so I would need to add more automation to this.
1

We can use data.table with coalesce from dplyr. Create a vector of column names that are common ('nm1') and difference ('nm2') in both datasets. Convert the first dataset to 'data.table' (setDT(d1)), join on the 'id' column, assign (:=) the coalesced columns of the first and second (with prefix i. - if there are common columns) to update the values in the first dataset

library(data.table)
nm1 <- setdiff(intersect(names(d1), names(d2)), 'id')
nm2 <- setdiff(names(d2), names(d1))
setDT(d1)[d2, c(nm1, nm2) := c(Map(dplyr::coalesce, mget(nm1), 
              mget(paste0("i.", nm1))), mget(nm2)), on = .(id)]
d1
#   id  x  y z
#1:  1  1 97 d
#2:  2 14  3 c
#3:  3  4  2 b

2 Comments

Get an error on the actual data (not the example data though) Error: Argument 2 must be type character, not double
@snoram Thanks for notifying me. We are assigning the values to the original dataset. Please check if you have any mismatch in type. If that is the case, change those mismatches first and then do the update

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.