1

I have two data.frames with thousands of rows and dozens of columns each, both created by merging several csv files. The data.frames are exactly what I wanted. I shall also add that that df1 and df2 have several columns in common. Only problem is that in one of them, say df1 there are, for some columns, some NAs (this is expected/normal). The good thing is that the same columns in which I have NAs also occur in the second data.frame, say df2, but with no NAs. What I would like to do is to fill the NAs in a given column of df2 with values from the same column of df1 where each value is selected based on the matching of other columns value between df1 and df2.

Using some random data:

A<- format(seq(as.Date("2021/09/01"), by = "day", length.out = 26), format="%Y%m%d")
B<- format(seq(as.POSIXct("2021-1-1 15:00"), as.POSIXct("2021-1-02 16:00"), by = "hour"), format = "%H:%M")
C<- sample(1:100, 26, replace=FALSE)
D<- LETTERS
E<- paste(D, C)

df1<- cbind(A, B, C, D)
df2<- cbind(A, B, C, E)
df2[c(7:10, 13, 18:21), 3] <- NA #replace some of the values with NAs
df2<- df2[-c(2,5,11,17,23,26),] #delete some columns so df1 and df2 are of different size

I would like that for df2$C, when a NA value is found, both the corresponding df2$A and df2$B values are matched against all df1$A and df2$B. When, and only if, a row in df1 is found with matching df1$A and df1$B, the corresponding value of df1$C is copied in df2$C NA value. In other words, for the example data above, if there is a NA value in df2$C, the corresponding combination of date and time for that specific row should be matched against all the resulting concatenation of date and time from df1 and when a match is found, the corresponding value of df1$C is used to replace the NA in df2$C.

# as an instance, a case of my df2 where a NA occurs is
df2[17,3] 

#This should be replaced with the value from
df1[21,3] 

# because the time and date of 
df2[17,] 

A          B          C          E 
"20210921"    "11:00"         NA     "U 46" 

#is the same than 
df1[21,] 

A          B          C          D 
"20210921"    "11:00"       "46"        "U" 

I am 100% sure that a concatenation of column A and B in each of the data.frame gives unique results in each data.frame so that a concatenation of df1$A and df1$B will be unique and the same applies to df2. Also, with certainty, there will be one and only value resulting from the concatenation of df2$A and df2$B matching the concatenation of column A and B in df1. For this reason I have tried concatenating for each of the data.frames the columns A and B into a new column to find matching values using match(paste(...,...)) within a elseif loop in Base R and alternatively other solutions by using the dplyr library proposed in other fora, but I can't find a solution that works for me. I think this is similar to an Excel lookup/vlookup function with multiple criteria but I can't figure this one out in R.

Any suggestion on how to proceed? Thanks.

2
  • First thing, you are missing code for 'A'. Second thing, it would be a lot easier to help you if you posted your desired output as code. In other words, show us what your desired output should look like. Commented Sep 3, 2021 at 15:15
  • I have edited the question and the code. Commented Sep 3, 2021 at 16:03

3 Answers 3

1

Since you mention dplyr already; it provides the tools for the job. Simply left_join on the columns you wish to match, then coalesce the results to replace missing values of 'C' with matching values from the second table.

library(dplyr)
df1 <- as.data.frame(df1)
df2 <- as.data.frame(df2)

concatenate <- df2 %>%
    left_join(df1, by = c('A', 'B'), suffix = c('', '.y')) %>%
    mutate(C = coalesce(C, C.y)) %>%
    select(-c(C.y))
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks Colin for this solution that worked absolutely fine with me. In the example data I provided above I forgot the as.data.frame() because my real data are already of class "data.frame" and I overlooked the code in my question. If I really have to be picky, I would modify the last line of your code as select(-c(C.y, D)) to get rid of the extra concatenate$D originally absent from df2.
0

You can create a unique key to update df2.

unique_key1 <- paste(df1$A, df1$B)
unique_key2 <- paste(df2$A, df2$B)
inds <- is.na(df2$C)
df2$C[inds] <- df1$C[match(unique_key2[inds], unique_key1)]
df2

#         A     B  C    E
#1  20210901 15:00 74 A 74
#2  20210903 17:00 27 C 27
#3  20210904 18:00 60 D 60
#4  20210906 20:00  7  F 7
#5  20210907 21:00 96 G 96
#6  20210908 22:00 98 H 98
#7  20210909 23:00 38 I 38
#8  20210910 00:00 89 J 89
#9  20210912 02:00 69 L 69
#10 20210913 03:00 72 M 72
#11 20210914 04:00 76 N 76
#12 20210915 05:00 63 O 63
#13 20210916 06:00 13 P 13
#14 20210918 08:00 25 R 25
#15 20210919 09:00 92 S 92
#16 20210920 10:00 21 T 21
#17 20210921 11:00 79 U 79
#18 20210922 12:00 41 V 41
#19 20210924 14:00 97 X 97
#20 20210925 15:00 16 Y 16

data

cbind creates a matrix, use data.frame to create dataframes.

df1 <- data.frame(A, B, C, D)
df2 <- data.frame(A, B, C, E)

1 Comment

Thanks Ronak for this solution that worked absolutely fine with me. The solution is very similar to what I tried to do initially with an elseif() function but for some reasons the resulting df2$C was left with several NAs despite values were available in df1$C for replacement. Here is what I had df2$C <- ifelse(is.na(df2$C), match(paste(df1$A, df1$B), paste(df2$A, df2$B)), df2$C) and I can't figure out why it didn't work.
0

How about dplyr::rows_patch? This works especially well for multiple columns that need coalescing.

library(dplyr)
df1 <- as.data.frame(df1)
df2 <- as.data.frame(df2) %>% select(any_of(names(df1)))
dplyr::rows_patch(df1, df2)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.