18

I have two dataframes each with multiple rows per ID. I need to return the closest date and related data from the second dataframe based on the ID and date of the first dataframe - adding the related data to the first dataframe. This also has to work with NAs present in the second dataframe. Example data:

set.seed(42)
df1 <- data.frame(ID = sample(1:3, 10, rep=T), dateTarget=(strptime((paste
    (sprintf("%02d", sample(1:30,10, rep=T)), sprintf("%02d", sample(1:12,10, rep=T)), 
     (sprintf("%02d", sample(2013:2015,10, rep=T))), sep="")),"%d%m%Y")), Value=sample(15:100, 10, rep=T))
df2 <- data.frame(ID = sample(1:3, 10, rep=T), dateTarget=(strptime((paste
     (sprintf("%02d", sample(1:30,20, rep=T)), sprintf("%02d", sample(1:12,20, rep=T)), 
     (sprintf("%02d", sample(2013:2015,20, rep=T))), sep="")),"%d%m%Y")), ValueMatch=sample(15:100, 20, rep=T))

Something from base preferable - split and a mixture of the apply family?

The final table would look something like:

  ID dateTarget Value dateMatch ValueMatch
1  3   22-02-15    52  09-03-15         94
2  1   29-12-14    18  06-12-14         88
3  3   08-12-15    98  06-07-15         48
4  2   14-01-13    52  08-04-13         77
5  2   29-07-15    97  01-08-15         68
6  3   30-05-13    91  01-04-13         85
7  1   04-11-13    70  21-02-14         35
8  2   15-06-15    98  01-08-15         68
9  3   17-11-14    68  15-12-14         95

P.S. Are there better ways of generating random dates (not using seq.Date)?

2
  • For your "P.S." you should be able to adapt stackoverflow.com/questions/14720983/… with an as.Date at the end of the function (and, perhaps a format if you need it in %d-%m-%Y in the data frame) Commented Jan 21, 2015 at 17:15
  • 1
    You can also do something like Sys.Date() + sample(-1000:1000, 20) if you don't care too much about start / end dates Commented Jan 21, 2015 at 17:36

4 Answers 4

16

Here is the solution based on the base package:

z <- lapply(intersect(df1$ID,df2$ID),function(id) {
   d1 <- subset(df1,ID==id)
   d2 <- subset(df2,ID==id)

   d1$indices <- sapply(d1$dateTarget,function(d) which.min(abs(d2$dateTarget - d)))
   d2$indices <- 1:nrow(d2)

   merge(d1,d2,by=c('ID','indices'))
  })

z2 <- do.call(rbind,z)
z2$indices <- NULL

print(z2)

#    ID dateTarget.x Value dateTarget.y ValueMatch
# 1   3   2015-11-14    47   2015-07-06         48
# 2   3   2015-12-08    98   2015-07-06         48
# 3   3   2015-02-22    52   2015-03-09         94
# 4   3   2014-11-17    68   2014-12-15         95
# 5   3   2013-05-30    91   2013-04-01         85
# 6   1   2013-11-04    70   2014-02-21         35
# 7   1   2014-12-29    18   2014-12-06         88
# 8   2   2013-01-14    52   2013-04-08         77
# 9   2   2015-07-29    97   2015-08-01         68
# 10  2   2015-06-15    98   2015-08-01         68
Sign up to request clarification or add additional context in comments.

1 Comment

Works indeed. I'll apply to the real dataset and work through understanding it better.
14

Using data.table, simple and elegant solution:

library(data.table)

setDT(df1)
setDT(df2)

setkey(df2, ID, dateTarget)[, dateMatch:=dateTarget]
df2[df1, roll='nearest']

    ID dateTarget ValueMatch  dateMatch Value
 1:  3 2015-11-14         48 2015-07-06    47
 2:  3 2015-02-22         94 2015-03-09    52
 3:  1 2014-12-29         88 2014-12-06    18
 4:  3 2015-12-08         48 2015-07-06    98
 5:  2 2013-01-14         77 2013-04-08    52
 6:  2 2015-07-29         68 2015-08-01    97
 7:  3 2013-05-30         85 2013-04-01    91
 8:  1 2013-11-04         35 2014-02-21    70
 9:  2 2015-06-15         68 2015-08-01    98
10:  3 2014-11-17         95 2014-12-15    68

4 Comments

Good. I did try data.table from this example, but was stuck on how to use two variables as key.
Here is a very very good intro (10 mins reading) presenting what you asked: cran.r-project.org/web/packages/data.table/vignettes/…
Hello, I am following the exact same code but I am getting the following error Error in bmerge(i, x, leftcols, rightcols, xo, roll, rollends, nomatch, : typeof x.IMO (double) != typeof i.name (character) could you please help?
@AmitR.Pathak, try adding setkey(df2, ID, dateTarget)
3

Here's my take using dplyr, based on the accepted answer. I wanted to have a bit more freedom on the grouping column.


match_by_group_date <- function(df1, df2, grp, datecol) {
  
  grp1 <- df1 %>% pull({{grp}}) %>% unique()
  grp2 <- df2 %>% pull({{grp}}) %>% unique()
  
  li <-
  lapply(intersect(grp1, grp2), function(tt) {
    d1 <- filter(df1, {{grp}}== tt)
    d2 <- filter(df2, {{grp}}==tt) %>% mutate(indices = 1:n())
    d2_date <- d2 %>% pull({{datecol}}) %>% as.POSIXct()
    print(d2_date)
    d1 <- mutate(d1, indices = map_dbl({{datecol}}, function(d) which.min(abs(d2_date - as.POSIXct(d)))))
    
    left_join(d1,d2, by=c(quo_name(enquo(grp)), "indices"))
  })
  
  # bind rows
  return(bind_rows(li))
}

Update

As of 2022, there is a join_by() in the works. See dplyr dev docs here

https://dplyr.tidyverse.org/dev/reference/join_by.html

For now I will continue using this method, or data.table. But join_by() will probably get stable enough that is fast and preferred.

Comments

2

We can also do this by one-liner with dplyr.

library(dplyr)

left_join(df1, df2, by = "ID") %>%
  mutate(dateDiff = abs(dateTarget.x - dateTarget.y)) %>%
  group_by(ID, dateTarget.x) %>%
  filter(dateDiff == min(dateDiff))

1 Comment

This works, but creates a very large dataframe of nrow(df_1)*nrow(row_df2) prior to filtering

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.