Merge nearest date, and related variables, from another dataframe by group

Question

I have two dataframes, each with multiple rows per ID. For each row of the first dataframe, I need to find the row of the second dataframe that has the same ID and the closest date, and add that date and its related data to the first dataframe. This also has to work when NAs are present in the second dataframe. Example data:

set.seed(42)
# Dates are built as ddmmYYYY strings; impossible ones (e.g. 30 February) become NA
df1 <- data.frame(ID = sample(1:3, 10, replace = TRUE),
                  dateTarget = strptime(paste(sprintf("%02d", sample(1:30, 10, replace = TRUE)),
                                              sprintf("%02d", sample(1:12, 10, replace = TRUE)),
                                              sprintf("%02d", sample(2013:2015, 10, replace = TRUE)), sep = ""), "%d%m%Y"),
                  Value = sample(15:100, 10, replace = TRUE))
# df2 has 20 rows; its 10 sampled IDs are recycled to fill them
df2 <- data.frame(ID = sample(1:3, 10, replace = TRUE),
                  dateTarget = strptime(paste(sprintf("%02d", sample(1:30, 20, replace = TRUE)),
                                              sprintf("%02d", sample(1:12, 20, replace = TRUE)),
                                              sprintf("%02d", sample(2013:2015, 20, replace = TRUE)), sep = ""), "%d%m%Y"),
                  ValueMatch = sample(15:100, 20, replace = TRUE))

A base R solution would be preferable, perhaps split() combined with the apply family?

The final table would look something like:

  ID dateTarget Value dateMatch ValueMatch
1  3   22-02-15    52  09-03-15         94
2  1   29-12-14    18  06-12-14         88
3  3   08-12-15    98  06-07-15         48
4  2   14-01-13    52  08-04-13         77
5  2   29-07-15    97  01-08-15         68
6  3   30-05-13    91  01-04-13         85
7  1   04-11-13    70  21-02-14         35
8  2   15-06-15    98  01-08-15         68
9  3   17-11-14    68  15-12-14         95

P.S. Are there better ways of generating random dates (not using seq.Date)?

For your "P.S." you should be able to adapt stackoverflow.com/questions/14720983/… with an as.Date at the end of the function (and perhaps a format, if you need it as %d-%m-%Y in the data frame)
– hrbrmstr, Commented Jan 21, 2015 at 17:15
You can also do something like Sys.Date() + sample(-1000:1000, 20) if you don't care too much about start / end dates
– talat, Commented Jan 21, 2015 at 17:36
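
The two suggestions above can be sketched as follows. This is only an illustration, not the code from the linked answer: random_dates is a hypothetical helper name, and the default date range is an arbitrary choice. It avoids seq.Date by adding a random number of days to the start date.

```r
# Hypothetical helper: draw n random dates between two bounds (inclusive)
random_dates <- function(n, start = "2013-01-01", end = "2015-12-31") {
  start <- as.Date(start)
  end   <- as.Date(end)
  # Whole days between the bounds; sample day offsets within that span
  span <- as.integer(end - start)
  start + sample(0:span, n, replace = TRUE)
}

set.seed(1)
d <- random_dates(10)
format(d, "%d-%m-%Y")   # display as day-month-year if needed

# talat's variant: dates scattered around today, no fixed bounds
Sys.Date() + sample(-1000:1000, 20)
```

Because every offset is a valid number of days, this approach never produces the NA dates that pasting day, month and year together can.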

Marat Talipov · Accepted Answer · 2015-01-21 17:31:23Z

16

Here is a solution that uses only base R:

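z <- lapply(intersect(df1$ID,df2$ID),function(id) {
   # Rows of each data frame for the current ID
   d1 <- subset(df1,ID==id)
   d2 <- subset(df2,ID==id)

   # For each date in d1, the row index of the nearest date in d2
   # (which.min() ignores NA dates in d2)
   d1$indices <- sapply(d1$dateTarget,function(d) which.min(abs(d2$dateTarget - d)))
   d2$indices <- seq_len(nrow(d2))

   # Attach the matching d2 row to each d1 row
   merge(d1,d2,by=c('ID','indices'))
  })

# Stack the per-ID results and drop the helper column
z2 <- do.call(rbind,z)
z2$indices <- NULL

print(z2)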

#    ID dateTarget.x Value dateTarget.y ValueMatch
# 1   3   2015-11-14    47   2015-07-06         48
# 2   3   2015-12-08    98   2015-07-06         48
# 3   3   2015-02-22    52   2015-03-09         94
# 4   3   2014-11-17    68   2014-12-15         95
# 5   3   2013-05-30    91   2013-04-01         85
# 6   1   2013-11-04    70   2014-02-21         35
# 7   1   2014-12-29    18   2014-12-06         88
# 8   2   2013-01-14    52   2013-04-08         77
# 9   2   2015-07-29    97   2015-08-01         68
# 10  2   2015-06-15    98   2015-08-01         68
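
To see why the which.min() step copes with missing dates in df2, here is a small hand-made example. The data are invented for illustration and are not the question's df1/df2. which.min() skips NA values, so a missing date is never picked as the nearest match.

```r
# Toy data for a single ID; the second candidate date is missing
target     <- as.Date(c("2015-02-22", "2014-11-17"))
candidates <- as.Date(c("2015-03-09", NA, "2014-12-15"))

# For each target date, find the position of the closest candidate.
# vapply() is used here so the result is guaranteed to be an integer vector.
nearest <- vapply(seq_along(target), function(i) {
  which.min(abs(as.numeric(candidates - target[i])))
}, integer(1))

nearest               # 1 3
candidates[nearest]   # "2015-03-09" "2014-12-15"
```

One caveat under the same assumptions: if every candidate date for an ID is NA, which.min() returns a zero-length result and the lookup fails, so such groups would need to be filtered out beforehand.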

1 Comment

Simon Over a year ago

Works indeed. I'll apply to the real dataset and work through understanding it better.

Colonel Beauvel · Answer · 2015-01-21 18:01:06Z

14

A simple and elegant solution using a data.table rolling join:

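library(data.table)

# Convert both data frames to data.tables by reference
setDT(df1)
setDT(df2)

# Key df2 on the join columns and keep a copy of its date as dateMatch,
# because after the join dateTarget holds df1's dates
setkey(df2, ID, dateTarget)[, dateMatch:=dateTarget]
# For each row of df1, roll to the nearest dateTarget in df2 within the same ID;
# df1's first two columns (ID, dateTarget) are matched against df2's key
df2[df1, roll='nearest']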

    ID dateTarget ValueMatch  dateMatch Value
 1:  3 2015-11-14         48 2015-07-06    47
 2:  3 2015-02-22         94 2015-03-09    52
 3:  1 2014-12-29         88 2014-12-06    18
 4:  3 2015-12-08         48 2015-07-06    98
 5:  2 2013-01-14         77 2013-04-08    52
 6:  2 2015-07-29         68 2015-08-01    97
 7:  3 2013-05-30         85 2013-04-01    91
 8:  1 2013-11-04         35 2014-02-21    70
 9:  2 2015-06-15         68 2015-08-01    98
10:  3 2014-11-17         95 2014-12-15    68

4 Comments

Simon Over a year ago

Good. I did try data.table from this example, but was stuck on how to use two variables as key.

Colonel Beauvel Over a year ago

Here is a very very good intro (10 mins reading) presenting what you asked: cran.r-project.org/web/packages/data.table/vignettes/…

marine8115 Over a year ago

Hello, I am following the exact same code but I am getting the following error: Error in bmerge(i, x, leftcols, rightcols, xo, roll, rollends, nomatch, : typeof x.IMO (double) != typeof i.name (character). Could you please help?

Joost Keuskamp Over a year ago

@AmitR.Pathak, try adding setkey(df2, ID, dateTarget)
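
Regarding the bmerge error above: the message reports that a join column is double in one table and character in the other, which can happen when the tables are matched by column position rather than by name. A minimal sketch that names the join columns explicitly with on= instead of relying on keys is shown below. The tables a and b are invented toy data, not the question's df1/df2.

```r
library(data.table)

# Invented toy data; ID and date columns have the same types in both tables
a <- data.table(ID = c(1L, 1L, 2L),
                dateTarget = as.Date(c("2015-01-10", "2015-06-01", "2015-03-01")),
                Value = c(10, 20, 30))
b <- data.table(ID = c(1L, 1L, 2L, 2L),
                dateTarget = as.Date(c("2015-01-01", "2015-05-20",
                                       "2015-02-01", "2015-03-15")),
                ValueMatch = c(100, 200, 300, 400))

# Keep a copy of b's date, because the join column takes its values from a
b[, dateMatch := dateTarget]

# on= names the join columns; the last one listed is the rolling column
res <- b[a, on = .(ID, dateTarget), roll = "nearest"]
res$ValueMatch   # 100 200 400
```

If the types still differ in real data, converting both columns to a common class (for example with as.Date() or as.integer()) before the join is the usual first thing to check.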

Matias Andina · Answer · 2022-08-19 18:57:50Z

Here's my take using dplyr, based on the accepted answer. I wanted to have a bit more freedom on the grouping column.


library(dplyr)
library(purrr)  # for map_int()

match_by_group_date <- function(df1, df2, grp, datecol) {

  # Groups present in both data frames
  grp1 <- df1 %>% pull({{grp}}) %>% unique()
  grp2 <- df2 %>% pull({{grp}}) %>% unique()

  li <-
  lapply(intersect(grp1, grp2), function(tt) {
    d1 <- filter(df1, {{grp}} == tt)
    d2 <- filter(df2, {{grp}} == tt) %>% mutate(indices = seq_len(n()))
    d2_date <- d2 %>% pull({{datecol}}) %>% as.POSIXct()
    # Row index in d2 of the date nearest to each date in d1
    d1 <- mutate(d1, indices = map_int({{datecol}}, function(d) which.min(abs(d2_date - as.POSIXct(d)))))

    left_join(d1, d2, by = c(quo_name(enquo(grp)), "indices"))
  })

  # bind rows
  return(bind_rows(li))
}

Update

As of 2022, a join_by() helper is in development. See the dplyr dev docs:

https://dplyr.tidyverse.org/dev/reference/join_by.html

For now I will continue using this method, or data.table. But join_by() will probably become stable and fast enough to be the preferred option.
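
For readers who want to try join_by(), here is a hedged sketch. It assumes dplyr 1.1.0 or later, where join_by() and closest() are available, and it uses invented toy data (x and y) rather than the question's df1/df2. closest() only looks in one direction, so this sketch runs one join per direction and keeps the smaller gap. It is one possible way to get a nearest match, not an official recipe.

```r
library(dplyr)  # join_by() and closest() assume dplyr >= 1.1.0

# Invented toy data
x <- data.frame(ID = c(1, 1, 2),
                dateTarget = as.Date(c("2015-01-10", "2015-06-01", "2015-03-01")),
                Value = c(10, 20, 30))
y <- data.frame(ID = c(1, 1, 2, 2),
                dateMatch = as.Date(c("2015-01-01", "2015-05-20",
                                      "2015-02-01", "2015-03-15")),
                ValueMatch = c(100, 200, 300, 400))

# Closest match on or before each target date
before <- left_join(x, y, by = join_by(ID, closest(dateTarget >= dateMatch)))
# Closest match on or after each target date
after  <- left_join(x, y, by = join_by(ID, closest(dateTarget <= dateMatch)))

# Keep, for each row of x, whichever candidate has the smaller gap.
# Grouping by all columns of x assumes the rows of x are distinct.
nearest <- bind_rows(before, after) %>%
  filter(!is.na(dateMatch)) %>%
  group_by(ID, dateTarget, Value) %>%
  slice_min(abs(as.numeric(dateMatch - dateTarget)), n = 1, with_ties = FALSE) %>%
  ungroup()

nearest$ValueMatch   # 100 200 400
```

Unlike a plain left_join() on ID, each of the two joins returns at most one candidate per row of x, so the intermediate result stays small.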

teru · Answer · 2020-09-05 13:12:20Z

2

We can also do this in a single dplyr pipeline.

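library(dplyr)

# Pair every df1 row with every df2 row of the same ID, then keep the nearest
left_join(df1, df2, by = "ID") %>%
  mutate(dateDiff = abs(dateTarget.x - dateTarget.y)) %>%
  group_by(ID, dateTarget.x) %>%
  # na.rm = TRUE so that NA dates in df2 do not wipe out the whole group
  filter(dateDiff == min(dateDiff, na.rm = TRUE))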

1 Comment

Joost Keuskamp Over a year ago

This works, but creates a very large dataframe of up to nrow(df1) * nrow(df2) rows prior to filtering
