I have a data set da with a given frequency distribution of dates (see below). I have another data set, db, in which each id may have one or more records. Is there a feasible way to select exactly one record (no more and no less) for each id, so that the distribution of date in the sample drawn from db is as close as possible to that in da?
library(data.table)
library(dplyr)
library(lubridate)
# da is the data to be emulated
da = data.table(id = paste0('a', 1:7),
                date = ymd(c('2021-1-10',
                             rep('2021-1-11', 2),
                             rep('2021-1-12', 3),
                             '2021-1-13')))
da
da[, .N, by = date][, .(date, N, perc = N / sum(N))]
#          date N      perc
# 1: 2021-01-10 1 0.1428571
# 2: 2021-01-11 2 0.2857143
# 3: 2021-01-12 3 0.4285714
# 4: 2021-01-13 1 0.1428571
# need to get only one (no more and no less) sample for each id
# to emulate the distribution of date in da
set.seed(123)
db = structure(list(id = c(1L, 2L, 3L, 3L, 3L, 4L, 5L, 6L, 6L, 8L),
                    date = structure(c(18638, 18639, 18639, 18640, 18640, 18637,
                                       18640, 18637, 18638, 18639), class = "Date")),
               class = c("data.table", "data.frame"))
db
#     id       date
#  1:  1 2021-01-11
#  2:  2 2021-01-12
#  3:  3 2021-01-12
#  4:  3 2021-01-13
#  5:  3 2021-01-13
#  6:  4 2021-01-10
#  7:  5 2021-01-13
#  8:  6 2021-01-10
#  9:  6 2021-01-11
# 10:  8 2021-01-12
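To make the goal concrete, here is a minimal sketch of one possible approach: a greedy heuristic that handles the ids with the fewest candidate dates first and, for each id, picks the date that is currently furthest below its target count. The function name pick_one_per_id is hypothetical, the example data is rebuilt with base R so the block runs on its own, and the heuristic is not guaranteed to find the closest possible match in general. It touches only the id and date columns, so it should work whether db is a data.table or a data.frame.

```r
# Example data mirroring da and db from the question (base R only).
da <- data.frame(
  id   = paste0("a", 1:7),
  date = as.Date(c("2021-01-10", rep("2021-01-11", 2),
                   rep("2021-01-12", 3), "2021-01-13"))
)
db <- data.frame(
  id   = c(1L, 2L, 3L, 3L, 3L, 4L, 5L, 6L, 6L, 8L),
  date = as.Date(c(18638, 18639, 18639, 18640, 18640, 18637,
                   18640, 18637, 18638, 18639), origin = "1970-01-01")
)

# Hypothetical helper (an assumption, not an existing library function):
# greedily pick one row of db per id so that the date counts of the picks
# track the date shares observed in da.
pick_one_per_id <- function(da, db) {
  da_dates <- as.character(da$date)
  db_dates <- as.character(db$date)
  all_dates <- sort(unique(c(da_dates, db_dates)))

  # Target count per date = share in da * number of distinct ids in db.
  # Dates that never occur in da keep a target of 0.
  share  <- table(da_dates) / nrow(da)
  target <- setNames(numeric(length(all_dates)), all_dates)
  target[names(share)] <- as.numeric(share) * length(unique(db$id))

  # Candidate dates per id, most constrained ids (fewest candidates) first.
  opts <- lapply(split(db_dates, db$id), unique)
  opts <- opts[order(lengths(opts))]

  filled <- setNames(numeric(length(all_dates)), all_dates)
  chosen <- setNames(character(length(opts)), names(opts))
  for (i in names(opts)) {
    d <- opts[[i]]
    # Pick the candidate date with the largest remaining deficit.
    best <- d[which.max(target[d] - filled[d])]
    chosen[[i]] <- best
    filled[best] <- filled[best] + 1
  }

  # Return the first matching original row for each (id, date) pick.
  idx <- match(paste(names(chosen), chosen), paste(db$id, db_dates))
  db[sort(idx), ]
}

res <- pick_one_per_id(da, db)
res
prop.table(table(res$date))
```

On this small example the picks give date counts of 1, 2, 3 and 1, the same as in da. For larger data, or when an exact match is impossible, a greedy pass like this may settle on a suboptimal assignment, so an exact formulation (for instance as an assignment or integer programming problem) might be needed.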