1

I have a data da that has some date frequency distribution (see below). I have another data db, in which each id may have one or multiple records. Is there a feasible way to get one record (no more and no less) for each group, so that the sampled distribution of date in db is as close as possible to that of da?

library(data.table)
library(dplyr)
library(lubridate)

# da is the data to be emulated
da = data.table(id = paste0('a', 1:7),
                date = ymd(c('2021-1-10', 
                             rep('2021-1-11', 2), 
                             rep('2021-1-12', 3), 
                             '2021-1-13')))

da
da[,.N,(date)][,.(date, N, perc = N/sum(N))]
#          date N      perc
# 1: 2021-01-10 1 0.1428571
# 2: 2021-01-11 2 0.2857143
# 3: 2021-01-12 3 0.4285714
# 4: 2021-01-13 1 0.1428571


# need to get only one (no more and no less) sample for each id 
# to emulate the distribution of date in da
set.seed(123)
db = structure(list(id = c(1L, 2L, 3L, 3L, 3L, 4L, 5L, 6L, 6L, 8L), 
               date = structure(c(18638, 18639, 18639, 18640, 18640, 18637, 
                                  18640, 18637, 18638, 18639), class = "Date")), 
          class = c("data.table", "data.frame"))

> db
    id       date
 1:  1 2021-01-11
 2:  2 2021-01-12
 3:  3 2021-01-12
 4:  3 2021-01-13
 5:  3 2021-01-13
 6:  4 2021-01-10
 7:  5 2021-01-13
 8:  6 2021-01-10
 9:  6 2021-01-11
10:  8 2021-01-12

1 Answer 1

1

For a probabilistic approach (works well if the data is fairly large), you could use the strata function from the sampling package.

Here is the distribution to be emulated, which we save to an R object.

da.N <- da[,.N,(date)][,.(date, N, perc = N/sum(N))]; da.N

date     N      perc
<Date> <int>     <num>
1: 2021-01-10     1 0.1428571
2: 2021-01-11     2 0.2857143
3: 2021-01-12     3 0.4285714
4: 2021-01-13     1 0.1428571

Merge the two tables on date so that the percent column is in the data we want to sample from.

db <- da.N[db, on='date']

Then perform systematic unequal probability sampling on db. This will randomly select one row (size=1) from each id (strata) with inclusion probabilities ("pik") given by the distribution of the dates in da (column perc).

library(sampling)
set.seed(12)  # for reproducibility (omit in reality)

s <- strata(db, 
            stratanames="id", 
            size=rep(1, length(unique(db$id))), 
            pik=db$perc,  # must be db$perc
            method="systematic")

Get the data and check the frequency distribution of the dates.

db2 <- getdata(db, s)
setDT(db2)[,.N,(date)][,.(date, N, perc = N/sum(N))]

date     N      perc
<Date> <int>     <num>
1: 2021-01-10     2 0.2857143
2: 2021-01-11     1 0.1428571
3: 2021-01-12     2 0.2857143
4: 2021-01-13     2 0.2857143

Hard to say if it worked with such a small sample. Let's try it for a larger dataset.

set.seed(123)
db <- data.table(id = sample(1:400, 10000, TRUE),
                date = ymd(rep('2021-1-10', 100),
                           rep('2021-1-11', 100),
                           rep('2021-1-12', 100),
                           rep('2021-1-13', 100)))
db <- da.N[db, on='date']

set.seed(123)
s <- strata(db, 
            stratanames="id", 
            size=rep(1, length(unique(db$id))), 
            pik=db$perc,
            method="systematic")

db2 <- getdata(db, s)
setDT(db2)[,.N,(date)][,.(date, N, perc = N/sum(N))][order(date)]

         date     N   perc
       <Date> <int>  <num>
1: 2021-01-10    71 0.1775  # 0.1428571
2: 2021-01-11   109 0.2725  # 0.2857143
3: 2021-01-12   166 0.4150  # 0.4285714
4: 2021-01-13    54 0.1350  # 0.1428571

This is fairly close to the desired distribution (numbers after the #), and the ids should be unique.

db2[, .N, (id)]
        id     N
     <int> <int>
  1:   179     1
  2:    14     1
  3:   195     1
  4:   306     1
  5:   118     1
 ---            
396:   233     1
397:   192     1
398:    15     1
399:    62     1
400:    97     1
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.