sampling one row from each strata to emulate the distribution of a variable in r

Question

I have a data da that has some date frequency distribution (see below). I have another data db, in which each id may have one or multiple records. Is there a feasible way to get one record (no more and no less) for each group, so that the sampled distribution of date in db is as close as possible to that of da?

library(data.table)
library(dplyr)
library(lubridate)

# da is the data to be emulated
da = data.table(id = paste0('a', 1:7),
                date = ymd(c('2021-1-10', 
                             rep('2021-1-11', 2), 
                             rep('2021-1-12', 3), 
                             '2021-1-13')))

da
da[,.N,(date)][,.(date, N, perc = N/sum(N))]
#          date N      perc
# 1: 2021-01-10 1 0.1428571
# 2: 2021-01-11 2 0.2857143
# 3: 2021-01-12 3 0.4285714
# 4: 2021-01-13 1 0.1428571


# need to get only one (no more and no less) sample for each id 
# to emulate the distribution of date in da
set.seed(123)
db = structure(list(id = c(1L, 2L, 3L, 3L, 3L, 4L, 5L, 6L, 6L, 8L), 
               date = structure(c(18638, 18639, 18639, 18640, 18640, 18637, 
                                  18640, 18637, 18638, 18639), class = "Date")), 
          class = c("data.table", "data.frame"))

> db
    id       date
 1:  1 2021-01-11
 2:  2 2021-01-12
 3:  3 2021-01-12
 4:  3 2021-01-13
 5:  3 2021-01-13
 6:  4 2021-01-10
 7:  5 2021-01-13
 8:  6 2021-01-10
 9:  6 2021-01-11
10:  8 2021-01-12

Edward · Accepted Answer · 2024-07-19 06:33:33Z

For a probabilistic approach (works well if the data is fairly large), you could use the strata function from the sampling package.

Here is the distribution to be emulated, which we save to an R object.

da.N <- da[,.N,(date)][,.(date, N, perc = N/sum(N))]; da.N

date     N      perc
<Date> <int>     <num>
1: 2021-01-10     1 0.1428571
2: 2021-01-11     2 0.2857143
3: 2021-01-12     3 0.4285714
4: 2021-01-13     1 0.1428571

Merge the two tables on date so that the percent column is in the data we want to sample from.

db <- da.N[db, on='date']

Then perform systematic unequal probability sampling on db. This will randomly select one row (size=1) from each id (strata) with inclusion probabilities ("pik") given by the distribution of the dates in da (column perc).

library(sampling)
set.seed(12)  # for reproducibility (omit in reality)

s <- strata(db, 
            stratanames="id", 
            size=rep(1, length(unique(db$id))), 
            pik=db$perc,  # must be db$perc
            method="systematic")

Get the data and check the frequency distribution of the dates.

db2 <- getdata(db, s)
setDT(db2)[,.N,(date)][,.(date, N, perc = N/sum(N))]

date     N      perc
<Date> <int>     <num>
1: 2021-01-10     2 0.2857143
2: 2021-01-11     1 0.1428571
3: 2021-01-12     2 0.2857143
4: 2021-01-13     2 0.2857143

Hard to say if it worked with such a small sample. Let's try it for a larger dataset.

set.seed(123)
db <- data.table(id = sample(1:400, 10000, TRUE),
                date = ymd(rep('2021-1-10', 100),
                           rep('2021-1-11', 100),
                           rep('2021-1-12', 100),
                           rep('2021-1-13', 100)))
db <- da.N[db, on='date']

set.seed(123)
s <- strata(db, 
            stratanames="id", 
            size=rep(1, length(unique(db$id))), 
            pik=db$perc,
            method="systematic")

db2 <- getdata(db, s)
setDT(db2)[,.N,(date)][,.(date, N, perc = N/sum(N))][order(date)]

         date     N   perc
       <Date> <int>  <num>
1: 2021-01-10    71 0.1775  # 0.1428571
2: 2021-01-11   109 0.2725  # 0.2857143
3: 2021-01-12   166 0.4150  # 0.4285714
4: 2021-01-13    54 0.1350  # 0.1428571

This is fairly close to the desired distribution (numbers after the #), and the ids should be unique.

db2[, .N, (id)]
        id     N
     <int> <int>
  1:   179     1
  2:    14     1
  3:   195     1
  4:   306     1
  5:   118     1
 ---            
396:   233     1
397:   192     1
398:    15     1
399:    62     1
400:    97     1

Collectives™ on Stack Overflow

sampling one row from each strata to emulate the distribution of a variable in r

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related