Stratified sampling based on multiple predictors in R

Question

I have a dataset with two groups of people: vaccinated and unvaccinated. In the vaccinated group, each row represents a unique ID with a corresponding unique T0. In the unvaccinated group, each ID may appear in multiple rows, each associated with a different T0 (long format). Each row includes three variables: the number of PCP visits, specialty visits, and lab visits.

I aim to sample from this dataset such that the resulting data includes one unique ID per row and the average values for PCP visits, specialty visits, and lab visits are similar between the vaccinated and unvaccinated groups. How can I achieve this in R? I think this will at least involve some stratified sampling for the unvaccinated group because each person can have multiple records.

Below is some R code to create sample data:

set.seed(123)
data <- data.frame(
  ID = c(1:10, rep(11:20, 4)), # Vaccinated IDs are unique, Unvaccinated have repeats
  group = c(rep("vaccinated", 10), rep("unvaccinated", 40)),
  T0 = rep(1:10, 5),
  PCP_visits = c(sample(0:10, 10, replace = TRUE), 
                 sample(3:13, 40, replace = TRUE)),
  specialty_visits = c(sample(0:5, 10, replace = TRUE),
                       sample(2:7, 40, replace = TRUE)),
  lab_visits = c(sample(0:8, 10, replace = TRUE),
                 sample(2:10, 40, replace = TRUE))
)

data

   ID        group T0 PCP_visits specialty_visits lab_visits
1   1   vaccinated  1          2                0          3
2   2   vaccinated  2          2                5          0
3   3   vaccinated  3          9                4          5
4   4   vaccinated  4          1                0          2
5   5   vaccinated  5          5                1          7
6   6   vaccinated  6         10                3          2
7   7   vaccinated  7          4                3          7
8   8   vaccinated  8          3                5          0
9   9   vaccinated  9          5                5          6
10 10   vaccinated 10          8                2          6
11 11 unvaccinated  1          9                5          6
12 12 unvaccinated  2         10                5          5
13 13 unvaccinated  3          4                0          6
14 14 unvaccinated  4          2                5          4
15 15 unvaccinated  5         10                1          5
16 16 unvaccinated  6          8                0          7
17 17 unvaccinated  7          8                1          4
18 18 unvaccinated  8          8                3          6
19 19 unvaccinated  9          2                4          3
20 20 unvaccinated 10          7                4          2
21 11 unvaccinated  1          9                5          8
22 12 unvaccinated  2          6                2          6
23 13 unvaccinated  3          9                0          5
24 14 unvaccinated  4          8                3          8
25 15 unvaccinated  5          2                5          6
26 16 unvaccinated  6          3                0          1
27 17 unvaccinated  7          0                5          2
28 18 unvaccinated  8         10                0          7
29 19 unvaccinated  9          6                2          3
30 20 unvaccinated 10          4                5          6
31 11 unvaccinated  1          9                3          3
32 12 unvaccinated  2          6                0          0
33 13 unvaccinated  3          8                5          7
34 14 unvaccinated  4          8                5          3
35 15 unvaccinated  5          9                2          8
36 16 unvaccinated  6          6                5          7
37 17 unvaccinated  7         10                4          5
38 18 unvaccinated  8          4                2          3
39 19 unvaccinated  9          6                5          7
40 20 unvaccinated 10          4                1          2
41 11 unvaccinated  1         10                4          3
42 12 unvaccinated  2          5                4          3
43 13 unvaccinated  3          8                2          5
44 14 unvaccinated  4          1                1          0
45 15 unvaccinated  5          4                1          3
46 16 unvaccinated  6          7                1          8
47 17 unvaccinated  7          1                3          6
48 18 unvaccinated  8          0                1          7
49 19 unvaccinated  9          8                1          4
50 20 unvaccinated 10         10                5          1

The vaccinated and unvaccinated groups are systematically different, and the sampling process needs to make sure that, after sampling, the two groups are similar in terms of PCP_visits, specialty_visits, and lab_visits.

library(dplyr)

# Characteristics before sampling
data %>% 
  group_by(group) %>% 
  summarise(n = n(),
            PCP_visits = mean(PCP_visits),
            specialty_visits = mean(specialty_visits),
            lab_visits = mean(lab_visits))

# A tibble: 2 × 5
  group            n PCP_visits specialty_visits lab_visits
  <chr>        <int>      <dbl>            <dbl>      <dbl>
1 unvaccinated    40        8.9             4.25       5.72
2 vaccinated      10        4.6             2.3        3.3
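
Raw means can be hard to compare across variables with different spreads, so a scale-free diagnostic such as the standardized mean difference may help when judging balance before and after sampling. The sketch below is an illustration only: the helper `smd()` and the `toy` data frame are hypothetical and not part of the original post.

```r
# Hypothetical helper (not from the original post): standardized mean
# difference of one covariate between the treated group and everyone else.
smd <- function(x, g, treated = "vaccinated") {
  x1 <- x[g == treated]
  x0 <- x[g != treated]
  pooled_sd <- sqrt((var(x1) + var(x0)) / 2)
  (mean(x1) - mean(x0)) / pooled_sd
}

# Toy data for illustration only
toy <- data.frame(
  group = rep(c("vaccinated", "unvaccinated"), each = 4),
  PCP_visits = c(2, 4, 6, 8, 6, 8, 10, 12),
  lab_visits = c(1, 2, 3, 4, 1, 2, 3, 4)
)

# One value per covariate; values near 0 indicate similar group means
smd_values <- sapply(toy[c("PCP_visits", "lab_visits")], smd, g = toy$group)
smd_values
```

The same call could be applied to the real data by passing `data$group` and the three visit columns.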

Edward · Accepted Answer · 2025-01-02 10:51:25Z

You can use the CVXR package to find the optimal selection of unvaccinated rows, under the constraint that exactly one unvaccinated row is chosen for each T0 value. In the sample data each unvaccinated ID always has the same T0, so this also yields one row per ID.

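library(dplyr)
library(CVXR)

my_fun <- function(df) {
  dfv <- df[df$group == 'vaccinated', ]
  dfu <- df[df$group == 'unvaccinated', ]

  # Target means, taken from the vaccinated rows only
  PCP_v <- mean(dfv$PCP_visits)
  specialty_v <- mean(dfv$specialty_visits)
  lab_v <- mean(dfv$lab_visits)

  # 1 x n row vectors, so that e.g. PCP_u %*% x is the sum over selected rows
  PCP_u <- t(dfu$PCP_visits)
  specialty_u <- t(dfu$specialty_visits)
  lab_u <- t(dfu$lab_visits)

  # One boolean decision variable per unvaccinated row (1 = selected)
  x <- Variable(nrow(dfu), boolean = TRUE)

  # Exactly one row is selected per T0, so the number of selected rows is fixed
  n_sel <- length(unique(dfu$T0))

  # The objective is to minimize the sum of the absolute mean differences.
  # Dividing the selected sums by n_sel turns them into selected means.
  objective <- Minimize(
      abs(PCP_v - (PCP_u %*% x) / n_sel) +
      abs(specialty_v - (specialty_u %*% x) / n_sel) +
      abs(lab_v - (lab_u %*% x) / n_sel))

  # Under the constraint that 1 and only 1 unvaccinated record per T0 is chosen.
  constraints <- lapply(unique(dfu$T0), \(i) sum(x[which(dfu$T0 == i)]) == 1)

  prob <- Problem(objective, constraints)
  res <- solve(prob)

  # Compare against 0.5 rather than 1 to tolerate solver rounding
  arrange(dfu[which(res$getValue(x) > 0.5), ], T0)
}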
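
To see what the matrix products and the per-T0 constraints amount to, it can help to replace the CVXR variable with a fixed 0/1 vector and work through the arithmetic in base R. This is a small illustration with made-up numbers, not part of the original answer. Note that the product of a row vector with the selection vector is a sum over the selected rows, and it only becomes a mean after dividing by the number of rows selected.

```r
# Made-up numbers for illustration only
PCP_u <- t(c(9, 10, 4, 2))  # 1 x 4 row vector of covariate values
x <- c(1, 0, 0, 1)          # fixed 0/1 selection: rows 1 and 4 are chosen

selected_sum <- drop(PCP_u %*% x)       # sum over the selected rows
selected_mean <- selected_sum / sum(x)  # mean over the selected rows

# Row indices grouped by T0; each group corresponds to one
# "sum(x[idx]) == 1" constraint in the optimization problem
T0 <- c(1, 2, 1, 2)
idx_by_T0 <- split(seq_along(T0), T0)
picked_per_T0 <- sapply(idx_by_T0, function(idx) sum(x[idx]))
```

Grouping the indices by `ID` instead of `T0` would give a one-row-per-ID constraint, which is the same thing in the sample data but may differ in data where an ID has several T0 values.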

Call the function, row binding the result to the vaccinated records.

df2 <- filter(data, group=='vaccinated') %>%
  bind_rows(my_fun(data))

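Check the group means in the sampled data (the exact values depend on the generated data and on the solution the solver returns):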

df2 %>% 
  group_by(group) %>% 
  summarise(n = n(),
            PCP_visits = mean(PCP_visits),
            specialty_visits = mean(specialty_visits),
            lab_visits = mean(lab_visits))
# A tibble: 2 × 5
  group            n PCP_visits specialty_visits lab_visits
  <chr>        <int>      <dbl>            <dbl>      <dbl>
1 unvaccinated    10        6.3              3.6        4.7
2 vaccinated      10        4.9              2.8        3.8
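
As a sanity check that does not need a solver, a plain random search can serve as a baseline: draw one row per unvaccinated ID many times and keep the draw whose means are closest to the vaccinated means. This is a different and cruder technique than the CVXR optimization, and it is not guaranteed to find the best selection. The function `random_search()` and the `toy` data below are hypothetical and for illustration only.

```r
# Hypothetical baseline (not the CVXR approach): random one-row-per-ID draws,
# keeping the draw with the smallest total absolute mean difference.
random_search <- function(df, vars, n_iter = 2000) {
  dfv <- df[df$group == "vaccinated", ]
  dfu <- df[df$group == "unvaccinated", ]
  target <- colMeans(dfv[vars])
  rows_by_id <- split(seq_len(nrow(dfu)), dfu$ID)

  best_rows <- NULL
  best_score <- Inf
  for (i in seq_len(n_iter)) {
    # Pick exactly one row index per ID (indexing with sample.int avoids
    # sample()'s special case for a length-one vector)
    rows <- vapply(rows_by_id,
                   function(r) r[sample.int(length(r), 1)],
                   integer(1))
    score <- sum(abs(colMeans(dfu[rows, vars, drop = FALSE]) - target))
    if (score < best_score) {
      best_score <- score
      best_rows <- rows
    }
  }
  list(sample = dfu[best_rows, ], score = best_score)
}

# Toy data for illustration only
set.seed(1)
toy <- data.frame(
  ID = c(1:3, rep(4:6, each = 3)),
  group = c(rep("vaccinated", 3), rep("unvaccinated", 9)),
  PCP_visits = c(2, 4, 6, 1, 4, 9, 3, 4, 8, 2, 4, 7),
  lab_visits = c(1, 3, 5, 0, 3, 6, 2, 3, 7, 1, 3, 9)
)
res <- random_search(toy, c("PCP_visits", "lab_visits"), n_iter = 500)
res$sample
```

If the optimizer's result is not at least as well balanced as such a baseline, that is a hint that the objective or the constraints deserve a second look.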
