I have a dataset with two groups of people: vaccinated and unvaccinated. In the vaccinated group, each row represents a unique ID with a corresponding unique T0. In the unvaccinated group, each ID may appear in multiple rows, each associated with a different T0 (long format). Each row includes three variables: the number of PCP visits, specialty visits, and lab visits.

I aim to sample from this dataset such that the resulting data includes one unique ID per row and the average values for PCP visits, specialty visits, and lab visits are similar between the vaccinated and unvaccinated groups. How can I achieve this in R? I think this will at least involve some stratified sampling for the unvaccinated group because each person can have multiple records.

Below is some R code to create sample data:

set.seed(123)
data <- data.frame(
  ID = c(1:10, rep(11:20, 4)), # Vaccinated IDs are unique, Unvaccinated have repeats
  group = c(rep("vaccinated", 10), rep("unvaccinated", 40)),
  T0 = rep(1:10, 5),
  PCP_visits = c(sample(0:10, 10, replace = TRUE), 
                 sample(3:13, 40, replace = TRUE)),
  specialty_visits = c(sample(0:5, 10, replace = TRUE),
                       sample(2:7, 40, replace = TRUE)),
  lab_visits = c(sample(0:8, 10, replace = TRUE),
                 sample(2:10, 40, replace = TRUE))
)

data
   ID        group T0 PCP_visits specialty_visits lab_visits
1   1   vaccinated  1          2                0          3
2   2   vaccinated  2          2                5          0
3   3   vaccinated  3          9                4          5
4   4   vaccinated  4          1                0          2
5   5   vaccinated  5          5                1          7
6   6   vaccinated  6         10                3          2
7   7   vaccinated  7          4                3          7
8   8   vaccinated  8          3                5          0
9   9   vaccinated  9          5                5          6
10 10   vaccinated 10          8                2          6
11 11 unvaccinated  1          9                5          6
12 12 unvaccinated  2         10                5          5
13 13 unvaccinated  3          4                0          6
14 14 unvaccinated  4          2                5          4
15 15 unvaccinated  5         10                1          5
16 16 unvaccinated  6          8                0          7
17 17 unvaccinated  7          8                1          4
18 18 unvaccinated  8          8                3          6
19 19 unvaccinated  9          2                4          3
20 20 unvaccinated 10          7                4          2
21 11 unvaccinated  1          9                5          8
22 12 unvaccinated  2          6                2          6
23 13 unvaccinated  3          9                0          5
24 14 unvaccinated  4          8                3          8
25 15 unvaccinated  5          2                5          6
26 16 unvaccinated  6          3                0          1
27 17 unvaccinated  7          0                5          2
28 18 unvaccinated  8         10                0          7
29 19 unvaccinated  9          6                2          3
30 20 unvaccinated 10          4                5          6
31 11 unvaccinated  1          9                3          3
32 12 unvaccinated  2          6                0          0
33 13 unvaccinated  3          8                5          7
34 14 unvaccinated  4          8                5          3
35 15 unvaccinated  5          9                2          8
36 16 unvaccinated  6          6                5          7
37 17 unvaccinated  7         10                4          5
38 18 unvaccinated  8          4                2          3
39 19 unvaccinated  9          6                5          7
40 20 unvaccinated 10          4                1          2
41 11 unvaccinated  1         10                4          3
42 12 unvaccinated  2          5                4          3
43 13 unvaccinated  3          8                2          5
44 14 unvaccinated  4          1                1          0
45 15 unvaccinated  5          4                1          3
46 16 unvaccinated  6          7                1          8
47 17 unvaccinated  7          1                3          6
48 18 unvaccinated  8          0                1          7
49 19 unvaccinated  9          8                1          4
50 20 unvaccinated 10         10                5          1

The vaccinated and unvaccinated groups are systematically different, and the sampling process needs to make sure that, after sampling, the two groups are similar in terms of PCP_visits, specialty_visits, and lab_visits.

library(dplyr)

# Characteristics before sampling
data %>% 
  group_by(group) %>% 
  summarise(n = n(),
            PCP_visits = mean(PCP_visits),
            specialty_visits = mean(specialty_visits),
            lab_visits = mean(lab_visits))
# A tibble: 2 × 5
  group            n PCP_visits specialty_visits lab_visits
  <chr>        <int>      <dbl>            <dbl>      <dbl>
1 unvaccinated    40        8.9             4.25       5.72
2 vaccinated      10        4.6             2.3        3.3 

1 Answer

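You can use the CVXR package to find the optimal selection of unvaccinated rows, under the constraint that exactly one unvaccinated row is chosen for each T0 value. In the sample data every unvaccinated ID has a single T0 value, so this also yields one row per ID; if an ID can have several T0 values, as in the long format the question describes, build the constraints over ID instead of T0.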

library(dplyr)
library(CVXR)

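my_fun <- function(df) {
  dfv <- df[df$group == 'vaccinated', ]
  dfu <- df[df$group == 'unvaccinated', ]

  # Row vectors (1 x n) of each measure, using the full column names
  PCP_v <- t(dfv$PCP_visits)
  specialty_v <- t(dfv$specialty_visits)
  lab_v <- t(dfv$lab_visits)

  PCP_u <- t(dfu$PCP_visits)
  specialty_u <- t(dfu$specialty_visits)
  lab_u <- t(dfu$lab_visits)

  # One 0/1 decision variable per unvaccinated row (1 = row is selected)
  x <- Variable(nrow(dfu), boolean = TRUE)

  # Exactly one row is chosen per T0, so the number of selected rows is fixed
  n_sel <- length(unique(dfu$T0))

  # The objective is to minimize the absolute mean differences.
  # PCP_u %*% x is the sum over the selected rows; dividing by n_sel gives their mean.
  objective <- Minimize(
      abs(mean(PCP_v) - sum_entries(PCP_u %*% x) / n_sel) +
      abs(mean(specialty_v) - sum_entries(specialty_u %*% x) / n_sel) +
      abs(mean(lab_v) - sum_entries(lab_u %*% x) / n_sel))

  # Under the constraint that 1 and only 1 unvaccinated record per T0 is chosen.
  constraints <- lapply(unique(dfu$T0), \(i) sum(x[seq_along(dfu$T0)[dfu$T0 == i]]) == 1)

  prob <- Problem(objective, constraints)
  res <- solve(prob)

  # Round the solution to guard against solver values such as 0.9999999
  arrange(dfu[which(round(res$getValue(x)) == 1), ], T0)
}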
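
To see what the constraint list is built from, here is a minimal base-R sketch on a made-up T0 vector (the names `t0`, `idx_by_t0` and the values are illustrative, not taken from the question). `split()` groups the row positions by T0 value, and a 0/1 selection is feasible when every group contains exactly one chosen row, which is what each `sum(x[...]) == 1` constraint asks of the solver.

```r
# Hypothetical T0 column for six unvaccinated rows (illustrative values only)
t0 <- c(1, 2, 3, 1, 2, 3)

# Row positions grouped by T0 value; the same sets as
# seq_along(t0)[t0 == i] for each unique i
idx_by_t0 <- split(seq_along(t0), t0)

# A candidate 0/1 selection vector: rows 1, 3 and 5 are chosen
x <- c(1, 0, 1, 0, 1, 0)

# Feasible if every T0 group contains exactly one chosen row
feasible <- all(vapply(idx_by_t0, function(idx) sum(x[idx]) == 1, logical(1)))
feasible
```

Grouping by `ID` instead of `T0` works the same way: pass the ID column as the second argument to `split()`.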

Call the function, row binding the result to the vaccinated records.

df2 <- filter(data, group=='vaccinated') %>%
  bind_rows(my_fun(data))

Check the group means after sampling (the exact numbers depend on the data and the solver):

df2 %>% 
  group_by(group) %>% 
  summarise(n = n(),
            PCP_visits = mean(PCP_visits),
            specialty_visits = mean(specialty_visits),
            lab_visits = mean(lab_visits))
# A tibble: 2 × 5
  group            n PCP_visits specialty_visits lab_visits
  <chr>        <int>      <dbl>            <dbl>      <dbl>
1 unvaccinated    10        6.3              3.6        4.7
2 vaccinated      10        4.9              2.8        3.8
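
If no mixed-integer solver is available, a rough alternative is a random search: repeatedly draw one row per unvaccinated ID and keep the draw whose means are closest to the vaccinated means. The sketch below is not part of the CVXR approach above and is not guaranteed to find the optimum; the function name `pick_one_per_id`, its arguments, and the toy data are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible

# Toy data (made-up values): 3 vaccinated rows, 3 unvaccinated IDs with 2 rows each
toy <- data.frame(
  ID = c(1, 2, 3, 4, 4, 5, 5, 6, 6),
  group = c(rep("vaccinated", 3), rep("unvaccinated", 6)),
  PCP_visits = c(2, 4, 6, 3, 9, 4, 10, 5, 8)
)

# Hypothetical helper: random search over one-row-per-ID selections
pick_one_per_id <- function(df, vars, n_draws = 500) {
  dfv <- df[df$group == "vaccinated", ]
  dfu <- df[df$group == "unvaccinated", ]
  target <- colMeans(dfv[, vars, drop = FALSE])
  rows_by_id <- split(seq_len(nrow(dfu)), dfu$ID)

  best_rows <- NULL
  best_gap <- Inf
  for (i in seq_len(n_draws)) {
    # Draw one row position per ID; indexing with sample.int() stays
    # correct even when an ID has a single row
    rows <- vapply(rows_by_id,
                   function(idx) idx[sample.int(length(idx), 1)],
                   integer(1))
    # Sum of absolute differences between group means
    gap <- sum(abs(colMeans(dfu[rows, vars, drop = FALSE]) - target))
    if (gap < best_gap) {
      best_gap <- gap
      best_rows <- rows
    }
  }
  rbind(dfv, dfu[best_rows, ])
}

balanced <- pick_one_per_id(toy, vars = "PCP_visits")
balanced
```

With the question's data the call would be `pick_one_per_id(data, c("PCP_visits", "specialty_visits", "lab_visits"))`. More draws improve the match at the cost of run time, and the result can be checked with the same `group_by()` and `summarise()` code used above.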