I have a dataset with two groups of people: vaccinated and unvaccinated. In the vaccinated group, each row represents a unique ID with a corresponding unique T0. In the unvaccinated group, each ID may appear in multiple rows, each associated with a different T0 (long format). Each row includes three variables: the number of PCP visits, specialty visits, and lab visits.
I aim to sample from this dataset such that the resulting data includes one unique ID per row and the average values for PCP visits, specialty visits, and lab visits are similar between the vaccinated and unvaccinated groups. How can I achieve this in R? I think this will at least involve some stratified sampling for the unvaccinated group because each person can have multiple records.
Below is some R code to create sample data:
set.seed(123)
data <- data.frame(
  ID = c(1:10, rep(11:20, 4)), # Vaccinated IDs are unique, Unvaccinated have repeats
  group = c(rep("vaccinated", 10), rep("unvaccinated", 40)),
  T0 = rep(1:10, 5),
  PCP_visits = c(sample(0:10, 10, replace = TRUE),
                 sample(3:13, 40, replace = TRUE)),
  specialty_visits = c(sample(0:5, 10, replace = TRUE),
                       sample(2:7, 40, replace = TRUE)),
  lab_visits = c(sample(0:8, 10, replace = TRUE),
                 sample(2:10, 40, replace = TRUE))
)
data
ID group T0 PCP_visits specialty_visits lab_visits
1 1 vaccinated 1 2 0 3
2 2 vaccinated 2 2 5 0
3 3 vaccinated 3 9 4 5
4 4 vaccinated 4 1 0 2
5 5 vaccinated 5 5 1 7
6 6 vaccinated 6 10 3 2
7 7 vaccinated 7 4 3 7
8 8 vaccinated 8 3 5 0
9 9 vaccinated 9 5 5 6
10 10 vaccinated 10 8 2 6
11 11 unvaccinated 1 9 5 6
12 12 unvaccinated 2 10 5 5
13 13 unvaccinated 3 4 0 6
14 14 unvaccinated 4 2 5 4
15 15 unvaccinated 5 10 1 5
16 16 unvaccinated 6 8 0 7
17 17 unvaccinated 7 8 1 4
18 18 unvaccinated 8 8 3 6
19 19 unvaccinated 9 2 4 3
20 20 unvaccinated 10 7 4 2
21 11 unvaccinated 1 9 5 8
22 12 unvaccinated 2 6 2 6
23 13 unvaccinated 3 9 0 5
24 14 unvaccinated 4 8 3 8
25 15 unvaccinated 5 2 5 6
26 16 unvaccinated 6 3 0 1
27 17 unvaccinated 7 0 5 2
28 18 unvaccinated 8 10 0 7
29 19 unvaccinated 9 6 2 3
30 20 unvaccinated 10 4 5 6
31 11 unvaccinated 1 9 3 3
32 12 unvaccinated 2 6 0 0
33 13 unvaccinated 3 8 5 7
34 14 unvaccinated 4 8 5 3
35 15 unvaccinated 5 9 2 8
36 16 unvaccinated 6 6 5 7
37 17 unvaccinated 7 10 4 5
38 18 unvaccinated 8 4 2 3
39 19 unvaccinated 9 6 5 7
40 20 unvaccinated 10 4 1 2
41 11 unvaccinated 1 10 4 3
42 12 unvaccinated 2 5 4 3
43 13 unvaccinated 3 8 2 5
44 14 unvaccinated 4 1 1 0
45 15 unvaccinated 5 4 1 3
46 16 unvaccinated 6 7 1 8
47 17 unvaccinated 7 1 3 6
48 18 unvaccinated 8 0 1 7
49 19 unvaccinated 9 8 1 4
50 20 unvaccinated 10 10 5 1
The vaccinated and unvaccinated groups are systematically different, so the sampling process needs to ensure that the two groups are similar after sampling in terms of PCP_visits, specialty_visits, and lab_visits.
library(dplyr)

# Characteristics before sampling
data %>%
  group_by(group) %>%
  summarise(n = n(),
            PCP_visits = mean(PCP_visits),
            specialty_visits = mean(specialty_visits),
            lab_visits = mean(lab_visits))
# A tibble: 2 × 5
  group            n PCP_visits specialty_visits lab_visits
  <chr>        <int>      <dbl>            <dbl>      <dbl>
1 unvaccinated    40        8.9             4.25       5.72
2 vaccinated      10        4.6             2.3        3.3
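For reference, here is a naive baseline I can already write, assuming dplyr >= 1.0.0 (for slice_sample()). It keeps one randomly chosen row per ID and then recomputes the group means. The helper names sample_one_per_id and balance_table are my own and purely illustrative. This guarantees one unique ID per row, but it does nothing to make the group means similar, so it is only a starting point and not a solution.

```r
library(dplyr)

# Hypothetical helper: keep one randomly chosen row per ID.
# Vaccinated IDs already have a single row, so in practice only
# the unvaccinated group is thinned.
sample_one_per_id <- function(df) {
  df %>%
    group_by(ID) %>%
    slice_sample(n = 1) %>%
    ungroup()
}

# Hypothetical helper: group sizes and covariate means, same layout
# as the "before sampling" summary.
balance_table <- function(df) {
  df %>%
    group_by(group) %>%
    summarise(n = n(),
              PCP_visits = mean(PCP_visits),
              specialty_visits = mean(specialty_visits),
              lab_visits = mean(lab_visits))
}
```

With the sample data above, set.seed(123) followed by balance_table(sample_one_per_id(data)) should return 10 rows per group, but the means of the unvaccinated group still depend entirely on which rows happen to be drawn.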