How can I generate a random subsample of a population with specific requirements?

Question

Say I have a population of mixed ages and genders (and maybe other attributes), and I want to generate a random subsample (with replacement is ok) with certain attributes, e.g.:

Sample size N
50% of the sample should be age<30
20% of the sample should be male

I could first randomly pick N/2 people with age<30 and N/2 with age>=30, but the result would likely not have the correct gender mix. I could also sub-select so that 20% of the age<30 people are male, but that over-specifies the sample: I want the overall (marginal) distributions to match without specifying anything about the joint distribution of age and gender.

How do I generate this sample? What if I made it slightly more complicated and specified ranges:

Sample size N
50-80% under age 30 (uniform probability in that range)
20-30% male (uniform probability in that range)

I imagine it might be possible to generate such a sample iteratively, alternately pruning it to match each requirement until convergence, but I'm not sure how to do it properly. The dumbest way, of course, would be to generate random samples and reject any that don't match these requirements.

The reweight package sounds like it might be helpful: cran.r-project.org/web/packages/reweight/reweight.pdf
– Jon Spring, Commented Apr 29, 2021 at 5:49

Jon Spring · Accepted Answer · 2021-04-29 06:12:37Z

EDIT:

Here's a sample that is 70% under 30 and 20% male:

N <- 100000
orig_u30 <- 0.7
orig_male <- 0.2
set.seed(42)
# Age and gender are drawn independently of each other
my_sample <- data.frame(age = sample(c("under 30", "30+"), N, replace = TRUE,
                                     prob = c(orig_u30, 1 - orig_u30)),
                        gender = sample(c("M", "F"), N, replace = TRUE,
                                        prob = c(orig_male, 1 - orig_male)))
addmargins(prop.table(table(my_sample$age, my_sample$gender)))
                 F       M     Sum
  30+      0.24292 0.05935 0.30227
  under 30 0.55675 0.14098 0.69773
  Sum      0.79967 0.20033 1.00000

Suppose we want a subsample of that population that is instead 40% under 30 and 40% male. We can achieve that by giving each row a weight based on the ratio of the odds we want to the odds we have.

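# Weight for "under 30" rows relative to "30+" rows (target odds / current odds)
old_u30 = mean(my_sample$age == "under 30")
new_u30 = 0.4
weight_u30 = (new_u30 / old_u30) / ((1-new_u30) / (1-old_u30))

# Weight for "M" rows relative to "F" rows
old_male = mean(my_sample$gender == "M")
new_male = 0.4
weight_male = (new_male / old_male) / ((1-new_male) / (1-old_male))

# A row's weight is the product of its age weight and its gender weight
my_sample$weight = ifelse(my_sample$age == "under 30", weight_u30, 1) *
  ifelse(my_sample$gender == "M", weight_male, 1)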
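
As a sanity check on this kind of weighting, the shares you should expect from a weighted draw can be computed directly as weighted means, before sampling anything. The sketch below is self-contained and rebuilds a comparable population; the names pop, is_u30, is_male and w are illustrative and not part of the answer's code. It assumes age and gender are roughly independent in the population, as they are here by construction. If they were correlated, the product of the two weights would generally miss the targets by some margin.

```r
# Rebuild an illustrative population with independent age and gender
set.seed(42)
N <- 100000
pop <- data.frame(
  age = sample(c("under 30", "30+"), N, replace = TRUE, prob = c(0.7, 0.3)),
  gender = sample(c("M", "F"), N, replace = TRUE, prob = c(0.2, 0.8))
)

is_u30 <- pop$age == "under 30"
is_male <- pop$gender == "M"

# Odds-ratio weights: target odds divided by current odds
target <- 0.4
w_u30 <- (target / (1 - target)) / (mean(is_u30) / (1 - mean(is_u30)))
w_male <- (target / (1 - target)) / (mean(is_male) / (1 - mean(is_male)))
w <- ifelse(is_u30, w_u30, 1) * ifelse(is_male, w_male, 1)

# Expected shares when rows are drawn with replacement in proportion to w
share_u30 <- weighted.mean(is_u30, w)
share_male <- weighted.mean(is_male, w)
print(c(u30 = share_u30, male = share_male))
```

Both values should come out close to the 0.4 target, with small deviations caused by the chance correlation between age and gender in the generated data.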

Now we have a weighting for each row that will tend to bring it toward the desired shares:

library(dplyr)
my_subsample <- sample_n(my_sample, 10000, replace = TRUE, weight = my_sample$weight)

addmargins(prop.table(table(my_subsample$age, my_subsample$gender)))

The subsample is now roughly 40% male and 40% under 30:

                F      M    Sum
  30+      0.3683 0.2348 0.6031
  under 30 0.2375 0.1594 0.3969
  Sum      0.6058 0.3942 1.0000
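
For the range version of the question (50-80% under 30, 20-30% male), one possible extension is to draw each target share uniformly from its range and then apply the same weighting. This is only a sketch under assumptions: the helper odds_weight and the data frame population are hypothetical names, base sample() is used instead of dplyr to keep the example self-contained, and the achieved shares match the drawn targets only approximately because the rows are sampled at random.

```r
# Hypothetical helper: per-row weights that move the share of rows where
# `flag` is TRUE from its current value toward `target`
odds_weight <- function(flag, target) {
  current <- mean(flag)
  w <- (target / current) / ((1 - target) / (1 - current))
  ifelse(flag, w, 1)
}

set.seed(1)
N <- 100000
population <- data.frame(
  age = sample(c("under 30", "30+"), N, replace = TRUE, prob = c(0.7, 0.3)),
  gender = sample(c("M", "F"), N, replace = TRUE, prob = c(0.2, 0.8))
)

# Draw the target shares uniformly from the requested ranges
target_u30 <- runif(1, 0.5, 0.8)
target_male <- runif(1, 0.2, 0.3)

# Combine the two weights and resample rows with replacement
w <- odds_weight(population$age == "under 30", target_u30) *
  odds_weight(population$gender == "M", target_male)
idx <- sample(nrow(population), 10000, replace = TRUE, prob = w)
subsample <- population[idx, ]

achieved_u30 <- mean(subsample$age == "under 30")
achieved_male <- mean(subsample$gender == "M")
print(c(target_u30 = target_u30, achieved_u30 = achieved_u30))
print(c(target_male = target_male, achieved_male = achieved_male))
```

Calling this repeatedly without resetting the seed gives a different pair of targets each time, so the subsample shares vary across the requested ranges from run to run.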

Original answer (this generates a weighted sample from scratch, not a weighted subsample of an existing population):

N <- 1000
median_age <- 30
male <- 0.2

# rpois() takes the mean; for a Poisson the median lies close to it
my_sample <- data.frame(age = rpois(N, median_age),
           gender = sample(c("M", "F"), N, replace = TRUE, prob = c(male, 1 - male)))

median(my_sample$age) # will be 30 on most runs
table(my_sample$gender) # will be around 200 M out of 1000
