Say I have a population of mixed ages and genders (and maybe other attributes), and I want to generate a random subsample (sampling with replacement is OK) whose attributes meet certain targets, e.g.:

  • Sample size N
  • 50% of the sample should be age<30
  • 20% of the sample should be male

I could first randomly pick N/2 people with age<30 and N/2 with age>=30, but this would likely not have the correct gender mix. I could sub-select further and ensure that 20% of the age<30 people are male, but that over-specifies the problem: I want the overall distributions to match without fixing anything about the joint distribution of age and gender.

How do I generate this sample? What if I made it slightly more complicated and specified ranges:

  • Sample size N
  • 50-80% under age 30 (uniform probability in that range)
  • 20-30% male (uniform probability in that range)

I imagine it might be possible to generate such a sample iteratively, alternately pruning it to match each requirement until convergence, but I'm not sure how to do it properly. The dumbest way, of course, would be to just generate random samples and reject them if they don't match these requirements.
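
For reference, the rejection approach mentioned above might look like the following minimal sketch. The population `pop`, the helper names `in_range` and `draw_sample`, and the sample size of 200 are all made up for illustration. Note that this only enforces the ranges; it does not make the resulting shares uniformly distributed within them.

```r
set.seed(1)

# Hypothetical population: about 60% under 30 and about 25% male.
pop <- data.frame(
  age    = floor(runif(5000, 18, 38)),
  gender = sample(c("M", "F"), 5000, replace = TRUE, prob = c(0.25, 0.75))
)

in_range <- function(x, lo, hi) x >= lo && x <= hi

# Draw samples with replacement until one satisfies both range constraints.
draw_sample <- function(pop, n, max_tries = 1000) {
  for (i in seq_len(max_tries)) {
    s <- pop[sample.int(nrow(pop), n, replace = TRUE), ]
    if (in_range(mean(s$age < 30), 0.5, 0.8) &&
        in_range(mean(s$gender == "M"), 0.2, 0.3)) {
      return(s)
    }
  }
  stop("no acceptable sample found in ", max_tries, " tries")
}

s <- draw_sample(pop, 200)
```

This works only when the population's own shares are already close to the target ranges; otherwise almost every draw is rejected.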

1 Answer

EDIT:

Here's a sample that is 70% under 30 and 20% male:

N <- 100000
orig_u30 <- 0.7   # share under 30 in the starting sample
orig_male <- 0.2  # share male in the starting sample
set.seed(42)
my_sample <- data.frame(age = sample(c("under 30", "30+"), N, replace = TRUE,
                                     prob = c(orig_u30, 1 - orig_u30)),
                        gender = sample(c("M", "F"), N, replace = TRUE,
                                        prob = c(orig_male, 1 - orig_male)))
addmargins(prop.table(table(my_sample$age, my_sample$gender)))
                 F       M     Sum
  30+      0.24292 0.05935 0.30227
  under 30 0.55675 0.14098 0.69773
  Sum      0.79967 0.20033 1.00000

Suppose we want a subsample that is instead 40% under 30 and 40% male. We can get there by weighting each row according to how the share we want compares with the share we have. For each attribute, the weight below is the ratio of the target odds to the current odds; rows outside the group keep a weight of 1.

# Odds-ratio weight for "under 30": target odds / current odds
old_u30 <- mean(my_sample$age == "under 30")
new_u30 <- 0.4
weight_u30 <- (new_u30 / old_u30) / ((1 - new_u30) / (1 - old_u30))

# Same construction for "M"
old_male <- mean(my_sample$gender == "M")
new_male <- 0.4
weight_male <- (new_male / old_male) / ((1 - new_male) / (1 - old_male))

# Row weight is the product of the two attribute weights
my_sample$weight <- ifelse(my_sample$age == "under 30", weight_u30, 1) *
  ifelse(my_sample$gender == "M", weight_male, 1)
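
As a quick sanity check before sampling, the shares that such weights imply can be computed directly as weighted shares. The sketch below rebuilds a comparable data set from scratch; the names `df` and `odds_weight` are illustrative and not part of the code above. Multiplying the two weights reproduces both targets exactly only when age and gender are independent in the source data, which holds here (up to sampling noise) by construction.

```r
set.seed(1)
n <- 100000
df <- data.frame(
  u30  = runif(n) < 0.7,  # TRUE for "under 30"
  male = runif(n) < 0.2   # drawn independently of age
)

# Odds-ratio weight: target odds divided by current odds.
odds_weight <- function(flag, target) {
  current <- mean(flag)
  (target / (1 - target)) / (current / (1 - current))
}

w <- ifelse(df$u30,  odds_weight(df$u30,  0.4), 1) *
     ifelse(df$male, odds_weight(df$male, 0.4), 1)

# Shares a weighted draw would have in expectation.
expected_u30  <- sum(w[df$u30])  / sum(w)
expected_male <- sum(w[df$male]) / sum(w)
```

Both `expected_u30` and `expected_male` should come out very close to 0.4. If they drift noticeably from the targets on real data, that is a sign the attributes are correlated.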

Now we have a weighting for each row that will tend to bring it toward the desired shares:

library(dplyr)
my_subsample <- sample_n(my_sample, 10000, replace = TRUE, weight = my_sample$weight)

addmargins(prop.table(table(my_subsample$age, my_subsample$gender)))

The subsample is now approximately 40% male and 40% under 30:

                F      M    Sum
  30+      0.3683 0.2348 0.6031
  under 30 0.2375 0.1594 0.3969
  Sum      0.6058 0.3942 1.0000
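
When age and gender are correlated in the source data, the multiplied weights hit both targets only approximately. One common remedy, close to the iterative idea in the question, is raking (iterative proportional fitting): rescale the weights to match one margin, then the other, and repeat until both hold. The following is a minimal base-R sketch on made-up correlated data; `rake_weights` and its arguments are hypothetical names, not from any package.

```r
set.seed(1)
n <- 50000
u30 <- runif(n) < 0.7
# Gender depends on age here, so the two attributes are correlated.
male <- runif(n) < ifelse(u30, 0.1, 0.45)

rake_weights <- function(flags, targets, tol = 1e-10, max_iter = 100) {
  w <- rep(1, length(flags[[1]]))
  for (iter in seq_len(max_iter)) {
    for (k in seq_along(flags)) {
      f <- flags[[k]]
      share <- sum(w[f]) / sum(w)
      # Rescale both groups so this margin matches its target exactly.
      w <- w * ifelse(f, targets[k] / share, (1 - targets[k]) / (1 - share))
    }
    shares <- sapply(flags, function(f) sum(w[f]) / sum(w))
    if (max(abs(shares - targets)) < tol) break
  }
  w
}

w <- rake_weights(list(u30, male), c(0.4, 0.4))
raked_u30  <- sum(w[u30])  / sum(w)
raked_male <- sum(w[male]) / sum(w)

# Weighted draw with replacement using the raked weights.
idx <- sample.int(n, 10000, replace = TRUE, prob = w)
sub_u30  <- mean(u30[idx])
sub_male <- mean(male[idx])
```

The raked weights constrain only the two margins; the joint distribution of age and gender in the subsample is left to follow the source data as closely as the margins allow.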

Original answer (this generates a weighted sample, but not a weighted subsample):

N <- 1000
median_age <- 30
male <- 0.2

# Poisson-distributed ages centred on median_age; gender drawn with P(M) = male
my_sample <- data.frame(age = rpois(N, median_age),
           gender = sample(c("M", "F"), N, replace = TRUE, prob = c(male, 1 - male)))

median(my_sample$age) # will be 30 on most runs
table(my_sample$gender) # around 200 of the 1000 rows will be "M"