I have a specific problem: I want to draw samples of an exact size from within each group. What method should I use to construct exact subsets based on group counts?

My use case is a two-stage sample design. First, for each group in my population, I want to ensure that 60% of subjects will not be selected, so I am trying to construct a sampling data frame that excludes 60% of the available subjects in each group. This sits inside a function where the user specifies the minimum proportion of subjects that must not be used, hence the 1 - .6 construction: the user has indicated that at least 60% of subjects in each group cannot be selected for sampling.

After this code, I will be sampling completely at random, to get my final sample.

Code example:

library(dplyr)

testing <- data.frame(ID = seq_len(50), Age = c(rep(18, 10), rep(19, 9), rep(20, 15), rep(21, 16)))

# No grouping here: 40% of all 50 rows are drawn, regardless of Age
testing <- testing %>%
  slice_sample(prop = 1 - .6)

As you can see, the numbers by group are not what I want. I should only have 4 subjects who are 18 years of age, 3 subjects who are 19 years of age, 6 subjects who are 20 years of age, and 6 subjects who are 21 years of age. Without setting a seed, the numbers I ended up with were 6 18-year-olds, 1 19-year-old, 6 20-year-olds, and 7 21-year-olds.
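
The target counts above come from flooring each group's size multiplied by the selectable proportion. A minimal base R sketch of that arithmetic follows; the names min_excluded and max_selectable are illustrative and do not appear in the original code.

```r
# Rebuild the example data from the question
testing <- data.frame(
  ID  = seq_len(50),
  Age = c(rep(18, 10), rep(19, 9), rep(20, 15), rep(21, 16))
)

# Assumed name: minimum proportion of each group that must NOT be selected
min_excluded <- 0.6

# Rows available in each age group
group_sizes <- table(testing$Age)

# Largest whole number of subjects that can be selected per group
# while still leaving at least min_excluded of the group untouched
max_selectable <- floor(group_sizes * (1 - min_excluded))
max_selectable
```

For this data the per-group targets are 4, 3, 6, and 6, which sum to 19 rather than 20, because flooring within each group discards the fractional parts (3.6 for age 19 and 6.4 for age 21).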

However, the overall sample size of 20 is correct.

How do I brute force the sample size within the groups to be what I need?

There are other variables in the data frame so I need to sample randomly from each age group.

EDIT: Messed up trying to give an example. In my real data I am grouping by age inside the dplyr set of commands. But neither calling group_by([Age variable]) ahead of slice_sample() nor doing the grouping inside slice_sample() works. In my real data, I get neither the correct set of samples by age nor the correct overall sample size.

I was using a semi_join to limit the ages to those that had a total remaining after doing the proportion test. For those ages for which no sample could be taken, the semi_join was being used to remove those ages from the population ahead of doing the proportional sampling. I don't know if the semi_join has caused the problem.

That said, the answer provided and accepted shifts me away from relying on the semi_join and I think is an overall large improvement to my real code.

  • I had started to use those, but the notes are that those two functions are deprecated and to use slice_sample instead. Commented Jun 26, 2020 at 23:09
  • Group by "Age" then slice_sample? Commented Jun 26, 2020 at 23:11
  • It doesn't floor the sample, so I get too many sampled from the counts that provide a remainder. :( This then breaches the "must be at least this number remaining" requirement. Commented Jun 26, 2020 at 23:16

1 Answer

You haven't defined your grouping variable.

Try the following:

set.seed(1)
x <- testing %>% group_by(Age) %>% slice_sample(prop = .4)
x %>% count()
# # A tibble: 4 x 2
# # Groups:   Age [4]
#     Age     n
#   <dbl> <int>
# 1    18     4
# 2    19     3
# 3    20     6
# 4    21     6
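
If the per-group sizes need to be pinned explicitly rather than left to how prop is rounded, one option is to compute the floored size inside each group and pass it as n. This is a sketch under that assumption, not part of the original answer; min_excluded and x_exact are illustrative names.

```r
library(dplyr)

# Example data from the question
testing <- data.frame(
  ID  = seq_len(50),
  Age = c(rep(18, 10), rep(19, 9), rep(20, 15), rep(21, 16))
)

# Assumed name: minimum proportion of each group that must NOT be selected
min_excluded <- 0.6

set.seed(1)
x_exact <- testing %>%
  group_by(Age) %>%
  # .x is the data for one age group; floor() guarantees the sample
  # never exceeds the allowed share of that group
  group_modify(~ slice_sample(.x, n = floor(nrow(.x) * (1 - min_excluded)))) %>%
  ungroup()

# Check the number of rows drawn per age group
count(x_exact, Age)
```

Because the size is computed with floor(), floating-point rounding can occasionally push a product just below a whole number and yield one fewer subject than expected. That errs on the side of selecting fewer subjects, so the "at least this proportion remaining" requirement still holds.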

Alternatively, try stratified from my "splitstackshape" package:

library(splitstackshape)
set.seed(1)
y <- stratified(testing, "Age", .4)
y[, .N, Age]
#    Age N
# 1:  18 4
# 2:  19 4
# 3:  20 6
# 4:  21 6

2 Comments

Something is going wrong with my real data as that is not what I am getting. For example, one of the counts should give me 6 ages and it's giving me 8. I'm going to have to toss this one to my supervisor. However, your suggestion worked for the simple case, so voting this as the answer. Thanks!
@Michelle, OK. Note that stratified can take a named vector of desired sample sizes. So, for example, you can specify size as something like c("18" = 1, "19" = 3, "20" = 2, "21" = 4) if you wanted 1, 3, 2, and 4 samples from age groups 18, 19, 20, and 21 respectively.
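
As a rough illustration of the named-vector form mentioned in this comment, the floored per-group sizes could be built from the group counts and passed to stratified as size. This is a sketch that assumes the splitstackshape package is installed; min_excluded, sizes, and z are illustrative names.

```r
library(splitstackshape)

# Example data from the question
testing <- data.frame(
  ID  = seq_len(50),
  Age = c(rep(18, 10), rep(19, 9), rep(20, 15), rep(21, 16))
)

# Assumed name: minimum proportion of each group that must NOT be selected
min_excluded <- 0.6

# Named vector of sample sizes: names are the age groups,
# values are the floored number of subjects that may be selected
group_sizes <- table(testing$Age)
sizes <- setNames(
  as.integer(floor(group_sizes * (1 - min_excluded))),
  names(group_sizes)
)

set.seed(1)
z <- stratified(testing, "Age", size = sizes)

# Rows drawn per age group
table(z$Age)
```

Supplying the sizes explicitly avoids depending on how a proportion is rounded within each group.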
