Efficient way to create stratified subsamples of a data frame, depending on frequency of a category

Question

I want to create a sub-sample of data frame df, depending on the frequency of a given category in one of its columns, e.g. a.

Let's assume we have a data frame like this:

df <- data.frame(a = rep(1:4, c(3, 9, 4, 8)),
                 b = runif(24))

Then I want to get a sub-sample of rows, proportional to the frequency of the categories in column a. First, in a random way:

smpl <- unlist(lapply(1:4, \(x) sample(c(TRUE, FALSE), 
                                       size = sum(x==df$a), 
                                       replace = TRUE)))
df[smpl,]

Here, sample has the intended effect that, on average, half of the records are returned for each category. In any single draw, however, a category may contribute more or fewer rows, or even none at all.
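Because every row is kept independently with probability 0.5, the per-category loop should not be needed for this first variant: drawing one logical value per row gives the same distribution. A minimal sketch, assuming the example df from above (the name keep is only illustrative):

```r
# example data from the question
df <- data.frame(a = rep(1:4, c(3, 9, 4, 8)),
                 b = runif(24))

# one independent TRUE/FALSE draw per row; TRUE means "keep this row"
keep <- sample(c(TRUE, FALSE), size = nrow(df), replace = TRUE)
df_random <- df[keep, ]

# number of retained rows per category (varies from run to run)
table(factor(df_random$a, levels = 1:4))
```

The counts per category still fluctuate around one half, exactly as described above, so this only shortens the code and does not make the result more deterministic.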

I am also looking for a second, "more deterministic" approach, in which only the rows are chosen at random while the number of rows per category is fixed: 50% of the N rows if N is even, and either N %/% 2 or N %/% 2 + 1 rows if N is odd. The code should be easy to read.
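One possible base R sketch of this second variant, not a definitive solution: split the row indices by category, then draw a fixed number of them per category. The helper name sample_half and the call to set.seed are assumptions made purely for illustration.

```r
set.seed(1)  # only to make the illustration reproducible

# example data from the question
df <- data.frame(a = rep(1:4, c(3, 9, 4, 8)),
                 b = runif(24))

# hypothetical helper: pick about half of the row indices of one category
sample_half <- function(idx) {
  n <- length(idx)
  # even n: exactly n/2; odd n: randomly n %/% 2 or n %/% 2 + 1
  k <- n %/% 2 + (n %% 2) * sample(0:1, 1)
  idx[sample.int(n, k)]
}

# row indices grouped by category, then sampled within each category
rows <- unlist(lapply(split(seq_len(nrow(df)), df$a), sample_half),
               use.names = FALSE)
df_half <- df[sort(rows), ]

table(df_half$a)
```

The helper samples positions with sample.int instead of calling sample(idx, k), because sample() would treat a single remaining index as the range 1:idx.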

Can you explain what you mean by "50% +/-1 of the corresponding rows"? Also, are you not satisfied with what you already have for the first approach?
– langtang, Commented Mar 2, 2023 at 0:50
With 50% +/-1, I meant either integer division (%/%) or integer division + 1. The question was edited to improve clarity. The code is for a teaching project where I am looking for elegant and clear solutions that beginners can understand. A tidyverse version would also be welcome.
– tpetzoldt, Commented Mar 2, 2023 at 6:16

tpetzoldt · Accepted Answer · 2023-03-02 07:22:35Z

In the meantime, I found a possible solution myself. First, I searched for "stratified" instead of "weighted" and changed the question title accordingly. Then I found the function slice_sample in the dplyr package. It takes either n (a number of rows) or prop (a proportion of rows), so we can do:

Case 1:

df |> slice_sample(n = nrow(df) %/% 2, weight_by = a)

Case 2:

df |> slice_sample(prop=0.5, weight_by = a)
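Note that, according to the dplyr documentation, weight_by supplies per-row sampling weights. With weight_by = a, rows with larger values of a are therefore more likely to be drawn from the whole data frame, and the number of rows per category is not fixed. If a fixed share per category is wanted, a grouped call may be closer to what the question asks. A hedged sketch, assuming dplyr is installed and using the example df (the name half is illustrative):

```r
library(dplyr)

# example data from the question
df <- data.frame(a = rep(1:4, c(3, 9, 4, 8)),
                 b = runif(24))

# sample within each category of a instead of across the whole data frame
half <- df |>
  group_by(a) |>
  slice_sample(prop = 0.5) |>
  ungroup()

# number of sampled rows per category
count(half, a)
```

For categories with an odd number of rows, prop = 0.5 is rounded down, so this variant always returns N %/% 2 rows per category (here 1, 4, 2 and 4) and never N %/% 2 + 1.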

edited Mar 2, 2023 at 7:22

answered Mar 2, 2023 at 6:50
