I have a data frame such as

df <- data.frame(matrix(rnorm(40), nrow=20))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=5)
df$score <- rep(c(1,2,3,5), each = 5)

I want to split the rows into two data frames, sampling within each group defined by the two columns color and score, so that each data frame gets an almost equal number of rows from every group. For example, there are 5 rows with color blue and score 1: 2 should go to one data frame and 3 to the other. If a group has six rows, 3 should go to each data frame.

1 Answer

If I've understood correctly, you can try something like:

set.seed(10)

df <- data.frame(matrix(rnorm(40), nrow=20))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=5)
df$score <- rep(c(1,2,3,5), each = 5)

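library(dplyr)

df %>%
  group_by(color, score) %>%
  # within each group, build alternating 1/0 labels and shuffle them,
  # so rows are assigned to a split at random
  mutate(grp = sample(seq_along(score) %% 2)) %>%
  # regroup by the label and return one tibble per label
  group_by(grp) %>%
  group_split()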


[[1]]
# A tibble: 8 x 5
      X1     X2 color  score   grp
   <dbl>  <dbl> <chr>  <dbl> <dbl>
1  0.675  0.257 blue       1     0
2 -0.548  0.365 blue       1     0
3 -1.89   0.851 red        2     0
4  1.09  -0.173 red        2     0
5  1.65  -0.500 yellow     3     0
6 -0.186  0.564 yellow     3     0
7 -0.208 -1.70  pink       5     0
8  0.661  0.447 pink       5     0

[[2]]
# A tibble: 12 x 5
        X1      X2 color  score   grp
     <dbl>   <dbl> <chr>  <dbl> <dbl>
 1  0.0555  2.12   blue       1     1
 2 -0.738  -0.843  blue       1     1
 3  0.833  -0.939  blue       1     1
 4 -1.57   -0.172  red        2     1
 5  1.43    0.767  red        2     1
 6  1.14    1.32   red        2     1
 7  1.01    0.997  yellow     3     1
 8 -1.20   -0.357  yellow     3     1
 9  0.474  -0.0911 yellow     3     1
10 -2.44    0.765  pink       5     1
11  1.15    0.463  pink       5     1
12 -0.426   1.53   pink       5     1
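
To see why the two pieces have 8 and 12 rows, it may help to look at the labels generated for a single group of 5 rows. This is only a small illustration; `labels` and `counts` are names made up for it:

```r
# Labels produced for one group of 5 rows, before shuffling
labels <- seq_along(1:5) %% 2
labels
# 1 0 1 0 1

# Every odd-sized group gets one more 1 than 0
counts <- table(labels)

# With 4 such groups the totals are 4 * 2 = 8 and 4 * 3 = 12
4 * counts
```

Because the labels always start at 1, the extra row of every odd-sized group lands in the same split.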

3 Comments

Thanks for your answer. If the groups have different numbers of rows, how can I make sure that both data frames end up with an almost equal number of rows?
Could you give an example of your expected output?
The code splits the data into two data frames of 8 and 12 rows. A more balanced result would be 10 rows in each. I mean grouping on the two columns and getting an almost equal number of rows in each data frame.
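
One possible way to get the 10/10 split asked for here, sketched in base R rather than taken from the answer: put the rows in a random order within each (color, score) group, keep the groups contiguous, and alternate the label across the whole data frame instead of restarting it in every group. The names `ord`, `grp` and `splits` are illustrative only.

```r
set.seed(10)

df <- data.frame(matrix(rnorm(40), nrow = 20))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 5)
df$score <- rep(c(1, 2, 3, 5), each = 5)

# Sort by group, breaking ties with random numbers so the order
# of rows inside each group is random
ord <- order(df$color, df$score, runif(nrow(df)))

# Alternate 1/0 along that ordering without restarting per group,
# then map the labels back to the original row positions
grp <- integer(nrow(df))
grp[ord] <- seq_along(ord) %% 2

# One data frame per label
splits <- split(df, grp)
sapply(splits, nrow)

# Rows per group in each split
table(paste(df$color, df$score), grp)
```

Since the labels alternate over contiguous groups, the two counts inside any group differ by at most one, and the overall totals differ by at most one as well. Which split receives the extra row of an odd-sized group depends on where that group falls in the sort order, so it is not itself random.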
