Sample random rows in dataframe with probability

Question

How can I take a random sample (with or without replacement) but with given probabilities?

I am trying to extract a random sample of rows from the iris data frame, subject to this condition on species: 80% versicolor and 20% virginica.

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Usually, I use this function: randomRows = function(df,n){ return(df[sample(nrow(df),n,rep=F),]) }
– Math, Commented Apr 25, 2017 at 10:45
Possible duplicate of stackoverflow.com/questions/26110665/…
– zx8754, Commented Apr 25, 2017 at 10:53

989 · Accepted Answer · 2017-04-25 10:55:11Z

3

You could try this in base R:

f.sample <- function(a, percent) a[sample(nrow(a), nrow(a)*percent, replace = TRUE),]

f.sample(iris[iris$Species=="versicolor",], 0.8)
f.sample(iris[iris$Species=="virginica",], 0.2)

You can set the replace argument accordingly.
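As a minimal sketch of how the two calls combine and what the replace argument changes, the snippet below restates the helper with replace exposed as a parameter. The name f.sample_r and that extra parameter are assumptions added here for illustration; they are not part of the answer.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical variant of f.sample with `replace` exposed as an argument.
f.sample_r <- function(a, percent, replace = TRUE) {
  # round() makes the requested sample size an explicit whole number
  a[sample(nrow(a), round(nrow(a) * percent), replace = replace), ]
}

versi <- f.sample_r(iris[iris$Species == "versicolor", ], 0.8, replace = FALSE)
virg  <- f.sample_r(iris[iris$Species == "virginica", ], 0.2, replace = FALSE)
res   <- rbind(versi, virg)

nrow(res)                     # 40 + 10 rows, because each species has 50 rows
anyDuplicated(rownames(res))  # 0: no row is drawn twice without replacement
```

With replace = TRUE the same row can be drawn more than once, so the combined result may contain repeated observations.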

answered Apr 25, 2017 at 10:55


5 Comments

Umberto Over a year ago

Nice one, I was going to write it ;-) Just a word of warning for the OP: in the resulting data frames, Species remains a factor with 3 levels, although you are only considering 2 (versicolor and virginica). To drop the unused factor levels (in case you need to), you can use droplevels(df), assuming df is the resulting, filtered data frame.
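The warning above can be checked with a short base R sketch; the variable name sub is only illustrative.

```r
# Subset iris to the two species of interest.
sub <- iris[iris$Species %in% c("versicolor", "virginica"), ]

nlevels(sub$Species)    # still 3: "setosa" survives as an empty level
table(sub$Species)      # setosa is listed with a count of 0

sub <- droplevels(sub)  # drop factor levels that no longer occur
levels(sub$Species)     # "versicolor" "virginica"
```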

talat Over a year ago

They want to take a random sample with 80% versicolor and 20% virginica. Your approach implies that the group sizes must be equal, since you just sample X% of all group entries. While that is true for the iris data set, it is often not true in real-world data sets.

989 Over a year ago

@docendodiscimus Regardless of the underlying data set, if we want x% of a particular group, why is that not true? I want to see what I'm missing.

talat Over a year ago

If I understand the OP correctly, they want a random sample of the original data where 80% of the rows of the new data set are "versicolor" and 20% are "virginica". Assume, for example, that the initial data set has 50% versicolor, 10% virginica, and the rest other species.

989 Over a year ago

@docendodiscimus Aha, thanks. That's another standpoint. But given the definition of the question (i.e., for the iris data frame), it's true.
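To make the point of this exchange concrete, here is a small sketch on made-up unbalanced data (50% versicolor, 10% virginica, 40% setosa, mirroring the example in the comment). The data frame df and the helper take are hypothetical and exist only for this illustration.

```r
set.seed(1)

# Hypothetical unbalanced data: 500 versicolor, 100 virginica, 400 setosa.
df <- data.frame(
  Species = rep(c("versicolor", "virginica", "setosa"), times = c(500, 100, 400)),
  x = rnorm(1000)
)

# Sample a fixed fraction of the rows of one group (without replacement).
take <- function(d, frac) d[sample(nrow(d), round(nrow(d) * frac)), , drop = FALSE]

# Per-group fractions: 80% of the versicolor rows, 20% of the virginica rows.
res <- rbind(take(df[df$Species == "versicolor", ], 0.8),
             take(df[df$Species == "virginica", ], 0.2))

table(res$Species)              # 400 versicolor, 20 virginica
prop.table(table(res$Species))  # about 0.95 / 0.05, not 0.80 / 0.20
```

Sampling a fraction of each group yields an 80/20 result only when the groups start out equally sized, as they do in iris.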

talat · Accepted Answer · 2017-04-25 11:56:34Z

I seem to have a different understanding than the other answerers.

The following function should produce an 80/20 data set regardless of the group sizes in the original data set.

foo <- function(DF, n = 50, group_var, groups, probs, replace = FALSE) {

  # subset relevant groups & split; order the list to match `groups`
  DF <- DF[DF[[group_var]] %in% groups, ]
  DF <- split(DF, as.character(DF[[group_var]]))
  DF <- DF[match(groups, names(DF))]

  # draw how many observations each group contributes (this requires replace = TRUE)
  smpl <- sample(groups, size = n, replace = TRUE, prob = probs)
  # count per group in the order of `groups`, keeping groups that drew zero
  counts <- c(table(factor(smpl, levels = groups)))
  # sample that many random rows from each group
  DF <- Map(function(x, y) x[sample(nrow(x), y, replace = replace), ], DF, counts)

  # combine and clean up
  DF <- do.call(rbind, DF)
  DF <- DF[sample(nrow(DF)), ]  # shuffle rows; not really necessary
  row.names(DF) <- NULL         # not really necessary
  DF
}


foo(iris, 50, "Species", c("versicolor", "virginica"), c(0.8, 0.2))
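An alternative sketch, not taken from the answer: base R's sample() accepts a prob vector of per-element weights, so each row can be given the weight "target group probability divided by group size". The variable names below (sub, w, idx) are illustrative. With replace = TRUE the group shares are 80/20 in expectation; any single draw will vary around that.

```r
set.seed(123)

groups <- c("versicolor", "virginica")
probs  <- c(0.8, 0.2)

# Keep only the groups of interest and drop the unused factor level.
sub <- droplevels(iris[iris$Species %in% groups, ])
sp  <- as.character(sub$Species)

# Row weight = target group probability / number of rows in that group,
# so the weights within each group sum to that group's probability.
grp_size <- table(sp)
w <- probs[match(sp, groups)] / as.numeric(grp_size[sp])

idx <- sample(nrow(sub), size = 50, replace = TRUE, prob = w)
res <- sub[idx, ]

prop.table(table(res$Species))  # close to 0.8 / 0.2, varies from draw to draw
```

Unlike the function above, this draws rows directly rather than first drawing group counts. Without replacement the weights change as rows are removed, so the realized shares can drift from the targets.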

akrun · Accepted Answer · 2017-04-25 12:46:14Z

3

We can use quosures from the development version of dplyr (soon to be released as 0.6.0) to create the function.

library(tidyverse)
f.sample <- function(dat, colN, value, perc){
  colN <- enquo(colN)
  value <- quo_name(enquo(value))
  dat %>%
    filter(UQ(colN) == UQ(value)) %>%
    sample_frac(perc) %>%
    droplevels
}

f.sample(iris, Species, versicolor, 0.8)
f.sample(iris, Species, virginica, 0.2)
#Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#20          6.0         2.2          5.0         1.5 virginica
#9           6.7         2.5          5.8         1.8 virginica
#15          5.8         2.8          5.1         2.4 virginica
#10          7.2         3.6          6.1         2.5 virginica
#12          6.4         2.7          5.3         1.9 virginica
#49          6.2         3.4          5.4         2.3 virginica
#22          5.6         2.8          4.9         2.0 virginica
#34          6.3         2.8          5.1         1.5 virginica
#2           5.8         2.7          5.1         1.9 virginica
#44          6.8         3.2          5.9         2.3 virginica

enquo works much like substitute: it takes the input argument and converts it to a quosure, while quo_name converts a quosure to a string. Inside filter/group_by/summarise/mutate, the quosures are evaluated by unquoting them with !! or UQ.
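The capture-then-unquote pattern described above can be sketched in isolation. The helper keep_equal is hypothetical and uses the !! spelling of unquoting; rlang::as_label is used here to turn the quosure into a string, which is an assumption about the installed rlang version rather than something the answer relies on.

```r
library(dplyr)

# Hypothetical helper: keep rows where the column named by `col` equals `value`.
keep_equal <- function(dat, col, value) {
  col <- enquo(col)                                # capture the bare column name as a quosure
  message("filtering on: ", rlang::as_label(col))  # quosure -> string
  dat %>% filter((!!col) == value)                 # unquote inside the dplyr verb
}

out <- keep_equal(iris, Species, "virginica")
nrow(out)  # 50
```

The parentheses around !!col are not strictly required in recent rlang versions, but they make the intended precedence explicit.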

Based on the comments below, we modified the function so that it would work for other cases

f.sample2 <- function(dat, colN, values, perc){
  colN <- enquo(colN)
  dat %>%
    filter(UQ(colN) %in% values) %>%
    droplevels %>%
    nest(-UQ(colN)) %>%
    .$data %>%
    # assumes the groups appear in the data in the same order as `values`
    setNames(values) %>%
    Map(sample_frac, ., perc) %>%
    bind_rows(.id = quo_name(colN))
}


res <- f.sample2(iris, Species, c("versicolor", "virginica"), c(0.8, 0.2))
prop.table(table(res$Species))
#versicolor  virginica 
#      0.8        0.2

edited Apr 25, 2017 at 12:46

answered Apr 25, 2017 at 11:11


2 Comments

talat Over a year ago

They want to take a random sample with 80% versicolor and 20% virginica. Your approach implies that the group sizes must be equal since you just sample X% of all group entries. While it is true for the iris data set, this is often not true in real world data sets.

akrun Over a year ago

@docendodiscimus I posted a new function. Does it meet your conditions? I was trying to replace the Map with map2 or map, but it was not working correctly. Do you have any advice? Thanks.
