How can I take a random sample (with or without replacement) but with given probabilities?

I am trying to draw a random sample of rows from the iris data frame, subject to a condition on Species: 80% versicolor and 20% virginica.

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa 
  • Usually, I use this function: randomRows = function(df, n){ return(df[sample(nrow(df), n, replace = FALSE), ]) } Commented Apr 25, 2017 at 10:45
  • Possible duplicate of stackoverflow.com/questions/26110665/… Commented Apr 25, 2017 at 10:53
  • Also see dplyr::sample_n and dplyr::sample_frac. Commented Apr 25, 2017 at 10:55

3 Answers

You could try this in base R:

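# sample a fraction `percent` of the rows of `a`; round() avoids truncating
# sizes such as 28.999999 that arise from floating-point multiplication
f.sample <- function(a, percent, replace = TRUE) a[sample(nrow(a), round(nrow(a) * percent), replace = replace), ]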

f.sample(iris[iris$Species=="versicolor",], 0.8)
f.sample(iris[iris$Species=="virginica",], 0.2)

You can set the replace argument accordingly.
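
If the goal is instead that 80% of the rows in the sample itself are versicolor (whatever the group sizes in the source data), one option is to fix the per-group counts first and then sample within each group. The following is only a minimal sketch under that assumption; the names n, probs, counts, parts and exact_sample are made up for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n     <- 50                                   # total sample size (assumed)
probs <- c(versicolor = 0.8, virginica = 0.2) # target share of each group
counts <- round(n * probs)                    # rows to draw per group: 40 and 10

# draw the required number of rows from each group, without replacement
parts <- lapply(names(counts), function(g) {
  rows <- which(iris$Species == g)
  iris[rows[sample.int(length(rows), counts[[g]])], ]
})

# combine the pieces and drop the unused "setosa" level
exact_sample <- droplevels(do.call(rbind, parts))
table(exact_sample$Species)
# versicolor  virginica
#         40         10
```

Without replacement, this fails if a group has fewer rows than its requested count; in that case the within-group sampling would need replace = TRUE.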

5 Comments

Nice one. I was going to write it ;-) Just a word of warning for the OP: in the resulting data frames, Species remains a factor with 3 levels, although you are only considering 2 (versicolor and virginica). To drop the unused levels (in case you need to), you can use droplevels(df), assuming df is the resulting, filtered data frame.
They want to take a random sample with 80% versicolor and 20% virginica. Your approach implies that the group sizes must be equal, since you just sample X% of all group entries. While that is true for the iris data set, it is often not true in real-world data sets.
@docendodiscimus Regardless of the underlying data set, if we want x% of a particular group, why is it not true? I want to see what I'm missing.
If I understand the OP correctly, they want a random sample of the original data where 80% of the rows of the new data set are "versicolor" and 20% are "virginica". Assume, for example, that the initial data set has 50% versicolor, 10% virginica and the rest other Species.
@docendodiscimus Aha, thanks. That's another standpoint. But given the definition of the question (i.e., for the iris data frame), it's true.

I seem to have a different understanding than the other answerers.

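The following function should produce an 80/20 data set regardless of the group sizes in the original data set. The number of rows per group is itself drawn at random with the given probabilities, so the split is 80/20 in expectation rather than exactly.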

foo <- function(DF, n = 50, group_var, groups, probs, replace = FALSE) {

  # subset relevant groups & split, keeping the list in the order of `groups`
  DF <- DF[DF[[group_var]] %in% groups, ]
  DF <- split(DF, as.character(DF[[group_var]]))
  DF <- DF[groups]

  # sample number of observations per group (this requires replace = TRUE)
  smpl <- sample(groups, size = n, replace = TRUE, prob = probs)
  # count draws per group in the order of `groups`, keeping zero counts
  counts <- c(table(factor(smpl, levels = groups)))
  # subset random rows per group according to the sampled counts
  DF <- Map(function(x, y) x[sample(seq_len(nrow(x)), y, replace = replace), ], DF, counts)

  # combine and clean up
  DF <- do.call(rbind, DF)
  DF <- DF[sample(nrow(DF)), ]  # shuffle rows; not really necessary
  row.names(DF) <- NULL         # not really necessary
  DF
}


foo(iris, 50, "Species", c("versicolor", "virginica"), c(0.8, 0.2))
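
A related way to sample "with given probabilities" is to pass per-row weights to sample() through its prob argument. The sketch below is only an illustration, not part of the function above: it assumes sampling with replacement, and the names target, sub, group_sizes, w and weighted_sample are made up for the example. Each row gets the target probability of its group divided by the size of that group, so every single draw is versicolor with probability 0.8.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

target <- c(versicolor = 0.8, virginica = 0.2)  # desired group probabilities

# keep only the groups of interest and drop the unused factor level
sub <- droplevels(iris[iris$Species %in% names(target), ])
group_sizes <- table(sub$Species)

# row weight = target probability of its group / number of rows in that group
grp <- as.character(sub$Species)
w <- target[grp] / as.numeric(group_sizes[grp])

# draw rows with replacement using the per-row weights
idx <- sample(nrow(sub), size = 1000, replace = TRUE, prob = w)
weighted_sample <- sub[idx, ]

prop.table(table(weighted_sample$Species))  # roughly 0.8 / 0.2
```

Because the weights are normalized by group size, this also works when the groups are unequal in the source data. With replace = FALSE the realized proportions drift away from the targets as a group is used up, so the weighting is only approximate in that case.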

We can use quosures from the development version of dplyr (soon to be released as 0.6.0) to create the function:

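library(tidyverse)
f.sample <- function(dat, colN, value, perc){
  colN <- enquo(colN)               # capture the column name as a quosure
  value <- quo_name(enquo(value))   # capture the bare value and convert it to a string
  dat %>%
    filter(UQ(colN) == UQ(value)) %>%  # keep only the rows of the requested group
    sample_frac(perc) %>%              # sample a fraction of those rows
    droplevels
}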

f.sample(iris, Species, versicolor, 0.8)
f.sample(iris, Species, virginica, 0.2)
#Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#20          6.0         2.2          5.0         1.5 virginica
#9           6.7         2.5          5.8         1.8 virginica
#15          5.8         2.8          5.1         2.4 virginica
#10          7.2         3.6          6.1         2.5 virginica
#12          6.4         2.7          5.3         1.9 virginica
#49          6.2         3.4          5.4         2.3 virginica
#22          5.6         2.8          4.9         2.0 virginica
#34          6.3         2.8          5.1         1.5 virginica
#2           5.8         2.7          5.1         1.9 virginica
#44          6.8         3.2          5.9         2.3 virginica

enquo works much like substitute: it captures the input argument and converts it to a quosure, while quo_name converts a quosure to a string. Inside filter/group_by/summarise/mutate, the quosures are evaluated by unquoting them (with !! or UQ).
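
As a small illustration of the capture-and-unquote pattern described above, here is a minimal sketch. count_matching is a hypothetical helper written only for this example; it calls enquo and quo_name through the rlang namespace and unquotes with !!, wrapped in parentheses to keep operator precedence explicit.

```r
library(dplyr)

# hypothetical helper, for illustration only:
# counts the rows of `dat` where column `col` equals the bare name `value`
count_matching <- function(dat, col, value) {
  col   <- rlang::enquo(col)                       # quosure wrapping the column symbol
  value <- rlang::quo_name(rlang::enquo(value))    # bare name converted to a string
  dat %>%
    filter((!!col) == (!!value)) %>%               # unquote both inside filter()
    nrow()
}

n_versicolor <- count_matching(iris, Species, versicolor)
n_versicolor
# 50
```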


Based on the comments below, we modified the function so that it also works in other cases:

f.sample2 <- function(dat, colN, values, perc){
  colN <- enquo(colN)
  dat %>%
    filter(UQ(colN) %in% values) %>%
    droplevels %>%
    nest(-UQ(colN)) %>%
    .$data %>%
    setNames(values) %>%
    Map(sample_frac, ., perc) %>%
    bind_rows(.id = quo_name(colN))
}


res <- f.sample2(iris, Species, c("versicolor", "virginica"), c(0.8, 0.2))
prop.table(table(res$Species))
#versicolor  virginica 
#      0.8        0.2 

2 Comments

They want to take a random sample with 80% versicolor and 20% virginica. Your approach implies that the group sizes must be equal since you just sample X% of all group entries. While it is true for the iris data set, this is often not true in real world data sets.
@docendodiscimus I posted a new function. Does it meet your conditions? I was trying to replace the Map with map2 or map, but it was not working correctly. Do you have any advice? Thanks.
