I have a data frame df and want to find the proportion of unique values in col1 that satisfy a condition on col2.

set.seed(137)
df <- data.frame(col1 = sample(LETTERS, 100, TRUE), 
                 col2 = sample(-75:75, 100, TRUE), 
                 col3 = sample(-75:75, 100, TRUE))

df$col2[c(23, 48, 78)] <- NA
df$col3[c(37, 68, 81)] <- NA

For example, I want to find all the unique values in col1 which have values in col2 within the range of -10 to 10 inclusive.

library(dplyr)

df %>%  
  mutate(unqCol1 = n_distinct(col1)) %>% 
  group_by(col1) %>% 
  mutate(freq = sum(between(col2, -10, 10), na.rm = TRUE)) %>% 
  filter(freq > 0) %>% distinct(col1, unqCol1) %>% 
  ungroup() %>%  
  summarise(nrow(.)/unqCol1) %>% 
  unique()

which results in:

# A tibble: 1 x 1
  `nrow(.)/unqCol1`
              <dbl>
1             0.423

The snippet above is not an efficient way of doing this, but I wanted a single piped command and it gives the right output (cleverer ways of rewriting it would be much appreciated). I have confirmed the output with a base R approach:

length(unique(df$col1[df$col2 >= -10 & df$col2 <= 10 & !is.na(df$col2)]))/length(unique(df$col1))

I would like to wrap the dplyr code above in a function so that it can be run for multiple values of n (here n = 10) defining the range -n to n, and for multiple columns too. Is this possible? Or should I pass the multiple values directly in the code, without a function, using something like the apply family?


1 Answer

As you've noticed, your (dplyr) code is overly complicated. You can compute the proportion of interest without grouping the data:

df %>% 
  tidyr::drop_na() %>%                  # drops rows with an NA in any column
  filter(between(col2, -10, 10)) %>%    # keep rows with col2 in [-10, 10]
  summarize(prop = n_distinct(col1) / n_distinct(df$col1))

A function for computing the proportion is:

my_summary <- function(df, ...) {
   df %>% 
     tidyr::drop_na() %>%
     filter(...) %>% 
     summarize(
       prop = n_distinct(col1) / n_distinct(df$col1)
     )
}

E.g.

> my_summary(df, between(col2, -10, 10))
       prop
1 0.4230769

gives the proportion in your question.
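
One thing to watch with drop_na(): called without arguments it removes rows that have an NA in any column, so a row with a missing col3 is dropped even when the filter only looks at col2. Here is a minimal sketch of the difference, using a small made-up data frame (toy, which is hypothetical and not the df from the question). It relies on filter() already discarding rows where the condition evaluates to NA.

```r
library(dplyr)

# Toy data for illustration only (not the df from the question)
toy <- data.frame(
  col1 = c("A", "B", "C", "D"),
  col2 = c(5, NA, 50, -3),
  col3 = c(1, 2, 3, NA)
)

# drop_na() with no arguments drops row 2 (NA in col2) and row 4 (NA in col3)
prop_all <- toy %>%
  tidyr::drop_na() %>%
  filter(between(col2, -10, 10)) %>%
  summarize(prop = n_distinct(col1) / n_distinct(toy$col1))

# Without drop_na(): between() returns NA for a missing col2,
# and filter() drops rows whose condition is NA
prop_col2 <- toy %>%
  filter(between(col2, -10, 10)) %>%
  summarize(prop = n_distinct(col1) / n_distinct(toy$col1))

prop_all$prop   # 0.25 (only "A" survives)
prop_col2$prop  # 0.5  ("A" and "D" survive)
```

Which behaviour you want depends on whether a missing value in another column should disqualify the row.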

EDIT

You can vectorize my_summary() and use outer() to get a matrix of proportions for combinations of col and n:

# Note: this version reads df from the calling environment
my_summary <- function(col, n) {
  df %>% 
    tidyr::drop_na() %>%
    filter(between(!!as.name(col), -n, n)) %>%  # col is a column name passed as a string
    summarize(
      prop = n_distinct(col1) / n_distinct(df$col1)
    )
}
my_summary_v <- Vectorize(my_summary)

> outer(c("col2", "col3"), c(10, 20, 30), my_summary_v)
     [,1]      [,2]      [,3]     
[1,] 0.4230769 0.5384615 0.6538462
[2,] 0.4230769 0.6538462 0.6923077
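
If you would rather have a long data frame than a matrix, the same grid of columns and ns can be sketched with expand.grid() and mapply(). This is only one possible variant: prop_in_range() is a hypothetical helper name, it takes the data as an argument instead of reading df from the calling environment, and it skips drop_na(), so rows are excluded only when the column being tested is NA. Because of that last point, its numbers are not guaranteed to match the matrix above.

```r
library(dplyr)

# Same data as in the question
set.seed(137)
df <- data.frame(col1 = sample(LETTERS, 100, TRUE),
                 col2 = sample(-75:75, 100, TRUE),
                 col3 = sample(-75:75, 100, TRUE))
df$col2[c(23, 48, 78)] <- NA
df$col3[c(37, 68, 81)] <- NA

# Hypothetical helper: proportion of distinct col1 values that have
# at least one value of `col` within [-n, n]; returns a bare number
prop_in_range <- function(data, col, n) {
  hits <- data %>% filter(between(.data[[col]], -n, n))
  n_distinct(hits$col1) / n_distinct(data$col1)
}

# One row per combination of column and n
grid <- expand.grid(col = c("col2", "col3"), n = c(10, 20, 30),
                    stringsAsFactors = FALSE)
grid$prop <- mapply(prop_in_range, col = grid$col, n = grid$n,
                    MoreArgs = list(data = df), USE.NAMES = FALSE)

grid
```

The result has one row per (col, n) pair, which is usually easier to filter or plot than a matrix.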

1 Comment

Let's say the function can be called as my_summary(df, between(col2, -n, n)) for a given n. Would it be possible to run it for different ns and different columns at the same time? The expected output is an n*m data frame/tibble/matrix.
