I have a data frame df for which I want to find the proportion of unique values in col1 that satisfy a condition in col2.
set.seed(137)
df <- data.frame(col1 = sample(LETTERS, 100, TRUE),
                 col2 = sample(-75:75, 100, TRUE),
                 col3 = sample(-75:75, 100, TRUE))
df$col2[c(23, 48, 78)] <- NA
df$col3[c(37, 68, 81)] <- NA
For example, I want to find all the unique values in col1 that have at least one value in col2 within the range -10 to 10 inclusive, and then the proportion these make up of all unique col1 values.
df %>%
  mutate(unqCol1 = n_distinct(col1)) %>%   # total number of distinct col1 values
  group_by(col1) %>%
  mutate(freq = sum(between(col2, -10, 10), na.rm = TRUE)) %>%   # in-range count per group
  filter(freq > 0) %>%                     # keep col1 values with at least one hit
  distinct(col1, unqCol1) %>%
  ungroup() %>%
  summarise(nrow(.)/unqCol1) %>%           # qualifying values / total distinct values
  unique()
which results in:
# A tibble: 1 x 1
`nrow(.)/unqCol1`
<dbl>
1 0.423
Though the above snippet is not an efficient way of doing it, I tried to achieve the result in a single piped command and it gives me the right output (any cleaner ways of rewriting the above code would be highly appreciated). I have reconfirmed the output using a base R approach:
length(unique(df$col1[df$col2 >= -10 & df$col2 <= 10 & !is.na(df$col2)]))/length(unique(df$col1))
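For reference, the same computation can be collapsed into a single summarise() call; this is only a sketch of one possible compact form, not necessarily the best one. which() is used so that NA results from between() are dropped, matching the !is.na() guard in the base R line:

```r
library(dplyr)

# Proportion of distinct col1 values having at least one col2 value in
# [-10, 10]; which() silently drops the NA comparisons from between().
df %>%
  summarise(prop = n_distinct(col1[which(between(col2, -10, 10))]) /
                   n_distinct(col1))
```

This should return the same value as the base R check above.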
I would like to rewrite the above dplyr code as a function so that it can be applied with multiple values of n (here n = 10) for the range, and for multiple columns too. Is this possible? Or should I pass multiple values within the code itself (without a function), e.g. via an apply-family approach?
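One way this could be wrapped is sketched below, assuming the column is passed as a bare name (embraced with dplyr's {{ }} tidy-eval operator) and n gives the half-width of the range; the function name prop_in_range is illustrative, not from any package:

```r
library(dplyr)

# Illustrative sketch: proportion of distinct col1 values that have at least
# one value of `col` within [-n, n]; `col` is an unquoted column name.
prop_in_range <- function(data, col, n) {
  data %>%
    summarise(prop = n_distinct(col1[which(between({{ col }}, -n, n))]) /
                     n_distinct(col1)) %>%
    pull(prop)
}

# Usage: prop_in_range(df, col2, 10), or loop over several n values:
# sapply(c(5, 10, 20), function(k) prop_in_range(df, col2, k))
```

Because the column is embraced, the same function works for col3 or any other numeric column, e.g. prop_in_range(df, col3, 10).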