I do this a lot:

library(tidyverse)

iris %>% 
  group_by(Species) %>% 
  summarise(num_Species = n_distinct(Species)) %>% 
  mutate(perc_Species = 100 * num_Species / sum(num_Species))

So I would like to create a function that outputs the same thing but with dynamically named num_ and perc_ columns:

num_perc <- function(df, group_var, summary_var) {
  
}

I found this resource useful but it did not directly address how to reuse newly created column names in the way I want.

3 Answers

You can use as_label(enquo()) on group_var to capture the name of the column that was passed in as a string, and then build the new column names from it. Section 6.1.3 of the document you linked shows a clear example. This lets us prepend num_ and perc_ to the variable name dynamically, so the function only needs df and group_var.

library(dplyr)

num_perc <- function(df, group_var) {
  summary_lbl <- as_label(enquo(group_var))
  num_lbl <- paste0("num_", summary_lbl)
  perc_lbl <- paste0("perc_", summary_lbl)
  
  df %>%
    group_by({{ group_var }}) %>%
    summarize(!!num_lbl := n_distinct({{ group_var }})) %>%
    mutate(!!perc_lbl := 100 * .data[[num_lbl]] / sum(.data[[num_lbl]]))
}

num_perc(iris, Species)
#> # A tibble: 3 × 3
#>   Species    num_Species perc_Species
#>   <fct>            <int>        <dbl>
#> 1 setosa               1         33.3
#> 2 versicolor           1         33.3
#> 3 virginica            1         33.3

In the case where group_var and summary_var actually differ, the solution is essentially the same: build the labels from summary_var instead.

num_perc <- function(df, group_var, summary_var) {
  summary_lbl <- as_label(enquo(summary_var))
  num_lbl <- paste0("num_", summary_lbl)
  perc_lbl <- paste0("perc_", summary_lbl)
  
  df %>%
    group_by({{ group_var }}) %>%
    summarize(!!num_lbl := n_distinct({{ summary_var }})) %>%
    mutate(!!perc_lbl := 100 * .data[[num_lbl]] / sum(.data[[num_lbl]]))
}

num_perc(iris, Species, Species)
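
As a variation on the function above, the same names can be built with rlang's glue-style syntax instead of paste0(). This is only a sketch, assuming rlang >= 1.0.0 (for englue()) and dplyr >= 1.0.0; the name num_perc_glue and the mtcars example are illustrative and not part of the original question.

```r
library(dplyr)
library(rlang)

# Hypothetical variant of num_perc(): builds the column names with englue()
num_perc_glue <- function(df, group_var, summary_var) {
  # englue() interpolates the name of the column passed as summary_var
  num_col  <- englue("num_{{ summary_var }}")
  perc_col <- englue("perc_{{ summary_var }}")

  df %>%
    group_by({{ group_var }}) %>%
    # "{num_col}" := uses the string stored in num_col as the column name
    summarise("{num_col}" := n_distinct({{ summary_var }})) %>%
    # .data[[num_col]] refers back to the column that was just created
    mutate("{perc_col}" := 100 * .data[[num_col]] / sum(.data[[num_col]]))
}

# Distinct gear values per cylinder count, with each group's share of the total
num_perc_glue(mtcars, cyl, gear)
```

Keeping the generated names in num_col and perc_col is what makes it possible to reuse the new column inside mutate(), which is the part the question asks about.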

Another possible solution uses deparse(substitute(...)) to capture the expressions passed as arguments and turn them into strings:

library(tidyverse)

f <- function(df, group_var, summary_var)
{
  group_var <- deparse(substitute(group_var))
  summary_var <- deparse(substitute(summary_var))

  df %>% 
    group_by(!!sym(group_var)) %>% 
    # summary_var is a string at this point, so turn it back into a symbol before counting
    summarise(!!str_c("num_", summary_var) := n_distinct(!!sym(summary_var))) %>% 
    mutate(!!str_c("per_", summary_var) := 100 * !!sym(str_c("num_", summary_var)) / sum(!!sym(str_c("num_", summary_var))))
}

f(iris, Species, Species)

#> # A tibble: 3 × 3
#>   Species    num_Species per_Species
#>   <fct>            <int>       <dbl>
#> 1 setosa               1        33.3
#> 2 versicolor           1        33.3
#> 3 virginica            1        33.3

Are you sure n_distinct is what you want? The iris dataset has three species (setosa, versicolor, virginica), and once you group by Species each group contains exactly one distinct species, so every row gets a count of 1 and a share of 1/3. That happens to match each species' share of the rows, because iris is balanced with 50 observations per species, but in general this will not be the case.

A function with data masking will cover imbalanced datasets for you:

library(dplyr)
my_func <- function(df, var, percent) {
  df %>%
    count({{ var }}) %>%
    # name the new column after the `percent` argument
    mutate("{{ percent }}" := 100 * n / sum(n))
}

my_func(iris, Species, percent)

iris %>%
  my_func(Species, percent) #or with pipe
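
To make the difference between counting rows and counting distinct values concrete, here is a small sketch on a made-up, imbalanced table. The tibble toy and its columns group and item are invented for illustration and do not come from the question.

```r
library(dplyr)

# Hypothetical imbalanced data: group "a" has three rows, group "b" has one
toy <- tibble(
  group = c("a", "a", "a", "b"),
  item  = c("x", "x", "y", "x")
)

# Share of rows per group: a = 3 rows (75%), b = 1 row (25%)
by_rows <- toy %>%
  count(group) %>%
  mutate(percent = 100 * n / sum(n))

# Share of distinct items per group: a = 2 distinct, b = 1 distinct
by_distinct <- toy %>%
  group_by(group) %>%
  summarise(n = n_distinct(item)) %>%
  mutate(percent = 100 * n / sum(n))

by_rows
by_distinct
```

The two results only agree when every row in a group carries a different value, so which one to use depends on whether rows or distinct values are the quantity of interest.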

2 Comments

Oh yeah iris was only an example dataset, I need to count distinct.
Ok nice, thought I'd clarify! The other answers should see you through :)
