dplyr summarize: create variables from named vector

Question

Here's my problem:

I am using a function that returns a named vector. Here's a toy example:

toy_fn <- function(x) {
    y <- c(mean(x), sum(x), median(x), sd(x))
    names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
    y
}

I am using group_by in dplyr to apply this function for each group (typical split-apply-combine). So, here's my toy data.frame:

set.seed(1234567)
toy_df <- data.frame(id = 1:1000, 
                     group = sample(letters, 1000, replace = TRUE), 
                     value = runif(1000))

And here's the result I am aiming for:

toy_summary <- 
    toy_df %>% 
    group_by(group) %>% 
    summarize(Right = toy_fn(value)["Right"], 
              Wrong = toy_fn(value)["Wrong"], 
              Unanswered = toy_fn(value)["Unanswered"], 
              Invalid = toy_fn(value)["Invalid"])

> toy_summary
Source: local data frame [26 x 5]

   group     Right    Wrong Unanswered   Invalid
1      a 0.5038394 20.15358  0.5905526 0.2846468
2      b 0.5048040 15.64892  0.5163702 0.2994544
3      c 0.5029442 21.62660  0.5072733 0.2465612
4      d 0.5124601 14.86134  0.5382463 0.2681955
5      e 0.4649483 17.66804  0.4426197 0.3075080
6      f 0.5622644 12.36982  0.6330269 0.2850609
7      g 0.4675324 14.96104  0.4692404 0.2746589

It works! But it is not cool to call the same function four times. I would rather have dplyr take the named vector and create a new variable for each element in it. Something like this:

toy_summary <- 
    toy_df %>% 
    group_by(group) %>% 
    summarize(toy_fn(value))

This, unfortunately, does not work because "Error: expecting a single value".

I thought, OK, let's just convert the vector to a data.frame using data.frame(as.list(x)). But this does not work either. I tried many things, but I couldn't trick dplyr into thinking it's actually receiving one single value (observation) for 4 different variables. Is there any way to help dplyr realize that?

David Arenburg · Accepted Answer · 2015-05-25 15:56:13Z

6

One possible solution is to use dplyr's standard evaluation (SE) capabilities. For example, set up your function as follows:

dots <- setNames(list(~ mean(value),
                      ~ sum(value),
                      ~ median(value),
                      ~ sd(value)),
                 c("Right", "Wrong", "Unanswered", "Invalid"))

Then, you can use summarize_ (with a _) as follows

toy_df %>% 
  group_by(group) %>% 
  summarize_(.dots = dots)
# Source: local data table [26 x 5]
# 
#    group     Right    Wrong Unanswered   Invalid
# 1      o 0.4490776 17.51403  0.4012057 0.2749956
# 2      s 0.5079569 15.23871  0.4663852 0.2555774
# 3      x 0.4620649 14.78608  0.4475117 0.2894502
# 4      a 0.5038394 20.15358  0.5905526 0.2846468
# 5      t 0.5041168 24.19761  0.5330790 0.3171022
# 6      m 0.4806628 21.14917  0.4805273 0.2825026
# 7      c 0.5029442 21.62660  0.5072733 0.2465612
# 8      w 0.4932484 17.75694  0.4891746 0.3309680
# 9      q 0.5350707 22.47297  0.5608505 0.2749941
# 10     g 0.4675324 14.96104  0.4692404 0.2746589
# ..   ...       ...      ...        ...       ...

Though it looks nice, there is a big catch: you have to know in advance which column you are going to operate on (value) when setting up dots, so it won't work on a column with a different name unless you set up dots accordingly.
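One way to soften that catch, sketched under assumptions: build the formula list from a column name passed as a string. `make_dots` is a hypothetical helper (it is not part of dplyr) that uses only base R, and the list it returns has the same shape as `dots` above. The last lines evaluate the formulas by hand on made-up data, just to show what each one computes.

```r
# Hypothetical helper (not part of dplyr): builds the named list of
# one-sided formulas for an arbitrary column name given as a string.
make_dots <- function(col) {
  fns <- c(Right = "mean", Wrong = "sum", Unanswered = "median", Invalid = "sd")
  # lapply() keeps the names of `fns`, so the result is already named
  lapply(fns, function(f) as.formula(paste0("~ ", f, "(", col, ")")))
}

dots_value <- make_dots("value")

# For a one-sided formula, element [[2]] is the right-hand-side call.
# Evaluating it against a small made-up data frame shows the result per formula.
demo_df <- data.frame(value = c(1, 2, 3, 6))
manual <- sapply(dots_value, function(f) eval(f[[2]], demo_df))
```

The list could then be passed as `summarize_(.dots = make_dots("value"))`. Note that `summarize_` has been deprecated in later dplyr releases, so this only applies to versions where it is still available.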

As a bonus, here's a simple data.table solution that uses your original function:

library(data.table)
setDT(toy_df)[, as.list(toy_fn(value)), by = group]
#     group     Right    Wrong Unanswered   Invalid
#  1:     o 0.4490776 17.51403  0.4012057 0.2749956
#  2:     s 0.5079569 15.23871  0.4663852 0.2555774
#  3:     x 0.4620649 14.78608  0.4475117 0.2894502
#  4:     a 0.5038394 20.15358  0.5905526 0.2846468
#  5:     t 0.5041168 24.19761  0.5330790 0.3171022
#  6:     m 0.4806628 21.14917  0.4805273 0.2825026
#  7:     c 0.5029442 21.62660  0.5072733 0.2465612
#  8:     w 0.4932484 17.75694  0.4891746 0.3309680
#  9:     q 0.5350707 22.47297  0.5608505 0.2749941
# 10:     g 0.4675324 14.96104  0.4692404 0.2746589
#...
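The step doing the work in the data.table call is `as.list()`: it turns the named vector into a named list with one element per name, and each list element can then become its own column. A minimal base R sketch of that conversion, reusing `toy_fn` from the question on made-up input:

```r
toy_fn <- function(x) {
  y <- c(mean(x), sum(x), median(x), sd(x))
  names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
  y
}

vec <- toy_fn(c(0.2, 0.4, 0.9))  # named numeric vector of length 4
lst <- as.list(vec)              # named list: one element per name
one_row <- data.frame(lst)       # one row, one column per name
```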

answered May 25, 2015 at 15:56

David Arenburg

3 Comments

Hernando Casas Over a year ago

Nice use of data.table. The dplyr-based solution you propose does not work for me because I cannot modify the function. I like the data.table way very much, although I was looking for a dplyr-based solution because I have to call this function after a bunch of full_join, filtering and other data wrangling done using dplyr. So it seems natural to use dplyr as well.

David Arenburg Over a year ago

What do you mean by "I cannot modify the function"?

Hernando Casas Over a year ago

I meant that I cannot set the function as you suggest, because you create an object with a formula for each of the return values of my example function (toy_fn). That, however, was only an example, and my real-life application does not involve computing the mean, sum, median and sd. Instead, it's a function that compares the data to reference values in another database (it uses RODBC to connect to the other database and obtain updated reference values) and returns four values (in a named vector) that indicate the result of the comparison. I cannot call a single function to obtain each of these values.

Josh W. · Accepted Answer · 2015-05-25 18:08:48Z

3

You can also try this with do():

toy_df %>%
  group_by(group) %>%
  do(res = toy_fn(.$value))
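A variant that may avoid the list column altogether: when the expression inside do() is unnamed and returns a data frame, dplyr binds the per-group results together and adds the grouping column. The sketch below assumes a dplyr version in which do() is still available, and reuses `toy_fn` and `toy_df` from the question; `toy_summary_do` is just an illustrative name.

```r
library(dplyr)

toy_fn <- function(x) {
  y <- c(mean(x), sum(x), median(x), sd(x))
  names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
  y
}

set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))

toy_summary_do <- toy_df %>%
  group_by(group) %>%
  # unnamed do(): each group returns a one-row data frame, which dplyr row-binds
  do(data.frame(as.list(toy_fn(.$value)))) %>%
  ungroup()
```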

answered May 25, 2015 at 18:08

Josh W.

8 Comments

Josh W. Over a year ago

I tested it on my computer - it does work, the resulting data frame does take some parsing though.

Hernando Casas Over a year ago

What kind of parsing? I couldn't look at it carefully because I checked it out on my phone.

Josh W. Over a year ago

The result is a tbl_df of the form:

group | res
------|------
a     | <dbl>
b     | <dbl>

You can extract the value for the first result (if you assigned the above value to df) with df$res[1]

bergant Over a year ago

if df1 is a result, try cbind(df1$group, do.call(rbind, df1$res))

Henrik Over a year ago

@HernandoCasas If you are willing to load another package, you may simply add tidyr::unnest() to the code from Josh W. The 'res' variable here is a list column which can be 'unnested' using unnest().
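For readers who want to stay in base R, the flattening suggested in the comments above can be sketched as follows. `df1` here is a hand-built stand-in for the do() result, with made-up values; using data.frame() instead of cbind() is a deliberate swap so that the numeric columns are not coerced to character alongside the group labels.

```r
# Stand-in for the result of do(res = toy_fn(.$value)): a group column plus
# a list column of named vectors (values are made up for illustration).
df1 <- data.frame(group = c("a", "b"), stringsAsFactors = FALSE)
df1$res <- list(
  c(Right = 0.5, Wrong = 20, Unanswered = 0.6, Invalid = 0.3),
  c(Right = 0.4, Wrong = 15, Unanswered = 0.5, Invalid = 0.2)
)

# rbind the vectors into a matrix, then attach the group labels as a
# separate column so the four value columns stay numeric.
flat <- data.frame(group = df1$group, do.call(rbind, df1$res))
```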

bergant · Accepted Answer · 2015-05-25 15:59:47Z

3

This is not a dplyr solution, but if you like pipes:

library(magrittr)

toy_summary <-
  toy_df %>% 
  split(.$group) %>% 
  lapply( function(x) toy_fn(x$value) ) %>% 
  do.call(rbind, .)

# > head(toy_summary)
#         Right    Wrong Unanswered   Invalid
#   a 0.5038394 20.15358  0.5905526 0.2846468
#   b 0.5048040 15.64892  0.5163702 0.2994544
#   c 0.5029442 21.62660  0.5072733 0.2465612
#   d 0.5124601 14.86134  0.5382463 0.2681955
#   e 0.4649483 17.66804  0.4426197 0.3075080
#   f 0.5622644 12.36982  0.6330269 0.2850609
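One detail worth knowing: `do.call(rbind, ...)` on a list of named vectors returns a numeric matrix with the group labels as row names, not a data frame. If the result has to go back into a dplyr chain, a conversion along these lines should work. This is a base R sketch on made-up data, and `to_df` is a hypothetical helper name.

```r
toy_fn <- function(x) {
  y <- c(mean(x), sum(x), median(x), sd(x))
  names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
  y
}

small_df <- data.frame(group = c("a", "a", "b", "b"),
                       value = c(1, 3, 2, 6))

# Same split / lapply / rbind idea as above, written without pipes
mat <- do.call(rbind,
               lapply(split(small_df, small_df$group),
                      function(x) toy_fn(x$value)))

# Hypothetical helper: move the row names into a regular `group` column
to_df <- function(m) {
  data.frame(group = rownames(m), m,
             row.names = NULL, stringsAsFactors = FALSE)
}

summary_df <- to_df(mat)
```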

edited May 25, 2015 at 15:59

answered May 25, 2015 at 15:53

bergant

2 Comments

Hernando Casas Over a year ago

Many thanks. I like it very much. I was looking for a dplyr-based solution because I have to call this function after a bunch of full_join, filtering and other data wrangling done using dplyr. So it seems natural to use dplyr as well. But this is cool and works perfectly.

bergant Over a year ago

@HernandoCasas You can combine dplyr functions before or after this chain, but not in the middle of it. (Note that do.call(rbind, .) returns a matrix, so convert it with as.data.frame() before applying further dplyr functions to the output.)

talat · Accepted Answer · 2015-05-25 16:43:46Z

3

Apparently there's a problem when using median (not sure what's going on there) but apart from that you can normally use an approach like the following with summarise_each to apply multiple functions. Note that you can specify the names of resulting columns by using a named vector as input to funs_():

x <- c(Right = "mean", Wrong = "sd", Unanswered = "sum")

toy_df %>% 
  group_by(group) %>% 
  summarise_each(funs_(x), value)

#Source: local data frame [26 x 4]
#
#   group     Right     Wrong Unanswered
#1      a 0.5038394 0.2846468   20.15358
#2      b 0.5048040 0.2994544   15.64892
#3      c 0.5029442 0.2465612   21.62660
#4      d 0.5124601 0.2681955   14.86134
#5      e 0.4649483 0.3075080   17.66804
#6      f 0.5622644 0.2850609   12.36982
#7      g 0.4675324 0.2746589   14.96104
#8      h 0.4921506 0.2879830   21.16248
#9      i 0.5443600 0.2945428   22.31876
#10     j 0.5276048 0.3236814   20.57659
#..   ...       ...       ...        ...
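`summarise_each` and `funs_` were deprecated in later dplyr releases. As a hedged sketch of the same idea for newer versions (assuming dplyr 1.0.0 or later, where `across()` exists), a named list of functions plays the role of the named vector, and `.names = "{.fn}"` keeps the list names as column names. `toy_df` is the data frame from the question; `toy_summary_across` is an illustrative name.

```r
library(dplyr)

set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))

toy_summary_across <- toy_df %>%
  group_by(group) %>%
  summarise(across(value,
                   list(Right = mean, Wrong = sd, Unanswered = sum),
                   .names = "{.fn}"))  # name columns after the list names only
```

Like the original, this still applies one function per output column, so it does not cover a single function that returns all values at once.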

answered May 25, 2015 at 16:43

talat

5 Comments

Henrik Over a year ago

I don't think you need funs_ here. A "character vector of function names" should be enough. See e.g. the summarise_each(c("min", "max")) example. Weird indeed with median.

Hernando Casas Over a year ago

Thanks. For this particular example it works very nicely. But in my real application I cannot call a different function for each of the values that I need to compute. It's my fault anyway. I wasn't clear enough that the function in the post was just there to provide a reproducible example; the function that I need to call on each group is much more complex than a few calls to mean, median, etc. Also, it's a function that I cannot change.

talat Over a year ago

@HernandoCasas, no problem, it was clear that it was only an example. Maybe you can provide the function or clarify more precisely what you need

Hernando Casas Over a year ago

Unfortunately I cannot post the function here, and I'm not sure it would help either. It's a function that compares the data to reference values in another database (it uses RODBC to connect to the other database and obtain updated reference values that change daily) and returns four values (in a named vector) that indicate the result of the comparison. But I cannot call a separate function to obtain each of these values (so I cannot use summarise_each).

Henrik Over a year ago

@docendodiscimus I posted an issue about "median"

Erwan LE PENNEC · Accepted Answer · 2016-10-06 16:57:28Z

1

Using list(as_tibble(as.list(...))) followed by unnest() from tidyr does the trick:

toy_summary2 <- toy_df %>% group_by(group) %>% 
  summarize(Col = list(as_tibble(as.list(toy_fn(value))))) %>% 
  unnest(Col)  # name the list column explicitly; newer tidyr versions warn when it is omitted
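In newer dplyr releases (1.0.0 and later, which postdates this answer and is an assumption about the installed version), `summarise()` also accepts an unnamed expression that returns a data frame and unpacks it into columns, so the list column and unnest() step may not be needed. A minimal sketch, reusing `toy_fn` and `toy_df` from the question; `toy_summary3` is an illustrative name.

```r
library(dplyr)

toy_fn <- function(x) {
  y <- c(mean(x), sum(x), median(x), sd(x))
  names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
  y
}

set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))

toy_summary3 <- toy_df %>%
  group_by(group) %>%
  # unnamed data-frame result: each element of the named vector becomes a column
  summarise(data.frame(as.list(toy_fn(value))))
```

This calls `toy_fn` once per group, which is what the question was after.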

answered Oct 6, 2016 at 16:57

Erwan LE PENNEC
