Here's my problem:

I am using a function that returns a named vector. Here's a toy example:

toy_fn <- function(x) {
    y <- c(mean(x), sum(x), median(x), sd(x))
    names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
    y
}

I am using group_by in dplyr to apply this function for each group (typical split-apply-combine). So, here's my toy data.frame:

set.seed(1234567)
toy_df <- data.frame(id = 1:1000, 
                     group = sample(letters, 1000, replace = TRUE), 
                     value = runif(1000))

And here's the result I am aiming for:

library(dplyr)  # provides %>%, group_by() and summarize()
toy_summary <- 
    toy_df %>% 
    group_by(group) %>% 
    summarize(Right = toy_fn(value)["Right"], 
              Wrong = toy_fn(value)["Wrong"], 
              Unanswered = toy_fn(value)["Unanswered"], 
              Invalid = toy_fn(value)["Invalid"])

> toy_summary
Source: local data frame [26 x 5]

   group     Right    Wrong Unanswered   Invalid
1      a 0.5038394 20.15358  0.5905526 0.2846468
2      b 0.5048040 15.64892  0.5163702 0.2994544
3      c 0.5029442 21.62660  0.5072733 0.2465612
4      d 0.5124601 14.86134  0.5382463 0.2681955
5      e 0.4649483 17.66804  0.4426197 0.3075080
6      f 0.5622644 12.36982  0.6330269 0.2850609
7      g 0.4675324 14.96104  0.4692404 0.2746589

It works! But calling the same function four times is wasteful. I would rather have dplyr take the named vector and create a new variable for each of its elements. Something like this:

toy_summary <- 
    toy_df %>% 
    group_by(group) %>% 
    summarize(toy_fn(value))

This, unfortunately, does not work because "Error: expecting a single value".

I thought, OK, let's just convert the vector to a data.frame using data.frame(as.list(x)). But this does not work either. I tried many things, but I couldn't trick dplyr into thinking it is actually receiving a single value (observation) for four different variables. Is there any way to help dplyr realize that?

5 Answers

6

One possible solution is to use dplyr's standard evaluation (SE) capabilities. For example, define the summaries as a named list of formulas:

dots <- setNames(list(~ mean(value),
                      ~ sum(value),
                      ~ median(value),
                      ~ sd(value)),
                 c("Right", "Wrong", "Unanswered", "Invalid"))

Then, you can use summarize_ (with a _) as follows

toy_df %>% 
  group_by(group) %>% 
  summarize_(.dots = dots)
# Source: local data table [26 x 5]
# 
#    group     Right    Wrong Unanswered   Invalid
# 1      o 0.4490776 17.51403  0.4012057 0.2749956
# 2      s 0.5079569 15.23871  0.4663852 0.2555774
# 3      x 0.4620649 14.78608  0.4475117 0.2894502
# 4      a 0.5038394 20.15358  0.5905526 0.2846468
# 5      t 0.5041168 24.19761  0.5330790 0.3171022
# 6      m 0.4806628 21.14917  0.4805273 0.2825026
# 7      c 0.5029442 21.62660  0.5072733 0.2465612
# 8      w 0.4932484 17.75694  0.4891746 0.3309680
# 9      q 0.5350707 22.47297  0.5608505 0.2749941
# 10     g 0.4675324 14.96104  0.4692404 0.2746589
# ..   ...       ...      ...        ...       ...

Though it looks nice, there is a big catch: you have to know in advance which column you are going to operate on (here, value) when setting up dots, so it won't work on a column with a different name unless you rebuild dots accordingly.
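One way around hard-coding the column is to pass its name as a string. The following is only a sketch, assuming dplyr 1.0.0 or later (for across() and all_of()); summarise_col is a hypothetical helper, not part of the original answer, and like dots it spells out the four statistics rather than calling toy_fn.

```r
library(dplyr)

# Hypothetical helper (an assumption for illustration): summarise one column,
# identified by a string, with the four statistics from the toy example.
summarise_col <- function(df, col) {
  df %>%
    group_by(group) %>%
    summarise(
      across(
        all_of(col),
        list(Right = mean, Wrong = sum, Unanswered = median, Invalid = sd),
        .names = "{.fn}"  # name the output columns after the list names only
      ),
      .groups = "drop"
    )
}

# Toy data from the question
set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))

res <- summarise_col(toy_df, "value")
head(res)
```

Because the column is selected with all_of(col), the same helper can be pointed at any numeric column without rebuilding a list of formulas.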


As a bonus, here's a simple data.table solution that uses your original function:

library(data.table)
setDT(toy_df)[, as.list(toy_fn(value)), by = group]
#     group     Right    Wrong Unanswered   Invalid
#  1:     o 0.4490776 17.51403  0.4012057 0.2749956
#  2:     s 0.5079569 15.23871  0.4663852 0.2555774
#  3:     x 0.4620649 14.78608  0.4475117 0.2894502
#  4:     a 0.5038394 20.15358  0.5905526 0.2846468
#  5:     t 0.5041168 24.19761  0.5330790 0.3171022
#  6:     m 0.4806628 21.14917  0.4805273 0.2825026
#  7:     c 0.5029442 21.62660  0.5072733 0.2465612
#  8:     w 0.4932484 17.75694  0.4891746 0.3309680
#  9:     q 0.5350707 22.47297  0.5608505 0.2749941
# 10:     g 0.4675324 14.96104  0.4692404 0.2746589
#...
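The data.table call above works because as.list() turns the named vector into a named list, and data.table treats each element of a list returned in j as a separate column. A minimal sketch of that step follows; it uses as.data.table() on a copy instead of setDT(), which is a choice made here (an assumption about what you want) so that toy_df itself is not converted by reference.

```r
library(data.table)

# Function and data from the question
toy_fn <- function(x) {
  y <- c(mean(x), sum(x), median(x), sd(x))
  names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
  y
}

set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))

# as.list() keeps the names, giving one list element per future column
l <- as.list(toy_fn(c(1, 2, 3, 4)))
str(l)

# Same idea per group; as.data.table() copies, so toy_df stays a data.frame
res <- as.data.table(toy_df)[, as.list(toy_fn(value)), by = group]
head(res)
```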

3 Comments

Nice use of data.table. The dplyr-based solution you propose does not work for me because I cannot modify the function. I like the data.table way very much, although I was looking for a dplyr-based solution because I have to call this function after a bunch of full_join calls, filtering and other data wrangling done with dplyr. So it seems natural to use dplyr as well.
What do you mean by "I cannot modify the function"?
I meant that I cannot set up the function as you suggest, because you create an object with a formula for each of the return values of my example function (toy_fn). That, however, was only an example, and my real-life application does not involve computing the mean, sum, median and sd. Instead, it's a function that compares the data to reference values in another database (it uses RODBC to connect to the other database and obtain updated reference values) and returns four values (in a named vector) that indicate the result of the comparison. I cannot call a separate function to obtain each of these values.
3

You can also try this with do():

toy_df %>%
  group_by(group) %>%
  do(res = toy_fn(.$value))
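If the list column is inconvenient, a common variation is to have do() return a one-row data frame per group, so that the results are row-bound into ordinary columns. This is a sketch under the assumption that do() is available in your dplyr version (it is superseded in recent releases but still exported):

```r
library(dplyr)

# Function and data from the question
toy_fn <- function(x) {
  y <- c(mean(x), sum(x), median(x), sd(x))
  names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
  y
}

set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))

toy_summary <- toy_df %>%
  group_by(group) %>%
  # An unnamed data frame result is bound row-wise, one row per group
  do(data.frame(as.list(toy_fn(.$value)))) %>%
  ungroup()

head(toy_summary)
```

Here toy_fn is called once per group and is left unmodified, which matches the constraint described in the question.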

8 Comments

I tested it on my computer - it does work, the resulting data frame does take some parsing though.
What kind of parsing? I couldn't look at it carefully because I checked it on my phone.
The result is a tbl_df of the form:

group | res
------|------
a     | <dbl>
b     | <dbl>

You can extract the value for the first result (if you assigned the above value to df) with df$res[1]
if df1 is a result, try cbind(df1$group, do.call(rbind, df1$res))
@HernandoCasas If you are willing to load another package, you may simply add tidyr::unnest() to the code from Josh W. The 'res' variable here is a list column which can be 'unnested' using unnest().
3

This is not a dplyr solution, but if you like pipes:

library(magrittr)

toy_summary <-
  toy_df %>% 
  split(.$group) %>% 
  lapply( function(x) toy_fn(x$value) ) %>% 
  do.call(rbind, .)

# > head(toy_summary)
#         Right    Wrong Unanswered   Invalid
#   a 0.5038394 20.15358  0.5905526 0.2846468
#   b 0.5048040 15.64892  0.5163702 0.2994544
#   c 0.5029442 21.62660  0.5072733 0.2465612
#   d 0.5124601 14.86134  0.5382463 0.2681955
#   e 0.4649483 17.66804  0.4426197 0.3075080
#   f 0.5622644 12.36982  0.6330269 0.2850609      
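Note that do.call(rbind, .) returns a matrix with the groups as row names, not a data.frame. If the result has to go back into a dplyr chain, one possible follow-up step (a sketch, not part of the original answer) is to move the row names into a regular column:

```r
library(magrittr)

# Function and data from the question
toy_fn <- function(x) {
  y <- c(mean(x), sum(x), median(x), sd(x))
  names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
  y
}

set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))

# Matrix with one row per group, as in the answer above
by_group <- toy_df %>%
  split(.$group) %>%
  lapply(function(x) toy_fn(x$value)) %>%
  do.call(rbind, .)

# Turn the row names into a proper 'group' column
toy_summary <- data.frame(group = rownames(by_group), by_group, row.names = NULL)
head(toy_summary)
```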

2 Comments

Many thanks. I like it very much. I was looking for a dplyr-based solution because I have to call this function after a bunch of full_join, filtering and other data wrangling done using dplyr. So it seems natural to use dplyr as well. But this is cool and works perfectly.
@HernandoCasas You can combine dplyr functions before or after this chain (because the input and output is a data.frame). But you can't use it in the middle of the sequence.
3

Apparently there's a problem when using median (not sure what's going on there) but apart from that you can normally use an approach like the following with summarise_each to apply multiple functions. Note that you can specify the names of resulting columns by using a named vector as input to funs_():

x <- c(Right = "mean", Wrong = "sd", Unanswered = "sum")

toy_df %>% 
  group_by(group) %>% 
  summarise_each(funs_(x), value)

#Source: local data frame [26 x 4]
#
#   group     Right     Wrong Unanswered
#1      a 0.5038394 0.2846468   20.15358
#2      b 0.5048040 0.2994544   15.64892
#3      c 0.5029442 0.2465612   21.62660
#4      d 0.5124601 0.2681955   14.86134
#5      e 0.4649483 0.3075080   17.66804
#6      f 0.5622644 0.2850609   12.36982
#7      g 0.4675324 0.2746589   14.96104
#8      h 0.4921506 0.2879830   21.16248
#9      i 0.5443600 0.2945428   22.31876
#10     j 0.5276048 0.3236814   20.57659
#..   ...       ...       ...        ...

5 Comments

I don't think you need funs_ here. A "character vector of function names" should be enough. See e.g. the summarise_each(c("min", "max")) example. Weird indeed with median.
Thanks. For this particular example it works very nicely. But in my real application I cannot call a different function for each of the values that I need to compute. It's my fault anyway. I wasn't clear enough that the function I put in the post was just there to have a reproducible example; the function that I need to call on each group is much more complex than a few calls to mean, median, etc. Also, it's a function that I cannot change.
@HernandoCasas, no problem, it was clear that it was only an example. Maybe you can provide the function or clarify more precisely what you need
Unfortunately I cannot post the function here, and I'm not sure it would help either. It's a function that compares the data to reference values in another database (it uses RODBC to connect to the other database and obtain updated reference values that change daily) and returns four values (in a named vector) that indicate the result of the comparison. But I cannot call a separate function to obtain each of these values (so I cannot use summarise_each).
@docendodiscimus I posted an issue about "median"
1

Using the sequence list(as_tibble(as.list(...))) followed by unnest() from tidyr does the trick:

toy_summary2 <- toy_df %>%
  group_by(group) %>%
  summarize(Col = list(as_tibble(as.list(toy_fn(value))))) %>%
  unnest(Col)  # name the list column explicitly when unnesting
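In newer dplyr versions the list column may not be needed at all. As a sketch, assuming dplyr 1.0.0 or later (where an unnamed data frame result in summarise() is unpacked into columns), the function can be called once per group directly:

```r
library(dplyr)

# Function and data from the question
toy_fn <- function(x) {
  y <- c(mean(x), sum(x), median(x), sd(x))
  names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
  y
}

set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))

toy_summary <- toy_df %>%
  group_by(group) %>%
  # The one-row tibble is unnamed, so its columns are spliced into the result
  summarise(as_tibble(as.list(toy_fn(value))), .groups = "drop")

head(toy_summary)
```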
