Using dplyr summarize with different operations for multiple columns

Question

Well, I know that there are already tons of related questions, but none gave an answer to my particular need.

I want to use dplyr "summarize" on a table with 50 columns, and I need to apply different summary functions to these.

"Summarize_all" and "summarize_at" both seem to have the disadvantage that it's not possible to apply different functions to different subgroups of variables.

As an example, let's assume the iris dataset had 50 columns, so we do not want to address columns by name. I want the sum over the first two columns, the mean over the third, and the first value for all remaining columns (after a group_by(Species)). How could I do this?

Not sure I understand correctly, but maybe reference the columns directly by number, like here, or extract the colnames and use those?
– R. Prost, Commented Feb 23, 2018 at 9:09
Welcome to Stack Overflow. To get help here, please consider how to write a reproducible example. Thank you.
– jay.sf, Commented Feb 23, 2018 at 9:24
What's with people just repeating guidelines verbatim? The question is quite clear.
– zola25, Commented Dec 18, 2018 at 10:23

Agile Bean · Accepted Answer · 2020-05-20 09:03:13Z

18

Fortunately, there is a much simpler way available now. With dplyr 1.0.0 (about to be released at the time of writing), you can use the across() function for this purpose.

All you need to type is:

library(dplyr)

iris %>%
  group_by(Species) %>%
  summarize(
    # sum over the first two columns
    across(c(1, 2), sum),
    # mean over the third column
    across(3, mean),
    # first value of all remaining columns (the grouping column is excluded)
    across(-c(1:3), first)
  )

Great, isn't it? At first I thought across() was unnecessary because the scoped variants worked just fine, but this use case is exactly where across() is beneficial.

You can install the development version of dplyr with devtools::install_github("tidyverse/dplyr").
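For tables where column positions are not stable, the same pattern can be written with select helpers instead of indexes. The following is only a sketch, assuming dplyr 1.0.0 or later; the choice of starts_with("Sepal") and the negated selection are illustrative and not part of the original answer.

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  summarise(
    # sum over every column whose name starts with "Sepal"
    across(starts_with("Sepal"), sum),
    # mean over one named column
    across(Petal.Length, mean),
    # first value of everything not selected above
    across(!c(starts_with("Sepal"), Petal.Length), first)
  )

res
```

Each across() call keeps the original column names, so the later selections still refer to the same columns.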

Fustincho · Accepted Answer · 2018-02-28 14:53:07Z

5

As others have mentioned, this is normally done by calling summarize_each / summarize_at / summarize_if once for every group of columns that you want to apply a summary function to. As far as I know, you have to write a custom function that summarizes each subset. You can, for example, name the columns in such a way that the select helpers (e.g. contains()) pick out just the columns you want to apply a function to. If not, you can specify the column numbers that you want to summarize.

For the example you mentioned, you could try the following:

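library(dplyr)

# Summarize three sets of columns of a grouped tibble with three different
# functions, then bind the results side by side.
summarizer <- function(tb, colsone, colstwo, colsthree,
                       funsone, funstwo, funsthree, group_name) {
  bind_cols(
    summarize_all(select(tb, colsone), .funs = funsone),
    # drop the grouping column from the later pieces so it appears only once
    summarize_all(select(tb, colstwo), .funs = funstwo) %>%
      ungroup() %>% select(-matches(group_name)),
    summarize_all(select(tb, colsthree), .funs = funsthree) %>%
      ungroup() %>% select(-matches(group_name))
  )
}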

# With column names (as_tibble() replaces the deprecated as.tibble())
iris %>% as_tibble() %>%
  group_by(Species) %>%
  summarizer(colsone = contains("Sepal"),
             colstwo = matches("Petal.Length"),
             colsthree = c(-contains("Sepal"), -matches("Petal.Length")),
             funsone = "sum",
             funstwo = "mean",
             funsthree = "first",
             group_name = "Species")

# With column indexes
iris %>% as_tibble() %>%
  group_by(Species) %>%
  summarizer(colsone = 1:2,
             colstwo = 3,
             colsthree = 4,
             funsone = "sum",
             funstwo = "mean",
             funsthree = "first",
             group_name = "Species")

CodingButStillAlive Over a year ago

Great! That helped me a lot and worked perfectly. Thanks!!

dez93_2000 Over a year ago

Note for others: to pass additional arguments to the summary functions, add them to the function call, e.g. ".funs = funsone, na.rm = TRUE),"
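To illustrate the comment above, here is a minimal sketch of passing na.rm through to the summary function. The toy data frame df and its columns g, x, and y are hypothetical and not from the thread. The second variant assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data with missing values
df <- tibble(
  g = c("a", "a", "b", "b"),
  x = c(1, NA, 3, 4),
  y = c(10, 20, NA, 40)
)

# Scoped variant: arguments after the function are passed on to it
res_scoped <- df %>%
  group_by(g) %>%
  summarise_at(vars(x, y), sum, na.rm = TRUE)

# across() variant: a lambda makes the extra argument explicit
res_across <- df %>%
  group_by(g) %>%
  summarise(across(c(x, y), ~ sum(.x, na.rm = TRUE)))
```

Both calls should give the same per-group sums with the missing values ignored.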

user8054146 · Accepted Answer · 2018-02-23 09:31:29Z

1

You could summarise the data with each function separately and then join the data later if needed.

So something like this for the iris example:

library(dplyr)

sums <- iris %>% group_by(Species) %>% summarise_at(1:2, sum)
means <- iris %>% group_by(Species) %>% summarise_at(3, mean)
firsts <- iris %>% group_by(Species) %>% summarise_at(4, first)
# join on the grouping column explicitly
full_join(sums, means, by = "Species") %>% full_join(firsts, by = "Species")

Though I would try to think of something else if there are more than a handful of summarising functions you need to use.
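If the number of summaries grows, one possible way to keep this approach manageable is to collect the pieces in a list and fold the joins with base R's Reduce(). This is only a sketch of that idea. The names grouped, summaries, and result are illustrative.

```r
library(dplyr)

grouped <- iris %>% group_by(Species)

# One summary per function, each keeping the grouping column
summaries <- list(
  summarise_at(grouped, 1:2, sum),
  summarise_at(grouped, 3, mean),
  summarise_at(grouped, 4, first)
)

# Join all pieces pairwise on the grouping column
result <- Reduce(function(a, b) full_join(a, b, by = "Species"), summaries)

result
```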

tushaR · Accepted Answer · 2018-02-23 09:40:03Z

0

Try this:

library(plyr)
library(dplyr)

dataframe <- data.frame(var = c(1, 1, 1, 2, 2, 2), var2 = c(10, 9, 8, 7, 6, 5),
                        var3 = c(2, 3, 4, 5, 6, 7), var4 = c(5, 5, 3, 2, 4, 2))
dataframe

#  var var2 var3 var4
#1   1   10    2    5
#2   1    9    3    5
#3   1    8    4    3
#4   2    7    5    2
#5   2    6    6    4
#6   2    5    7    2

# functions and the column positions they apply to, paired by position
funnames <- c(sum, mean, first)
colnums <- c(2, 3, 4)
ddply(.data = dataframe, .variables = "var",
    function(x, funcs, inds) {
        mapply(function(func, ind) {
            func(x[, ind])
        }, funcs, inds)
    }, funnames, colnums)

#  var V1 V2 V3
#1   1 27  3  5
#2   2 18  6  2
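The same pairing of functions with columns can also be sketched without plyr, using dplyr's group_modify() and base R's Map(). This is an alternative under stated assumptions, not the answerer's method: it assumes dplyr 0.8.1 or later, and the named list funs is a hypothetical helper that maps column names to functions.

```r
library(dplyr)

dataframe <- data.frame(var = c(1, 1, 1, 2, 2, 2), var2 = c(10, 9, 8, 7, 6, 5),
                        var3 = c(2, 3, 4, 5, 6, 7), var4 = c(5, 5, 3, 2, 4, 2))

# Hypothetical helper: one summary function per column name
funs <- list(var2 = sum, var3 = mean, var4 = first)

result <- dataframe %>%
  group_by(var) %>%
  group_modify(function(d, key) {
    # apply each function to its column; Map() keeps the names of funs
    as_tibble(Map(function(f, col) f(d[[col]]), funs, names(funs)))
  })

result
```

Unlike the ddply() version, the output columns keep their original names instead of V1, V2, V3.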

dez93_2000 · Accepted Answer · 2020-05-08 00:08:29Z

-1

See this - feature coming soon
