Summarize different Columns with different Functions

Question

I have the following Problem: In a data frame I have a lot of rows and columns with the first row being the date. For each date I have more than 1 observation and I want to summarize them.

My df looks like that (date replaced by ID for ease of use):

df:
ID     Cash    Price    Weight   ...
1      0.4     0        0
1      0.2     0        82       ...
1      0       1        0        ...
1      0       3.2      80       ...
2      0.3     1        70       ...
...    ...     ...      ...      ...

I want to group them by the first column and then summarize all rows BUT with different functions:

The function Cash and Price should be sum so I get the sum of Cash and Price for each ID. The function on Weight should be max so I only get the maximum weight for the ID.

Because I have so many columns I can not write a all functions by hand, but I have only 2 columns which should be summarized by max the rest should be summarized by sum.

So I am looking for a function to group by ID, summarize all with sum except 2 different columns which I need the max value.

I tried to use the dplyr package with:

df %>% group_by(ID = tolower(ID)) %>% summarise_each(funs(sum))

But I need the addition to not sum but max the 2 specified columns, any Ideas?

To be clear, the output of the example df should be:

ID     Cash     Price    Weight
1       0.6        4.2       82     
2       0.3        1          70

Edward Cox · Accepted Answer · 2021-02-11 14:24:09Z

7

As of dplyr 1.0.0 you can use across():

tribble(
  ~ID, ~max1, ~max2, ~sum1, ~sum2, ~sum3,
  1, 1, 1, 1, 2, 3,
  1, 2, 3, 1, 2, 3,
  2, 1, 1, 1, 2, 3,
  2, 3, 4, 2, 3, 4,
  3, 1, 1, 1, 2, 3,
  3, 4, 5, 3, 4, 5,
  3, NA, NA, NA, NA, NA
) %>%
  group_by(ID) %>%
  summarize(
    across(matches("max1|max2"), max, na.rm = T),
    across(!matches("max1|max2"), sum, na.rm = T)
  )

# ID  max1  max2  sum1  sum2  sum3
#  1     2     3     2     4     6
#  2     3     4     3     5     7
#  3     4     5     4     6     8

answered Feb 11, 2021 at 14:24

Edward Cox

711 silver badge1 bronze badge

Sign up to request clarification or add additional context in comments.

1 Comment

Steve Powell Over a year ago

just to point out you could also replace the second across line with this: across(everything(), sum, na.rm = T)

akrun · Accepted Answer · 2016-04-24 11:58:26Z

4

We can use

 df %>%
    group_by(ID) %>%
    summarise(Cash = sum(Cash), Price = sum(Price), Weight = max(Weight))

If we have many columns, one way would be to do this separately and then join the output together.

 df1 <- df %>% 
          group_by(ID) %>% 
          summarise_each(funs(sum), Cash:Price)
 df2 <- df %>%
          group_by(ID) %>% 
          summarise_each(funs(max), Weight)
 inner_join(df1, df2, by = "ID")
 #      ID  Cash Price Weight
 #   (int) (dbl) (dbl)  (int)
 #1     1   0.6   4.2     82
 #2     2   0.3   1.0     70

edited Apr 24, 2016 at 11:58

answered Apr 24, 2016 at 11:51

akrun

891k38 gold badges590 silver badges700 bronze badges

Comments

Kunal Puri · Accepted Answer · 2016-04-24 12:05:36Z

4

library(data.table)

setDT(df)

df[,.(Cash = sum(Cash),Price = sum(Price),Weight = max(Weight)),by=ID]

One way of doing this for +90 columns can be:

max_col <- 'Weight'

sum_col <- setdiff(colnames(df),max_col)

query_1 <- paste0(sum_col,' = sum(',sum_col,')')

query_2 <- paste0(max_col,' = max(',max_col,')')

query_3 <- paste(query_1,collapse=',')

query_4 <- paste(query_2,collapse=',')

query_5 <- paste(query_3,query_4,sep=',')

final_query <- paste0('df[,.(',query_5,'),by = ID]')

eval(parse(text = final_query))

edited Apr 24, 2016 at 12:05

answered Apr 24, 2016 at 11:50

Kunal Puri

3,4271 gold badge12 silver badges22 bronze badges

3 Comments

Max Over a year ago

Thanks, this seems to work! Is there an addition to apply sum to all columns except 'Weight' and only use max for the column 'Weight' and another one? Because I have 90+ columns, this would be a pain to write :)

Kunal Puri Over a year ago

@Max That's one of the ways you can accomplish the task using data.table

tlhenvironment Over a year ago

still haven't found another way to do this

hrbrmstr · Accepted Answer · 2016-04-24 12:06:44Z

Or do it w/o the double groups:

library(dplyr)

set.seed(1492)
df <- data.frame(id=rep(c(1,2), 3),
                 cash=rnorm(6, 0.5, 0.1),
                 price=rnorm(6, 0.5, 0.1)*6,
                 weight=sample(100, 6))

df

##   id      cash    price weight
## 1  1 0.4410152 2.484082     10
## 2  2 0.4101343 3.032529     93
## 3  1 0.3375889 2.305076     58
## 4  2 0.6047922 3.248851     55
## 5  1 0.4721711 3.209930     34
## 6  2 0.5362493 2.331530     99

custom_summarise <- function(do_df) {

  return(bind_cols(
    summarise_each(select(do_df, -weight), funs(sum)),
    summarise_each(select(do_df, weight), funs(max))
  ))

}

group_by(df, id) %>% do(custom_summarise(.))

## Source: local data frame [2 x 4]
## Groups: id [2]
## 
##      id     cash    price weight
##   (dbl)    (dbl)    (dbl)  (int)
## 1     3 1.250775 7.999089     58
## 2     6 1.551176 8.612910     99

Lala La · Accepted Answer · 2019-08-05 15:16:56Z

Here is a solution based on this comment on an issue on dplyr repo. I think it's very general to be applied to more complicated cases.

library(tidyverse)
df <- tribble(
      ~ID,   ~Cash,  ~Price,  ~Weight,
      #----------------------
          'a',      4,   6,   8,
          'a',      7,   3,   0,
          'a',      7,   9,   0,
          'b',      2,   8,   8,
          'b',      5,   1,   8,
          'b',      8,   0,   1,
          'c',      2,   1,   1,
          'c',      3,   8,   0,
          'c',      1,   9,   1
     )

out <- list(.vars=lst(vars(-Weight), vars(Weight)),
            .funs=lst(sum, max))%>% 
  pmap(~df%>%group_by(ID)%>%summarise_at(.x, .y)) %>% 
  reduce(inner_join)
out
# A tibble: 3 x 4
#   ID     Cash Price Weight
#   <chr> <dbl> <dbl>  <dbl>
# 1 a        18    18      8
# 2 b        15     9      8
# 3 c         6    18      1

You should specify the vars in the first lst (e.g. vars(-Weight), vars(Weight)) and respective function to be applied in the lst (sum, max). The .x in the summarise_at argument refers to elements in the variable lst, and .y refers to the elements in the function lst.

Collectives™ on Stack Overflow

Summarize different Columns with different Functions

5 Answers 5

1 Comment

Comments

3 Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

1 Comment

Comments

3 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related