R aggregate with multiple arguments in function

Question

I'm trying to avoid a time-consuming for loop by using aggregate on a data.frame, but I need the values of one of the columns to enter into the final computation.

dat <- data.frame(key = c('a', 'b', 'a','b'), 
rate = c(0.5,0.4,1,0.6), 
v1 = c(4,0,3,1), 
v2 = c(2,0,9,4))

>dat
  key rate v1 v2
1   a  0.5  4  2
2   b  0.4  0  0
3   a  1.0  3  9
4   b  0.6  1  4

aggregate(dat[,-1], list(key=dat$key),  
    function(x, y=dat$rate){
        rates <- as.numeric(y)
        values <- as.numeric(x)
        return(sum(values*rates)/sum(rates))
    })

Note: The function is just an example!
The problem with this implementation is that y = dat$rate supplies all 4 rates in dat, when what I want is just the 2 rates that belong to each group! Any suggestions on how I could do this? Thanks!

did either of these answers work out for you?

– A5C1D2H2I1M1N2O1R2T1, Commented Oct 31, 2012 at 10:08

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2012-10-26 05:43:51Z

5

Here's what I managed to achieve, using the "data.table" package:

library(data.table)
DT <- data.table(dat, key = "key")
DT[, list(v1 = sum(rate * v1)/sum(rate), v2 = sum(rate * v2)/sum(rate)), by = "key"]
#    key       v1       v2
# 1:   a 3.333333 6.666667
# 2:   b 0.600000 2.400000

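OK. So that's easy to write out for just two variables, but what about when we have a lot more columns? Use lapply(.SD, ...) in conjunction with your function. Because rate is itself a column of .SD, the result also contains a rate column (the rate-weighted mean of rate).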

First, some data:

set.seed(1)
dat <- data.frame(key = rep(c("a", "b"), times = 10),
                  rate = runif(20, min = 0, max = 1),
                  v1 = sample(10, 20, replace = TRUE),
                  v2 = sample(20, 20, replace = TRUE),
                  v3 = sample(30, 20, replace = TRUE),
                  x1 = sample(5, 20, replace = TRUE),
                  x2 = sample(6:10, 20, replace = TRUE),
                  x3 = sample(11:15, 20, replace = TRUE))
library(data.table)
datDT <- data.table(dat, key = "key")
datDT
#     key       rate v1 v2 v3 x1 x2 x3
#  1:   a 0.26550866 10 17 28  3  9 15
#  2:   a 0.57285336  7 16 14  2  7 13
#  3:   a 0.20168193  3 11 20  4  9 14
#  4:   a 0.94467527  1  1 15  4  6 13
#  5:   a 0.62911404  9 15  3  2 10 12
#  6:   a 0.20597457  5 10 11  2 10 13
#  7:   a 0.68702285  5  9 11  4  7 11
#  8:   a 0.76984142  9  2 15  4  6 15
#  9:   a 0.71761851  8  7 26  3  9 13
# 10:   a 0.38003518  8 14 24  5  8 15
# 11:   b 0.37212390  3 13  9  4  7 13
# 12:   b 0.90820779  2 12 10  2 10 11
# 13:   b 0.89838968  4 16  8  2  7 13
# 14:   b 0.66079779  4 10 23  1  8 12
# 15:   b 0.06178627  4 14 27  1  8 13
# 16:   b 0.17655675  6 18 26  1  9 11
# 17:   b 0.38410372  2  5 11  5  8 14
# 18:   b 0.49769924  7  2 27  4  6 13
# 19:   b 0.99190609  2 11 12  3  6 13
# 20:   b 0.77744522  5  9 29  4  9 13

Second, aggregate:

datDT[, lapply(.SD, function(x, y = rate) sum(y * x)/sum(y)), by = "key"]
#    key      rate       v1        v2       v3       x1       x2       x3
# 1:   a 0.6501303 6.335976  8.634691 15.75915 3.363832 7.658762 13.19152
# 2:   b 0.7375793 3.595585 10.749705 16.26582 2.792390 7.741787 12.57301
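Following the suggestions in the comments on this answer, here is a minimal sketch that uses weighted.mean and restricts .SD to the value columns with .SDcols, so that rate is used only as the weight and does not show up in the result. It uses the four-row data from the question; the name value_cols is only illustrative.

```r
library(data.table)

# Four-row example from the question
dat <- data.frame(key  = c("a", "b", "a", "b"),
                  rate = c(0.5, 0.4, 1, 0.6),
                  v1   = c(4, 0, 3, 1),
                  v2   = c(2, 0, 9, 4))
DT <- as.data.table(dat)

# Every column except the grouping key and the weight (illustrative name)
value_cols <- setdiff(names(DT), c("key", "rate"))

# weighted.mean(x, w) computes sum(x * w) / sum(w) for each column of .SD;
# rate is still visible inside j even though it is excluded from .SD
res <- DT[, lapply(.SD, weighted.mean, w = rate),
          by = key, .SDcols = value_cols]
res
```

The values should agree with the v1 and v2 columns of the two-variable result shown earlier in this answer.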

If you have a really large dataset, you might want to explore data.table in general.

For what it is worth, I was also successful in base R, but I'm not sure how efficient this would be, particularly because of the transposing and so on.

t(sapply(split(dat, dat[1]), 
         function(x, y = 3:ncol(dat)) {
           V1 <- vector()
           for (i in 1:length(y)) {
             V1[i] <- sum(x[2] * x[y[i]])/sum(x[2])
           }
           V1
         }))
#       [,1]      [,2]     [,3]     [,4]     [,5]     [,6]
# a 6.335976  8.634691 15.75915 3.363832 7.658762 13.19152
# b 3.595585 10.749705 16.26582 2.792390 7.741787 12.57301
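The same split-and-apply idea can be written without the explicit for loop. The sketch below is one possible variant, not a benchmarked improvement: it uses the four-row data from the question (named small_dat here for illustration) and lets weighted.mean do the arithmetic for every value column of each group.

```r
# Four-row example from the question (illustrative name)
small_dat <- data.frame(key  = c("a", "b", "a", "b"),
                        rate = c(0.5, 0.4, 1, 0.6),
                        v1   = c(4, 0, 3, 1),
                        v2   = c(2, 0, 9, 4))

# Columns to summarise: everything except the key and the weight
value_cols <- setdiff(names(small_dat), c("key", "rate"))

# split() gives one data.frame per key; inside each group the matching
# rates are available as g$rate, so no global lookup is needed
res <- t(sapply(split(small_dat, small_dat$key), function(g) {
  sapply(g[value_cols], weighted.mean, w = g$rate)
}))
res
```

The result is a matrix with one row per key and one named column per value column, so the column names are kept.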

edited Oct 26, 2012 at 5:43

answered Oct 26, 2012 at 5:36

A5C1D2H2I1M1N2O1R2T1


8 Comments

mnel Over a year ago

I reckon that datDT[, lapply(.SD, function(x,y) { sum(x * y) / sum(y)}, y = rate), by = key][,setdiff(names(datDT), 'rate'), with = F] does the trick, and is slightly easier to understand

mnel Over a year ago

You can even replace the anonymous function with weighted.mean if you want

A5C1D2H2I1M1N2O1R2T1 Over a year ago

@mnel, I was fully expecting your input here ;) I was about to hit you up in chat for advice since I've just started exploring data.table a few days ago.

A5C1D2H2I1M1N2O1R2T1 Over a year ago

@mnel, regarding use of weighted.mean I thought it would be best to keep it as is considering the title of the OP's question.

mnel Over a year ago

:). Appending [,setdiff(names(datDT), 'rate'), with = F] will remove the rate column - this column is not particularly meaningful


Paul Hiemstra · Accepted Answer · 2012-10-25 11:15:09Z

3

One solution is to use ddply from the plyr package:

library(plyr)
res = ddply(dat, .(key), summarise, result = sum(v1 * rate) / sum(rate))
> res
  key   result
1   a 3.333333
2   b 0.600000

If you want to apply this to all the v columns, I would recommend first changing your data structure a bit:

library(reshape2)  # assumption: melt() here comes from the reshape2 package
dat = melt(dat, id.vars = c("key", "rate"))
> dat
  key rate variable value
1   a  0.5       v1     4
2   b  0.4       v1     0
3   a  1.0       v1     3
4   b  0.6       v1     1
5   a  0.5       v2     2
6   b  0.4       v2     0
7   a  1.0       v2     9
8   b  0.6       v2     4

and then using ddply again:

res = ddply(dat, .(key, variable), summarise, result = sum(value * rate) / sum(rate))
> res
  key variable   result
1   a       v1 3.333333
2   a       v2 6.666667
3   b       v1 0.600000
4   b       v2 2.400000
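If the result is needed back in the original wide layout (one column per variable), the long summary can be cast back afterwards. This is a minimal sketch assuming the reshape2 package is used for both melt and dcast; the names long, res_long and res_wide are illustrative.

```r
library(plyr)
library(reshape2)

# Four-row example from the question
dat <- data.frame(key  = c("a", "b", "a", "b"),
                  rate = c(0.5, 0.4, 1, 0.6),
                  v1   = c(4, 0, 3, 1),
                  v2   = c(2, 0, 9, 4))

# Long format: one row per (key, rate, variable) combination
long <- melt(dat, id.vars = c("key", "rate"))

# Rate-weighted mean per key and variable
res_long <- ddply(long, .(key, variable), summarise,
                  result = sum(value * rate) / sum(rate))

# Back to wide format: one row per key, one column per variable
res_wide <- dcast(res_long, key ~ variable, value.var = "result")
res_wide
```

Only the small summary table is cast back, not the full melted data, so the reverse step is cheap even when the melted table is large.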

...or if you need a standard R solution, you can use by (run on the original, un-melted dat, since the function refers to the v1 column):

res = by(dat, list(dat$key), function(x) sum(x$v1 * x$rate) / sum(x$rate))
> res
: a
[1] 3.333333
------------------------------------------------------------ 
: b
[1] 0.6
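To stay in base R and cover every value column without reshaping, one option is to group row indices rather than the values themselves, so that the function can look up both the value and the matching rate for each group. The sketch below is only one way to do this; idx and value_cols are illustrative names.

```r
# Four-row example from the question
dat <- data.frame(key  = c("a", "b", "a", "b"),
                  rate = c(0.5, 0.4, 1, 0.6),
                  v1   = c(4, 0, 3, 1),
                  v2   = c(2, 0, 9, 4))

idx <- seq_len(nrow(dat))                        # row numbers 1..n
value_cols <- setdiff(names(dat), c("key", "rate"))

# For each value column, tapply() hands the function the row numbers of one
# key at a time; those index both the column and dat$rate consistently
res <- sapply(value_cols, function(col) {
  tapply(idx, dat$key, function(i) weighted.mean(dat[[col]][i], dat$rate[i]))
})
res
```

This addresses the problem in the question directly: aggregate passes only the grouped column to the function, whereas grouping the indices gives access to the rates of the same rows.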

edited Oct 25, 2012 at 11:15

answered Oct 25, 2012 at 10:07

Paul Hiemstra


5 Comments

essv Over a year ago

Thank you for your answer, but it is not exactly what I want!

essv Over a year ago

I have more columns, so this solution needs to be replicated for all V columns! With a cbind at the end (?)

Paul Hiemstra Over a year ago

I extended my answer to include this additional requirement.

essv Over a year ago

Trying to be more specific: Each V column must be the result of a function which depends on V itself and the rate column, grouped by the key values.

essv Over a year ago

I was trying to avoid that reorganization of the data, because it will result in a table with 5.600.000 rows, and in the end I will have to convert it back to the original structure. But if there is no other way... Thank you for the help!
