
I'm trying to avoid a time-consuming for loop by using aggregate on a data.frame, but I need the values of one of the columns to enter into the final computation.

dat <- data.frame(key = c('a', 'b', 'a','b'), 
rate = c(0.5,0.4,1,0.6), 
v1 = c(4,0,3,1), 
v2 = c(2,0,9,4))

>dat
  key rate v1 v2
1   a  0.5  4  2
2   b  0.4  0  0
3   a  1.0  3  9
4   b  0.6  1  4

aggregate(dat[,-1], list(key=dat$key),  
    function(x, y=dat$rate){
        rates <- as.numeric(y)
        values <- as.numeric(x)
        return(sum(values*rates)/sum(rates))
    })

Note: The function is just an example!
The problem with this implementation is that y=dat$rate supplies all 4 rates in dat, when what I want is just the 2 rates belonging to the group being aggregated. Any suggestions on how I could do this? Thanks!

Did either of these answers work out for you? (Commented Oct 31, 2012 at 10:08)

2 Answers


Here's what I managed to achieve, using the "data.table" package:

library(data.table)
DT <- data.table(dat, key = "key")
DT[, list(v1 = sum(rate * v1)/sum(rate), v2 = sum(rate * v2)/sum(rate)), by = "key"]
#    key       v1       v2
# 1:   a 3.333333 6.666667
# 2:   b 0.600000 2.400000

OK. So that's easy to write out for just two variables, but what about when we have a lot more columns? Use lapply(.SD, ...) in conjunction with your function:

First, some data:

set.seed(1)
dat <- data.frame(key = rep(c("a", "b"), times = 10),
                  rate = runif(20, min = 0, max = 1),
                  v1 = sample(10, 20, replace = TRUE),
                  v2 = sample(20, 20, replace = TRUE),
                  v3 = sample(30, 20, replace = TRUE),
                  x1 = sample(5, 20, replace = TRUE),
                  x2 = sample(6:10, 20, replace = TRUE),
                  x3 = sample(11:15, 20, replace = TRUE))
library(data.table)
datDT <- data.table(dat, key = "key")
datDT
#     key       rate v1 v2 v3 x1 x2 x3
#  1:   a 0.26550866 10 17 28  3  9 15
#  2:   a 0.57285336  7 16 14  2  7 13
#  3:   a 0.20168193  3 11 20  4  9 14
#  4:   a 0.94467527  1  1 15  4  6 13
#  5:   a 0.62911404  9 15  3  2 10 12
#  6:   a 0.20597457  5 10 11  2 10 13
#  7:   a 0.68702285  5  9 11  4  7 11
#  8:   a 0.76984142  9  2 15  4  6 15
#  9:   a 0.71761851  8  7 26  3  9 13
# 10:   a 0.38003518  8 14 24  5  8 15
# 11:   b 0.37212390  3 13  9  4  7 13
# 12:   b 0.90820779  2 12 10  2 10 11
# 13:   b 0.89838968  4 16  8  2  7 13
# 14:   b 0.66079779  4 10 23  1  8 12
# 15:   b 0.06178627  4 14 27  1  8 13
# 16:   b 0.17655675  6 18 26  1  9 11
# 17:   b 0.38410372  2  5 11  5  8 14
# 18:   b 0.49769924  7  2 27  4  6 13
# 19:   b 0.99190609  2 11 12  3  6 13
# 20:   b 0.77744522  5  9 29  4  9 13

Second, aggregate:

datDT[, lapply(.SD, function(x, y = rate) sum(y * x)/sum(y)), by = "key"]
#    key      rate       v1        v2       v3       x1       x2       x3
# 1:   a 0.6501303 6.335976  8.634691 15.75915 3.363832 7.658762 13.19152
# 2:   b 0.7375793 3.595585 10.749705 16.26582 2.792390 7.741787 12.57301

If you have a really large dataset, you might want to explore data.table in general.
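If the weighted average of rate itself is not wanted in the result, one possible variant (a sketch, not necessarily how the answerer would write it) restricts .SD to the value columns with .SDcols and swaps the anonymous function for base R's weighted.mean. It is shown on the four-row data from the question so the numbers are easy to check; value_cols is a name introduced here purely for illustration.

```r
library(data.table)

# Four-row example data from the question
dat <- data.frame(key  = c("a", "b", "a", "b"),
                  rate = c(0.5, 0.4, 1, 0.6),
                  v1   = c(4, 0, 3, 1),
                  v2   = c(2, 0, 9, 4))
DT <- as.data.table(dat)

# Hypothetical helper name: every column except the grouping and weight columns
value_cols <- setdiff(names(DT), c("key", "rate"))

# .SD holds only value_cols; rate is still visible inside j for each group
res <- DT[, lapply(.SD, weighted.mean, w = rate),
          by = "key", .SDcols = value_cols]
res
```

Because rate is excluded from .SD, there is no need to drop it from the result afterwards.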


For what it is worth, I was also successful in base R, but I'm not sure how efficient this would be, particularly because of the transposing and so on.

t(sapply(split(dat, dat[1]), 
         function(x, y = 3:ncol(dat)) {
           V1 <- vector()
           for (i in 1:length(y)) {
             V1[i] <- sum(x[2] * x[y[i]])/sum(x[2])
           }
           V1
         }))
#       [,1]      [,2]     [,3]     [,4]     [,5]     [,6]
# a 6.335976  8.634691 15.75915 3.363832 7.658762 13.19152
# b 3.595585 10.749705 16.26582 2.792390 7.741787 12.57301
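The inner for loop can probably be avoided as well. The following is a minimal base R sketch, assuming the weight column is called rate and the grouping column key as in the question; value_cols is an illustrative name. It uses the question's four-row data rather than the random data above.

```r
# Four-row example data from the question
dat <- data.frame(key  = c("a", "b", "a", "b"),
                  rate = c(0.5, 0.4, 1, 0.6),
                  v1   = c(4, 0, 3, 1),
                  v2   = c(2, 0, 9, 4))

# Hypothetical helper name: the columns to be averaged
value_cols <- setdiff(names(dat), c("key", "rate"))

# Split by key, then take the rate-weighted mean of each value column
res <- t(sapply(split(dat, dat$key), function(g) {
  sapply(g[value_cols], weighted.mean, w = g$rate)
}))
res
```

Unlike the loop version, the result keeps the column names v1 and v2.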

8 Comments

I reckon that datDT[, lapply(.SD, function(x,y) { sum(x * y) / sum(y)}, y = rate), by = key][,setdiff(names(datDT), 'rate'), with = F] does the trick, and is slightly easier to understand
You can even replace the anonymous function with weighted.mean if you want
@mnel, I was fully expecting your input here ;) I was about to hit you up in chat for advice since I've just started exploring data.table a few days ago.
@mnel, regarding use of weighted.mean I thought it would be best to keep it as is considering the title of the OP's question.
:). Appending [,setdiff(names(datDT), 'rate'), with = F] will remove the rate column - this column is not particularly meaningful

One solution is to use ddply from the plyr package:

library(plyr)
res = ddply(dat, .(key), summarise, result = sum(v1 * rate) / sum(rate))
> res
  key   result
1   a 3.333333
2   b 0.600000

If you want to apply this to all the v columns, I would recommend first changing your data structure a bit:

library(reshape2)  # provides melt()
dat = melt(dat, id.vars = c("key", "rate"))
> dat
  key rate variable value
1   a  0.5       v1     4
2   b  0.4       v1     0
3   a  1.0       v1     3
4   b  0.6       v1     1
5   a  0.5       v2     2
6   b  0.4       v2     0
7   a  1.0       v2     9
8   b  0.6       v2     4

and then using ddply again:

res = ddply(dat, .(key, variable), summarise, result = sum(value * rate) / sum(rate))
> res
  key variable   result
1   a       v1 3.333333
2   a       v2 6.666667
3   b       v1 0.600000
4   b       v2 2.400000

...or, if you need a base R solution, you can use by (applied here to the original, unmelted dat):

res = by(dat, list(dat$key), function(x) sum(x$v1 * x$rate) / sum(x$rate))
> res
: a
[1] 3.333333
------------------------------------------------------------ 
: b
[1] 0.6
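If reshaping to long format is too costly, one base R alternative that keeps the wide layout is to weight all value columns at once and sum them per group with rowsum. This is only a sketch under the question's assumptions (a key column and a rate column); value_cols, wsum and rsum are names made up for this example.

```r
# Four-row example data from the question
dat <- data.frame(key  = c("a", "b", "a", "b"),
                  rate = c(0.5, 0.4, 1, 0.6),
                  v1   = c(4, 0, 3, 1),
                  v2   = c(2, 0, 9, 4))

# Hypothetical helper name: the columns to be averaged
value_cols <- setdiff(names(dat), c("key", "rate"))

# Multiply every value column by rate, then sum within each key
wsum <- rowsum(as.matrix(dat[value_cols]) * dat$rate, dat$key)

# Sum of the rates within each key (same group order as wsum)
rsum <- as.vector(rowsum(dat$rate, dat$key))

# Weighted mean per key and column: one row per key, one column per variable
res <- wsum / rsum
res
```

The result is a matrix with the keys as row names, so no melt or reverse reshape is involved.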

5 Comments

Thank you for your answer, but it is not exactly what I want!
I have more columns, so this solution needs to be replicated for all V columns! With a cbind at the end (?)
I extended my answer to include this additional requirement.
Trying to be more specific: each V column must be the result of a function which depends on V itself and the rate column, grouped by the key values.
I was trying to avoid that reorganization of the data, because it will result in a table with 5.600.000 rows. And in the end, I will have to convert it back to the original structure. But if there is no other way... Thank you for the help!
