I have the following data frame
data =
a b c val
0 'a1' 'b1' 'c1' 5
1 'a1' 'b1' 'c2' 10
2 'a1' 'b2' 'c3' 1
3 'a2' 'b1' 'c4' 30
4 'a2' 'b1' 'c5' 20
5 'a2' 'b1' 'c6' 10
I want to create a new data frame with same columns, where the value of a given (a,b,c) is equal to sum of the values of all (aa,bb, cc)'s if a==aa, b==bb and value(aa,bb,cc) >= value(a,b,c). Here aa is an alias for a, and so on.
The desired outcome is
data_new =
a b c val
0 'a1' 'b1' 'c1' 15
1 'a1' 'b1' 'c2' 10
2 'a1' 'b2' 'c3' 1
3 'a2' 'b1' 'c4' 30
4 'a2' 'b1' 'c5' 50
5 'a2' 'b1' 'c6' 60
And I do the following:
data['key'] = data['a']+data['b']
data_new = data
data_new = data.apply(lambda row: data[ (data.key==row.key)
& (data.val>=row.val) ].val.sum(), axis=1)
del data_new['key']
Actually, in my real data, I have more columns like a and b, that is why boolean comparison such as
(data.a==row.a) & (data.a==row.b) & ...
is slower. The thing is that even the current implementation is not as fast as I want. I have huge tables. Is there any way to do the same thing faster & more efficient?