Dataframe - how to run calculations without using for loop?

Question

I have a pandas DataFrame

df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10], "C": [20, 30, 10]})

df
    A   B   C
0  10  20  20
1  20  30  30
2  30  10  10

and an ndarray w = np.array([0.2, 0.3, 0.4])

How do I add a column D whose value is the dot product of each row with w?

That is, the value for D[0] will be np.dot(df.iloc[0], w) = 16

and likewise the value for D[1] will be np.dot(df.iloc[1], w) = 25.

(I am thinking of the apply() function but am not sure how to use it; using a for loop might be inefficient.)

thanks,

MkWTF · Accepted Answer · 2020-01-11 17:08:51Z

4

You can do that by using pandas.DataFrame.apply over rows (axis=1):

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10], "C": [20, 30, 10]})
>>> w = np.array([0.2, 0.3, 0.4])
>>> df["D"] = df.apply(lambda p: np.dot(p.values, w), axis=1)
>>> df
    A   B   C     D
0  10  20  20  16.0
1  20  30  30  25.0
2  30  10  10  13.0

For efficiency's sake, though, you are probably better off converting the DataFrame to an ndarray and using matrix multiplication with matmul from numpy:

df["D"] = np.matmul(df.values, w)
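A minimal, self-contained sketch of the matmul route, using the same df and w as above. Selecting the input columns explicitly (the `cols` list is my addition, not part of the original answer) keeps the line safe to re-run after D has been added, since df.values would then include D and no longer match the length of w. The `@` operator and DataFrame.dot are shown only as equivalent spellings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10], "C": [20, 30, 10]})
w = np.array([0.2, 0.3, 0.4])

# Assumption for this sketch: only A, B and C take part in the dot product.
cols = ["A", "B", "C"]

# (3, 3) matrix times (3,) vector gives one dot product per row.
d_matmul = np.matmul(df[cols].to_numpy(), w)

# Equivalent spellings of the same computation.
d_operator = df[cols].to_numpy() @ w
d_pandas = df[cols].dot(w)  # stays in pandas, returns a Series

df["D"] = d_matmul
print(df)
```

All three variants should give 16, 25 and 13 for the three rows.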

edited Jan 11, 2020 at 17:08

answered Jan 11, 2020 at 17:04


7 Comments

TRex Over a year ago

perfect! yeh I think I had a brain freeze but that's what I was after. thanks again.

TRex Over a year ago

thanks I like the edit better. now to make it a bit complicated - what if I had separate W for each row?

MkWTF Over a year ago

You could either use pandas apply and do the dot product with the corresponding array, or use numpy matmul and then take the diagonal of the resulting matrix. Performance-wise I have no idea which one is better. Example: repl.it/repls/PrivateBubblyOrganization

MkWTF Over a year ago

Another approach would be to use something similar to what @FBruzzesi is suggesting. If your df has the same shape as w2, you can multiply them directly and sum across each row like this: (df * w2).sum(axis=1)

TRex Over a year ago

thanks that was really useful, my original problem is a lot more complex, I have about 2000 different Ws each of 9*9 dimensions and the df is about 2000 * 9! I liked the matmul approach but will have to tailor to my needs given the size
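The per-row-weights options mentioned in the comments above can be sketched as follows. The values in w2 are made up for illustration; the only assumption is that row i of w2 is the weight vector for row i of df. The einsum variant is not from the thread, it is just another way to write the same row-wise dot product.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10], "C": [20, 30, 10]})

# Hypothetical per-row weights: row i of w2 belongs to row i of df.
w2 = np.array([[0.2, 0.3, 0.4],
               [0.1, 0.1, 0.1],
               [1.0, 0.0, 0.0]])

X = df.to_numpy()

# Elementwise multiply, then sum across each row.
d_elementwise = (X * w2).sum(axis=1)

# matmul then diagonal: entry (i, i) of X @ w2.T is row i dotted with weights i.
d_diagonal = np.diag(X @ w2.T)

# Same row-wise dot product written as an einsum.
d_einsum = np.einsum("ij,ij->i", X, w2)

print(d_elementwise)
```

Note that the diagonal variant builds a full n-by-n matrix only to keep n of its entries, so for something like 2000 rows the elementwise or einsum forms are likely the better fit.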


FBruzzesi · Accepted Answer · 2020-01-11 17:34:41Z

2

You can also use a vectorized approach that exploits NumPy broadcasting:

df['D'] = np.sum(df.to_numpy() * w, axis=1)
# .to_numpy() is available from pandas 0.24, if I remember correctly; before that, use .values

df
    A   B   C     D
0  10  20  20  16.0
1  20  30  30  25.0
2  30  10  10  13.0

Doing a performance analysis with %timeit in the Spyder editor, here is what I got, ordered from slowest to fastest:

%timeit (df * w).sum(axis=1)
2.15 ms ± 590 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.apply(lambda p: np.dot(p.values, w), axis=1)
900 µs ± 76.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.sum((df.to_numpy() * w), axis=1)
19.2 µs ± 481 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
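The timings above are for the 3-row example frame, where fixed per-call overhead probably dominates, so the ordering may well change on larger data. Below is a rough sketch for repeating the comparison outside IPython with the standard timeit module. The 2,000-row random frame and the names `big`, `candidates` and `timings` are assumptions made for illustration, and no expected numbers are given because they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 2,000 random rows instead of the 3-row example.
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.random((2_000, 3)), columns=["A", "B", "C"])
w = np.array([0.2, 0.3, 0.4])

candidates = {
    "pandas (df * w).sum": lambda: (big * w).sum(axis=1),
    "apply + np.dot": lambda: big.apply(lambda p: np.dot(p.values, w), axis=1),
    "numpy broadcast sum": lambda: np.sum(big.to_numpy() * w, axis=1),
    "numpy matmul": lambda: np.matmul(big.to_numpy(), w),
}

reference = np.matmul(big.to_numpy(), w)
timings = {}
for name, fn in candidates.items():
    # Check each variant gives the same answer before timing it.
    assert np.allclose(np.asarray(fn()), reference)
    timings[name] = timeit.timeit(fn, number=5) / 5  # mean seconds per call

# Print from slowest to fastest, as in the listing above.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>22}: {seconds * 1e3:.3f} ms per call")
```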

edited Jan 11, 2020 at 17:34

answered Jan 11, 2020 at 17:20


5 Comments

TRex Over a year ago

thanks @FBruzzesi, that was helpful. my original problem is a lot more complex. I have about 2000 different Ws each of 9*9 dimensions and the df is about 2000 * 9!

FBruzzesi Over a year ago

That's when speed is important :)

TRex Over a year ago

btw what if my W had 3*3 dimensions (i.e. W1 = np.random.rand(3,3), W2 = W1*2, W3 = W2*2) and the value of D is calculated as np.dot(row, np.dot(W, rowT))? how would you incorporate that using your suggested formula? I think I am able to get it using .apply() but am curious to know if there is a faster way to do that.

TRex Over a year ago

so individual rows will have their own W for calculations.

FBruzzesi Over a year ago

I think it is possible to use broadcasting anyway; however, it gets messy and hard to debug really quickly, while np.dot(row, np.dot(W, rowT)) is immediate to understand.
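For reference, one possible vectorized form of the per-row quadratic form discussed in these comments is sketched below. It assumes the matrices are stacked into a single array of shape (n_rows, 3, 3), here called `Ws` (a name introduced for this sketch), and it uses np.einsum, which is one option among several rather than what either commenter used. The readable per-row loop is kept alongside it as a cross-check.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10], "C": [20, 30, 10]})
X = df.to_numpy(dtype=float)

# One 3x3 matrix per row, as in the comment: W2 = W1*2, W3 = W2*2.
rng = np.random.default_rng(0)
W1 = rng.random((3, 3))
Ws = np.stack([W1, W1 * 2, W1 * 4])  # shape (3, 3, 3)

# Readable baseline: row . (W . row) for each (row, W) pair.
d_loop = np.array([np.dot(row, np.dot(W, row)) for row, W in zip(X, Ws)])

# einsum: for each n, sum over i and j of X[n, i] * Ws[n, i, j] * X[n, j].
d_einsum = np.einsum("ni,nij,nj->n", X, Ws, X)

# Same computation with batched matmul: (n,1,3) @ (n,3,3) @ (n,3,1) -> (n,1,1).
d_batched = (X[:, None, :] @ Ws @ X[:, :, None])[:, 0, 0]

print(np.allclose(d_loop, d_einsum), np.allclose(d_loop, d_batched))
```

Keeping the loop version around as a test oracle is one way to address the debugging concern: the broadcasted result can be checked against it on a small sample.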
