I have a pandas DataFrame

df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10], "C": [20, 30, 10]})
df
    A   B   C
0  10  20  20
1  20  30  30
2  30  10  10

and an ndarray w = np.array([0.2, 0.3, 0.4]).

How do I add a column D whose value is the dot product of each row with w?

That is, D[0] should be np.dot(df.iloc[0], w) = 16.

Likewise, D[1] should be np.dot(df.iloc[1], w) = 25.

(I am thinking of the apply() function but am not sure how to use it; a for loop might be inefficient.)

thanks,

2 Answers

You can do this with pandas.DataFrame.apply over rows (axis=1):

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10], "C": [20, 30, 10]})
>>> w = np.array([0.2, 0.3, 0.4])
>>> df["D"] = df.apply(lambda p: np.dot(p.values, w), axis=1)
>>> df
    A   B   C     D
0  10  20  20  16.0
1  20  30  30  25.0
2  30  10  10  13.0

For efficiency, though, you are probably better off converting the DataFrame to an ndarray and using matrix multiplication with numpy's matmul:

df["D"] = np.matmul(df.values, w)
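A minimal sketch comparing the two approaches above on the question's data. One assumption that is not in the original answer: the feature columns are selected explicitly through a hypothetical `cols` list, so that re-running the assignment after "D" already exists does not feed a fourth column into the product.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10], "C": [20, 30, 10]})
w = np.array([0.2, 0.3, 0.4])

# Hypothetical helper list: restrict the product to the original columns,
# so an existing "D" column is never included in the multiplication.
cols = ["A", "B", "C"]

# Row-wise apply: one Python-level call to np.dot per row.
d_apply = df[cols].apply(lambda row: np.dot(row.to_numpy(), w), axis=1)

# Matrix-vector product: (3, 3) @ (3,) -> (3,), computed in one NumPy call.
d_matmul = df[cols].to_numpy() @ w

df["D"] = d_matmul
print(df)
```

The `@` operator here is equivalent to `np.matmul` for these 2-D and 1-D inputs.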

7 Comments

Perfect! Yes, I think I had a brain freeze, but that's what I was after. Thanks again.
Thanks, I like the edit better. Now to make it a bit more complicated: what if I had a separate W for each row?
You could either use pandas apply and do the dot product with the corresponding array, or use numpy matmul and then take the diagonal of the resulting matrix. Performance-wise I have no idea which one is better. Example: repl.it/repls/PrivateBubblyOrganization
Another approach would be to use something similar to what @FBruzzesi is suggesting. If your df has the same shape as w2, you can multiply them directly and sum across each row like this: (df * w2).sum(axis=1)
Thanks, that was really useful. My original problem is a lot more complex: I have about 2000 different Ws, each of 9*9 dimensions, and the df is about 2000 * 9! I liked the matmul approach but will have to tailor it to my needs given the size.
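For the "separate W for each row" case raised in the comments above, here is a rough sketch of the options mentioned. The array `w2` is made up for illustration (row i holds the weights for row i of df), and the `einsum` variant is an addition of mine rather than something proposed in the thread.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10], "C": [20, 30, 10]})

# Hypothetical per-row weights: w2[i] is the weight vector for df.iloc[i].
w2 = np.array([[0.2, 0.3, 0.4],
               [0.1, 0.1, 0.1],
               [1.0, 0.0, 0.0]])

X = df.to_numpy()

# Elementwise product, then sum across the columns of each row.
d_elementwise = (X * w2).sum(axis=1)

# Same thing with einsum: for each row i, sum over j of X[i, j] * w2[i, j].
d_einsum = np.einsum("ij,ij->i", X, w2)

# matmul followed by the diagonal also works, but it computes all
# n * n row-by-row dot products and then discards the off-diagonal ones.
d_diag = np.diag(X @ w2.T)
```

With around 2000 rows, the diagonal route builds a 2000 x 2000 intermediate matrix, so the elementwise or `einsum` form is likely the lighter choice, though that is worth measuring on the real data.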

You can also use a vectorized approach that exploits numpy broadcasting:

# .to_numpy() is available from pandas 0.24 (if I remember correctly); before that, use .values
df['D'] = np.sum(df.to_numpy() * w, axis=1)

df
    A   B   C     D
0  10  20  20  16.0
1  20  30  30  25.0
2  30  10  10  13.0

Doing a performance analysis in the Spyder editor using %timeit, here is what I got, ordered from slowest to fastest:

%timeit (df * w).sum(axis=1)
2.15 ms ± 590 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.apply(lambda p: np.dot(p.values, w), axis=1)
900 µs ± 76.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.sum((df.to_numpy() * w), axis=1)
19.2 µs ± 481 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
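If you want to reproduce this kind of comparison outside IPython, something like the following should work with the standard-library `timeit` module. It is only a sketch: the labels and the repetition count are arbitrary choices of mine, and the numbers will differ by machine and pandas version, so no timings are shown.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10], "C": [20, 30, 10]})
w = np.array([0.2, 0.3, 0.4])

# The candidates timed above, plus np.matmul from the other answer.
candidates = {
    "pandas multiply + sum": lambda: (df * w).sum(axis=1),
    "apply + np.dot": lambda: df.apply(lambda p: np.dot(p.to_numpy(), w), axis=1),
    "numpy multiply + sum": lambda: np.sum(df.to_numpy() * w, axis=1),
    "numpy matmul": lambda: np.matmul(df.to_numpy(), w),
}

# Sanity check first: every candidate should give the same values.
results = {name: np.asarray(fn()) for name, fn in candidates.items()}

repeats = 100  # arbitrary; increase for more stable numbers
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=repeats) / repeats
    print(f"{name:>22}: {seconds * 1e6:10.1f} us per call")
```

A 3-row frame mostly measures fixed overhead, so it is worth re-running with a frame closer to the real size before drawing conclusions.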

5 Comments

thanks @FBruzzesi, that was helpful. my original problem is a lot more complex. I have about 2000 different Ws each of 9*9 dimensions and the df is about 2000 * 9!
That's when speed is important :)
btw what if my W was 3*3 (i.e. W1 = np.random.rand(3,3), W2 = W1*2, W3 = W2*2) and the value of D is calculated as np.dot(row, np.dot(W, rowT))? How would you incorporate that using your suggested formula? I think I am able to get it using .apply() but am curious to know if there is a faster way to do that.
So individual rows will have their own W for calculations.
I think it is possible to use broadcasting anyway; however, it gets messy and hard to debug really quickly, while np.dot(row, np.dot(W, rowT)) is immediate to understand.
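As one possible way to vectorize the per-row quadratic form discussed above, `np.einsum` can contract a stack of matrices against the rows in a single call. This is a sketch under assumptions: the Ws are stacked into one array of shape (n_rows, 3, 3), and the random matrices only mirror the W1, W2, W3 example from the comment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rows of the question's DataFrame as a float array, shape (3, 3).
X = np.array([[10, 20, 20],
              [20, 30, 30],
              [30, 10, 10]], dtype=float)

# One 3x3 matrix per row, stacked along axis 0 (W2 = W1*2, W3 = W2*2).
W1 = rng.random((3, 3))
W = np.stack([W1, W1 * 2, W1 * 4])  # shape (3, 3, 3)

# Readable reference version: row_i . (W_i . row_i) for each row.
d_loop = np.array([np.dot(row, np.dot(Wi, row)) for row, Wi in zip(X, W)])

# Vectorized version: sum over j and k of X[i, j] * W[i, j, k] * X[i, k].
d_einsum = np.einsum("ij,ijk,ik->i", X, W, X)
```

Keeping the loop version around as a reference makes the `einsum` subscripts easier to check, which addresses some of the debugging concern.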
