Here is my question. Take the dataframe below as an example:

[Image: the example dataframe df]

  • The dataframe df has 8 columns, each containing finite values.
  • What I want to do:
    • a. Loop over the dataframe row by row.
    • b. In each row, replace the value in each of the columns B1, B2, B3, B4, B5, B6 with that value multiplied by the value in column A (B* x A).

Code like this:

 col_B = ["B1", "B2", "B3", "B4", "B5", "B6"]
 for i in range(len(df)):
     for col in col_B:
         # assumes the default RangeIndex, so label i is also position i
         df.loc[i, col] = df.loc[i, col] * df.loc[i, "A"]

In my real data, which contains 224 rows and 9 columns, looping over all these cells takes 0:01:03.

How can I speed up this loop in pandas?

Any advice would be appreciated.

1 Answer

You can first select the B columns with filter and then multiply them by column A with mul:

print(df.filter(like='B').mul(df.A, axis=0))

Sample:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A':[1,2,3],
                   'B1':[4,5,6],
                   'B2':[7,8,9],
                   'B3':[1,3,5],
                   'B4':[5,3,6],
                   'B5':[7,4,3],
                   'B6':[1,3,7]})

print (df)
   A  B1  B2  B3  B4  B5  B6
0  1   4   7   1   5   7   1
1  2   5   8   3   3   4   3
2  3   6   9   5   6   3   7

print(df.filter(like='B').mul(df.A, axis=0))
   B1  B2  B3  B4  B5  B6
0   4   7   1   5   7   1
1  10  16   6   6   8   6
2  18  27  15  18   9  21

If you also need column A, use concat:

print (pd.concat([df.A, df.filter(like='B').mul(df.A, axis=0)], axis=1))
   A  B1  B2  B3  B4  B5  B6
0  1   4   7   1   5   7   1
1  2  10  16   6   6   8   6
2  3  18  27  15  18   9  21
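
The question asks for the B columns to be changed in df itself, so here is a minimal sketch of assigning the result back instead of building a new frame. It assumes the columns to scale are exactly those whose labels start with "B"; the name b_cols is only for illustration. Note that filter(like='B') matches any label that contains "B", so an explicit list can be safer if other column names include that letter.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B1': [4, 5, 6],
                   'B2': [7, 8, 9]})

# Labels starting with "B" (assumed naming convention, as in the question)
b_cols = [c for c in df.columns if c.startswith('B')]

# Multiply every B column by A row-wise and write the result back into df
df[b_cols] = df[b_cols].mul(df['A'], axis=0)

print(df)
```

Column A is left untouched, so no concat is needed in this variant.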

Timings:

len(df)=3:

In [416]: %timeit (pd.concat([df.A, df.filter(like='B').mul(df.A, axis=0)], axis=1))
1000 loops, best of 3: 1.01 ms per loop

In [417]: %timeit loop(df)
100 loops, best of 3: 3.28 ms per loop

len(df)=30k:

In [420]: %timeit (pd.concat([df.A, df.filter(like='B').mul(df.A, axis=0)], axis=1))
The slowest run took 4.00 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 3 ms per loop

In [421]: %timeit loop(df)
1 loop, best of 3: 35.6 s per loop

Code for timings:

import pandas as pd

df = pd.DataFrame({'A':[1,2,3],
                   'B1':[4,5,6],
                   'B2':[7,8,9],
                   'B3':[1,3,5],
                   'B4':[5,3,6],
                   'B5':[7,4,3],
                   'B6':[1,3,7]})

print (df)

df = pd.concat([df]*10000).reset_index(drop=True)

print (pd.concat([df.A, df.filter(like='B').mul(df.A, axis=0)], axis=1))

def loop(df):
    for i in range(0,len(df),1):
         col_B = ["B1","B2","B3","B4","B5","B6",]
         for j in range(len(col_B)):
             df[col_B[j]].iloc[i] = df[col_B[j]].iloc[i]*df.A.iloc[i]  
    return df

print (loop(df))
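
For comparison, the same row-wise multiplication can be sketched with plain NumPy broadcasting. This is an alternative to mul, not something timed above, and the names b_cols, scaled and result are only for illustration. It assumes all B columns are numeric.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B1': [4, 5, 6],
                   'B2': [7, 8, 9]})

b_cols = ['B1', 'B2']

# Reshape A to (n, 1) so it broadcasts across the (n, k) block of B values
scaled = df[b_cols].to_numpy() * df['A'].to_numpy()[:, np.newaxis]

# Wrap the array back into a DataFrame with the original labels
result = pd.DataFrame(scaled, columns=b_cols, index=df.index)
print(result)
```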