I have 12,000 CSV files, and every file has 6,000 rows. My code uses a for loop, and I think that is why the execution time is so high. Does anyone know how to rewrite this piece of code with pandas so that it runs faster?

for i in range(len(df)):
    if ((df['EOG_Start_model'].values[i] - df['EOG_Min_model'].values[i]) < (df['EOG_start_farm'].values[i] - df['EOG_Min_Farm'].values[i])) & ((df['EOG_Max_model'].values[i] - df['EOG_Min_model'].values[i]) < (df['EOG_Max_Farm'].values[i] - df['EOG_Min_Farm'].values[i])) & (df['Avg'].values[i] > 2):
        # print('EOG')
        df['EOG_flag'].values[i] = 1

    if ((df['EOG_Max_model'].values[i] - df['EOG_Min_model'].values[i]) < (df['EOG_Max_Farm'].values[i] - df['EOG_Min_Farm'].values[i])) & (df['Avg'].values[i] > 2):
        # print('gust')
        df['Gust_flag'].values[i] = 1

Note: this code works correctly; only the execution time is high.

1 Answer

You can use a vectorized solution: create each boolean mask separately, chain them together with &, and set the values with numpy.where:

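import numpy as np

# farm ranges used on the right-hand side of the two comparisons in the loop
x = df['EOG_start_farm'].values - df['EOG_Min_Farm'].values
y = df['EOG_Max_Farm'].values - df['EOG_Min_Farm'].values

m1 = (df['EOG_Start_model'].values - df['EOG_Min_model'].values) < x
# the max-range comparison uses EOG_Max_Farm, as in the loop
m2 = (df['EOG_Max_model'].values - df['EOG_Min_model'].values) < y
m3 = df['Avg'].values > 2
m23 = m2 & m3

# keep the existing flag wherever the condition is False
df['EOG_flag'] = np.where(m1 & m23, 1, df['EOG_flag'].values)
df['Gust_flag'] = np.where(m23, 1, df['Gust_flag'].values)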
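
A minimal sketch of the same idea wrapped in a function and run on a tiny hand-made frame, so the result can be checked by eye. The function name set_flags and the three sample rows are made up for illustration; only the column names come from the question, and the two flag columns are assumed to exist already. Note that the second mask compares against EOG_Max_Farm - EOG_Min_Farm, as the loop in the question does.

```python
import numpy as np
import pandas as pd


def set_flags(df):
    """Return a copy of df with EOG_flag / Gust_flag set to 1 where the
    conditions from the question's loop hold (hypothetical helper)."""
    farm_start_range = df['EOG_start_farm'].values - df['EOG_Min_Farm'].values
    farm_max_range = df['EOG_Max_Farm'].values - df['EOG_Min_Farm'].values
    model_min = df['EOG_Min_model'].values

    m1 = (df['EOG_Start_model'].values - model_min) < farm_start_range
    m2 = (df['EOG_Max_model'].values - model_min) < farm_max_range
    m3 = df['Avg'].values > 2

    out = df.copy()
    # np.where keeps the existing flag wherever the mask is False
    out['EOG_flag'] = np.where(m1 & m2 & m3, 1, out['EOG_flag'].values)
    out['Gust_flag'] = np.where(m2 & m3, 1, out['Gust_flag'].values)
    return out


# Made-up sample: row 0 meets every condition, row 1 fails only the
# start-range test, row 2 fails only Avg > 2.
sample = pd.DataFrame({
    'EOG_Start_model': [1.0, 5.0, 1.0],
    'EOG_Min_model':   [0.0, 0.0, 0.0],
    'EOG_Max_model':   [2.0, 2.0, 2.0],
    'EOG_start_farm':  [3.0, 3.0, 3.0],
    'EOG_Min_Farm':    [0.0, 0.0, 0.0],
    'EOG_Max_Farm':    [4.0, 4.0, 4.0],
    'Avg':             [3.0, 3.0, 1.0],
    'EOG_flag':        [0, 0, 0],
    'Gust_flag':       [0, 0, 0],
})
flagged = set_flags(sample)
print(flagged[['EOG_flag', 'Gust_flag']])
```

With these rows, flagged['EOG_flag'] is [1, 0, 0] and flagged['Gust_flag'] is [1, 1, 0]. The input frame is left unchanged because the function works on a copy.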

Performance:

import numpy as np
import pandas as pd

np.random.seed(2019)

N = 6000
c = ['EOG_Start_model','EOG_Min_model','EOG_start_farm','EOG_Min_Farm','EOG_Max_model',
     'EOG_Max_Farm','Avg','EOG_flag','Gust_flag']
df = pd.DataFrame(np.random.rand(N, 9), columns=c)
print(df)

In [91]: %%timeit
    ...: x = df['EOG_start_farm'].values-df['EOG_Min_Farm'].values
    ...: m1 = (df['EOG_Start_model'].values-df['EOG_Min_model'].values) < x
    ...: m2 = (df['EOG_Max_model'].values-df['EOG_Min_model'].values) < x
    ...: m3 = df['Avg'].values > 2
    ...: m23 = m2 & m3
    ...: 
    ...: df['EOG_flag'] = np.where(m1 & m2 & m3, 1, df['EOG_flag'].values)
    ...: df['Gust_flag'] = np.where(m2 & m3, 1, df['Gust_flag'].values)
    ...: 
597 µs ± 6.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [93]: %%timeit
    ...: for i in range(len(df)):
    ...:     if ((df['EOG_Start_model'].values[i]-df['EOG_Min_model'].values[i])<(df['EOG_start_farm'].values[i]-df['EOG_Min_Farm'].values[i])) &((df['EOG_Max_model'].values[i]-df['EOG_Min_model'].values[i])<(df['EOG_Max_Farm'].values[i]-df['EOG_Min_Farm'].values[i]))&((df['Avg'].values[i]>2)):
    ...:           #print('EOG')
    ...:           df['EOG_flag'].values[i]=1
    ...: 
    ...:     if ((df['EOG_Max_model'].values[i]-df['EOG_Min_model'].values[i])<(df['EOG_Max_Farm'].values[i]-df['EOG_Min_Farm'].values[i]))&((df['Avg'].values[i]>2)):
    ...:             #print('gust')
    ...:             df['Gust_flag'].values[i]=1
231 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
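
One caveat about the test data above: np.random.rand draws from [0, 1), so Avg > 2 is never true and neither flag is ever set during the benchmark. The sketch below is one way to check that a loop and the vectorized masks agree on data where the condition does fire. Scaling Avg by 4, the 500-row size, and all variable names are assumptions made for this check, not part of the original answer; the loop is a simplified per-row rewrite of the question's logic rather than the exact code that was timed.

```python
import numpy as np
import pandas as pd

cols = ['EOG_Start_model', 'EOG_Min_model', 'EOG_start_farm', 'EOG_Min_Farm',
        'EOG_Max_model', 'EOG_Max_Farm', 'Avg']

rng = np.random.default_rng(2019)
base = pd.DataFrame(rng.random((500, len(cols))), columns=cols)
base['Avg'] *= 4  # assumption: spread Avg over [0, 4) so that Avg > 2 occurs

# Reference result: row-by-row logic, flags starting at 0.
eog_loop = np.zeros(len(base), dtype=int)
gust_loop = np.zeros(len(base), dtype=int)
for i, r in enumerate(base.itertuples(index=False)):
    s_ok = (r.EOG_Start_model - r.EOG_Min_model) < (r.EOG_start_farm - r.EOG_Min_Farm)
    m_ok = (r.EOG_Max_model - r.EOG_Min_model) < (r.EOG_Max_Farm - r.EOG_Min_Farm)
    if s_ok and m_ok and r.Avg > 2:
        eog_loop[i] = 1
    if m_ok and r.Avg > 2:
        gust_loop[i] = 1

# Vectorized result: the same three conditions as whole-column masks.
start_ok = (base['EOG_Start_model'] - base['EOG_Min_model']) < (base['EOG_start_farm'] - base['EOG_Min_Farm'])
max_ok = (base['EOG_Max_model'] - base['EOG_Min_model']) < (base['EOG_Max_Farm'] - base['EOG_Min_Farm'])
high_avg = base['Avg'] > 2
eog_vec = (start_ok & max_ok & high_avg).astype(int).to_numpy()
gust_vec = (max_ok & high_avg).astype(int).to_numpy()

same = np.array_equal(eog_loop, eog_vec) and np.array_equal(gust_loop, gust_vec)
print('loop and vectorized results agree:', same)
```

Because the flags start at 0 here, converting the mask to int gives the same result as np.where(mask, 1, existing_flag).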

3 Comments

This code gives the same result as the for loop, but there is no noticeable difference in execution time; it is almost the same.
@Nickel - OK, add some tests.
@Nickel - It is 387 times faster than the original solution.
