
My data has 4 columns, A-D, which contain integers. I am adding a new column E whose first value is the same as the first value in column D. Each subsequent value in E should be the corresponding value in column D if the previous value in E is negative; otherwise it takes the corresponding value in column C.
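For illustration, with some made-up sample values (not from my actual file), the rule would give:

C: [ 5, -3,  2, -1]
D: [-2,  7, -4,  6]
E: [-2,  7,  2, -1]   # E[0] = D[0] = -2; -2 < 0, so E[1] = D[1] = 7; 7 >= 0, so E[2] = C[2] = 2; 2 >= 0, so E[3] = C[3] = -1

Here is my current implementation: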

import pandas as pd
import numpy as np

data = pd.read_excel('/Users/xxxx/Documents/PY Notebooks/Data/yyyy.xlsx')
data1 = data.copy()

data1['E'] = np.nan
data1.at[0, 'E'] = data1['D'][0]            # first value of E comes from D
l = len(data1)
for i in range(l - 1):
    if data1['E'][i] < 0:                   # previous E is negative -> take D
        data1.at[i + 1, 'E'] = data1['D'][i + 1]
    else:                                   # otherwise -> take C
        data1.at[i + 1, 'E'] = data1['C'][i + 1]
  • This'll be difficult to vectorize in the SIMD sense since each iteration's calculation depends on the previous calculation. Commented Nov 3, 2020 at 17:58
  • are there any assumptions about the data, e.g. D and E cannot both be negative? Commented Nov 3, 2020 at 17:59
  • It would be easier to get an answer if you provide both sample data and expected output. Commented Nov 3, 2020 at 18:45

2 Answers


TL;DR: go to the benchmark code and use Method 1.

Short Answer

No. Vectorization is not possible.

Long Answer

Theorem: For this particular task, the output of a given row cannot be determined from any backward rolling window whose length is smaller than the number of rows up to and including that row.

Thus, there is no way for this output logic to be processed in a vectorized way. (See this answer for an idea of how vectorization is performed in CPUs.) The output can only be computed sequentially from the beginning of the dataframe.

Proof: Consider a target row of a dataframe df. Assume there is a backward rolling window with size n < partial length, so a previous value of df["E"] exists before the window. We denote this previous value by state.

Consider a special case: df["C"] == -1 and df["D"] == 1 within the window.

  • Case 1 (state < 0): The output within this rolling window will be [1, -1, 1, -1, .....], making the last element (-1)^(n-1)
  • Case 2 (state >= 0): The output will be [-1, 1, -1, 1, .....], making the last element (-1)^(n)

Therefore, it is possible for the output df["E"] of the target row to be dependent on a state variable outside the window. QED.
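Below is a minimal sketch of this state dependence, using the special case above (the helper last_in_window and the toy values are purely illustrative):

import numpy as np

# Special case from the proof: C == -1 and D == 1 inside a window of length n
n = 5
C = -np.ones(n)
D = np.ones(n)

def last_in_window(state, C, D):
    # replay the update rule across the window, starting from the value
    # of E just before the window ("state")
    for c, d in zip(C, D):
        state = d if state < 0 else c
    return state

print(last_in_window(-1.0, C, D))  # Case 1 (state < 0):  prints  1.0 == (-1)**(n-1)
print(last_in_window(+1.0, C, D))  # Case 2 (state >= 0): prints -1.0 == (-1)**n

The same window contents thus produce different outputs for the target row depending on a value outside the window, which is exactly what rules out a window-based vectorized formulation.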

Useful Answer

Although vectorization is impossible, that does not mean significant acceleration cannot be achieved. A simple yet very efficient approach is to use a numba-compiled generator to perform the sequential generation. It only requires re-writing your logic as a generator function and adding two additional lines:

import numba

@numba.njit
def my_generator_func():
    ....

Of course, you may have to install numba first. If this is not possible, then using a plain generator without numba optimization is also fine.
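The calling side would then look roughly like this (assuming, as a sketch, that my_generator_func takes the two column arrays; the complete, benchmarked version is given below):

df["E"] = list(my_generator_func(df["C"].values, df["D"].values))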

Benchmark

The benchmark is performed on an i5-8250U (4C8T) laptop with 16GB RAM running 64-bit Debian 10. The Python version is 3.7.9 and pandas is 1.1.3. n = 10^7 (10 million) records are generated for benchmarking purposes.

Result:

1. numba-njit: 2.48s
2. plain generator (no numba): 5.13s
3. original: 271.15s

A more than 100x speedup over the original code can be achieved.

Code

from datetime import datetime
import pandas as pd
import numpy as np

n = 10000000  # a large number of rows
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
#print(df.head())

# ========== Method 1. generator + numba njit ==========
ti = datetime.now()

import numba

@numba.njit
def gen(plus: np.ndarray, minus: np.ndarray):  # plus <- column C, minus <- column D
    l = len(plus)
    assert len(minus) == l
    # first
    state = minus[0]
    yield state
    # second to last
    for i in range(l-1):
        state = minus[i+1] if state < 0 else plus[i+1]
        yield state

df["E"] = [i for i in gen(df["C"].values, df["D"].values)]

tf = datetime.now()
print(f"1. numba-njit: {(tf-ti).total_seconds():.2f}s")  # 1. numba-njit: 0.47s

# ========== Method 2. Generator without numba ==========
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
ti = datetime.now()

def gen_plain(plus: np.ndarray, minus: np.ndarray):  # same logic, without numba
    l = len(plus)
    assert len(minus) == l
    # first
    state = minus[0]
    yield state
    # second to last
    for i in range(l-1):
        state = minus[i+1] if state < 0 else plus[i+1]
        yield state

df["E"] = [i for i in gen_plain(df["C"].values, df["D"].values)]

tf = datetime.now()
print(f"2. plain generator (no numba): {(tf-ti).total_seconds():.2f}s")  #

# ========== Method 3. Direct iteration ==========
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
ti = datetime.now()

# code provided by the OP
df['E']=np.nan
df.at[0,'E'] = df['D'][0]
l=len(df)
for i in range(l - 1):
    if df['E'][i] < 0:
        df.at[i+1,'E'] = df['D'][i+1]
    else:
        df.at[i+1,'E'] = df['C'][i+1]

tf = datetime.now()
print(f"3. original: {(tf-ti).total_seconds():.2f}s") # 2. 26.61s

3 Comments

The proof has nothing to do with the feasibility of vectorization, or is based on the wrong definition of vectorization. Counterexample: by this proof, cumulative sum is not vectorizable.
An alternative perspective may also have a point. But IMHO, if a recursive function like cumsum could, under some definition, be considered vectorized, then any re-implementation that is more efficient than a simple for-loop could also be claimed to be vectorized. Such a definition would have no logical boundary and would not make strict sense. Therefore, I will stick to a strict definition based on whether the result can be computed using vectorized CPU instructions in parallel rather than sequentially. (cont'd)
Such a definition provides a clear criterion against which the feasibility of vectorizing a given algorithm can be examined. From this perspective, cumsum is in fact the best example of an efficient library built-in method that is not vectorized (in the strict sense). I won't go further on this topic. Anyone who disagrees can of course express their opinion freely.

I don't think you can vectorize this operation, as each row depends on the result of the previous one. That being said, there is still quite some room for optimization in your code. Let's first check your original implementation with some random data.

import numpy as np
import pandas as pd
import time

size = 10000000
data = np.random.randint(-2, 10, size=size)
data = data.reshape([size//4, 4])

time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
df['E'] = np.nan
df.at[0, 'E'] = df['D'][0]
for i in range(len(df) - 1):
    if df['E'][i] < 0:
        df.at[i + 1, 'E'] = df['D'][i + 1]
    else:
        df.at[i + 1, 'E'] = df['C'][i + 1]

print(f"Operation on pd df took {time.time() - time_start} seconds.")

Output:

Operation on pd df took 84.00791883468628 seconds.

As element-wise operations on the DataFrame are usually quite slow, we can operate on the underlying numpy arrays instead.

time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
c_vals = df["C"].values
d_vals = df["D"].values

e_vals = [d_vals[0]]
last_e = e_vals[0]
for i in range(len(df)-1):
    if last_e < 0:
        last_e = d_vals[i+1]
    else:
        last_e = c_vals[i+1]
    e_vals.append(last_e)

df['E'] = e_vals

print(f"Operation on np array took {time.time() - time_start} seconds.")

Output:

Operation on np array took 2.2387869358062744 seconds.

Now, since plain for loops are slow in Python, we can use a JIT compiler that handles numpy arrays, for instance numba.

import numba

time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
c_vals = df["C"].values
d_vals = df["D"].values

@numba.jit(nopython=True)
def numba_calc(c_vals, d_vals):
    e_vals = [d_vals[0]]
    last_e = e_vals[0]
    for i in range(len(c_vals)-1):
        if last_e < 0:
            last_e = d_vals[i+1]
        else:
            last_e = c_vals[i+1]
        e_vals.append(last_e)

    return e_vals

df["E"] = numba_calc(c_vals, d_vals)

print(f"Operation on np array with numba took {time.time() - time_start} seconds.")

Output:

Operation on np array with numba took 1.2623450756072998 seconds.

So especially for larger DataFrames, using numba will pay off, while just operating on the raw numpy arrays already gives a nice runtime improvement.
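Note that the numba timing above still includes the one-off JIT compilation triggered by the first call. As a rough sketch (assuming the warm-up slice has the same dtypes, so the compiled signature is reused), compilation can be excluded from the measurement with a warm-up call:

# warm up: trigger JIT compilation on a tiny input with the same dtypes
_ = numba_calc(c_vals[:2], d_vals[:2])

time_start = time.time()
df["E"] = numba_calc(c_vals, d_vals)  # timed without compilation overhead
print(f"Operation with pre-compiled numba took {time.time() - time_start} seconds.")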
