
My data has 4 columns, A-D, which contain integers. I am adding a new column E whose first value is the same as the first value in column D. Each subsequent value in E should be the corresponding value in column D if the previous value in E is negative; otherwise it takes the corresponding value in column C.
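For illustration, with some made-up sample values (not from my actual file), the rule would give:

C: [ 5, -3,  2, -1]
D: [-2,  7, -4,  6]
E: [-2,  7,  2, -1]   # E[0] = D[0] = -2; -2 < 0, so E[1] = D[1] = 7; 7 >= 0, so E[2] = C[2] = 2; 2 >= 0, so E[3] = C[3] = -1

Here is my current implementation: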

import pandas as pd
import numpy as np

data = pd.read_excel('/Users/xxxx/Documents/PY Notebooks/Data/yyyy.xlsx')
data1 = data.copy()

data1['E'] = np.nan
data1.at[0, 'E'] = data1['D'][0]            # first value of E comes from D
l = len(data1)
for i in range(l - 1):
    if data1['E'][i] < 0:                   # previous E is negative -> take D
        data1.at[i + 1, 'E'] = data1['D'][i + 1]
    else:                                   # otherwise -> take C
        data1.at[i + 1, 'E'] = data1['C'][i + 1]
  • This'll be difficult to vectorize in the SIMD sense since each iteration's calculation depends on the previous calculation. Commented Nov 3, 2020 at 17:58
  • are there any assumptions about the data, e.g. D and E cannot both be negative? Commented Nov 3, 2020 at 17:59
  • It would be easier to get an answer if you provide both sample data and expected output. Commented Nov 3, 2020 at 18:45

2 Answers


TL;DR: go to the benchmark code and use Method 1.

Short Answer

No. Vectorization is not possible.

Long Answer

Theorem: For this particular task, the output of a given row cannot be determined from any backward rolling window whose length is smaller than the number of rows up to and including that row.

Thus, there is no way for this output logic to be processed in a vectorized way. (See this answer for an idea of how vectorization is performed in CPUs.) The output can only be computed sequentially from the beginning of the dataframe.

Proof: Consider a target row of a dataframe df. Assume there is a backward rolling window with size n < partial length, so a previous value of df["E"] exists before the window. We denote this previous value by state.

Consider a special case: df["C"] == -1 and df["D"] == 1 within the window.

  • Case 1 (state < 0): The output within this rolling window will be [1, -1, 1, -1, .....], making the last element (-1)^(n-1)
  • Case 2 (state >= 0): The output will be [-1, 1, -1, 1, .....], making the last element (-1)^(n)

Therefore, it is possible for the output df["E"] of the target row to be dependent on a state variable outside the window. QED.
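Below is a minimal sketch of this state dependence, using the special case above (the helper last_in_window and the toy values are purely illustrative):

import numpy as np

# Special case from the proof: C == -1 and D == 1 inside a window of length n
n = 5
C = -np.ones(n)
D = np.ones(n)

def last_in_window(state, C, D):
    # replay the update rule across the window, starting from the value
    # of E just before the window ("state")
    for c, d in zip(C, D):
        state = d if state < 0 else c
    return state

print(last_in_window(-1.0, C, D))  # Case 1 (state < 0):  prints  1.0 == (-1)**(n-1)
print(last_in_window(+1.0, C, D))  # Case 2 (state >= 0): prints -1.0 == (-1)**n

The same window contents thus produce different outputs for the target row depending on a value outside the window, which is exactly what rules out a window-based vectorized formulation.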

Useful Answer

Although vectorization is impossible, that does not mean significant acceleration cannot be achieved. A simple yet very efficient approach is to use a numba-compiled generator to perform the sequential generation. It only requires re-writing your logic as a generator function and adding two additional lines:

import numba

@numba.njit
def my_generator_func():
    ....

Of course, you may have to install numba first. If this is not possible, then using a plain generator without numba optimization is also fine.
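The calling side would then look roughly like this (assuming, as a sketch, that my_generator_func takes the two column arrays; the complete, benchmarked version is given below):

df["E"] = list(my_generator_func(df["C"].values, df["D"].values))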

Benchmark

The benchmark is performed on an i5-8250U (4C8T) laptop with 16GB RAM running 64-bit Debian 10. The Python version is 3.7.9 and pandas is 1.1.3. n = 10^7 (10 million) records are generated for benchmarking purposes.

Result:

1. numba-njit: 2.48s
2. plain generator (no numba): 5.13s
3. original: 271.15s

A more than 100x speedup over the original code can be achieved.

Code

from datetime import datetime
import pandas as pd
import numpy as np

n = 10000000  # a large number of rows
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
#print(df.head())

# ========== Method 1. generator + numba njit ==========
ti = datetime.now()

import numba

@numba.njit
def gen(plus: np.ndarray, minus: np.ndarray):  # plus <- column C, minus <- column D
    l = len(plus)
    assert len(minus) == l
    # first
    state = minus[0]
    yield state
    # second to last
    for i in range(l-1):
        state = minus[i+1] if state < 0 else plus[i+1]
        yield state

df["E"] = [i for i in gen(df["C"].values, df["D"].values)]

tf = datetime.now()
print(f"1. numba-njit: {(tf-ti).total_seconds():.2f}s")  # 1. numba-njit: 0.47s

# ========== Method 2. Generator without numba ==========
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
ti = datetime.now()

def gen_plain(plus: np.ndarray, minus: np.ndarray):  # same logic, without numba
    l = len(plus)
    assert len(minus) == l
    # first
    state = minus[0]
    yield state
    # second to last
    for i in range(l-1):
        state = minus[i+1] if state < 0 else plus[i+1]
        yield state

df["E"] = [i for i in gen_plain(df["C"].values, df["D"].values)]

tf = datetime.now()
print(f"2. plain generator (no numba): {(tf-ti).total_seconds():.2f}s")  #

# ========== Method 3. Direct iteration ==========
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
ti = datetime.now()

# code provided by the OP
df['E']=np.nan
df.at[0,'E'] = df['D'][0]
l=len(df)
for i in range(l - 1):
    if df['E'][i] < 0:
        df.at[i+1,'E'] = df['D'][i+1]
    else:
        df.at[i+1,'E'] = df['C'][i+1]

tf = datetime.now()
print(f"3. original: {(tf-ti).total_seconds():.2f}s") # 2. 26.61s

3 Comments

The proof has nothing to do with the feasibility of vectorization, or is based on the wrong definition of vectorization. Counterexample: by this proof, cumulative sum is not vectorizable.
An alternative perspective may also have a point. But IMHO, if a recursive function like cumsum could, under some definition, be considered vectorized, then any re-implementation that is more efficient than a simple for-loop could also be claimed to be vectorized. Such a definition would have no logical boundary and would not make strict sense. Therefore, I will stick to a strict definition based on whether the result can be computed using vectorized CPU instructions in parallel rather than sequentially. (cont'd)
Such a definition provides a clear criterion against which the feasibility of vectorizing a given algorithm can be examined. From this perspective, cumsum is in fact the best example of an efficient library built-in method that is not vectorized (in the strict sense). I won't go further on this topic. Anyone who disagrees can of course express their opinion freely.

I don't think you can vectorize this operation, as each row depends on the result of the previous one. That being said, there is still quite some room for optimization in your code. Let's first check your original implementation with some random data.

import numpy as np
import pandas as pd
import time

size = 10000000
data = np.random.randint(-2, 10, size=size)
data = data.reshape([size//4, 4])

time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
df['E'] = np.nan
df.at[0, 'E'] = df['D'][0]
for i in range(len(df) - 1):
    if df['E'][i] < 0:
        df.at[i + 1, 'E'] = df['D'][i + 1]
    else:
        df.at[i + 1, 'E'] = df['C'][i + 1]

print(f"Operation on pd df took {time.time() - time_start} seconds.")

Output:

Operation on pd df took 84.00791883468628 seconds.

As element-wise operations on the DataFrame are usually quite slow, we can operate on the underlying numpy arrays instead.

time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
c_vals = df["C"].values
d_vals = df["D"].values

e_vals = [d_vals[0]]
last_e = e_vals[0]
for i in range(len(df)-1):
    if last_e < 0:
        last_e = d_vals[i+1]
    else:
        last_e = c_vals[i+1]
    e_vals.append(last_e)

df['E'] = e_vals

print(f"Operation on np array took {time.time() - time_start} seconds.")

Output:

Operation on np array took 2.2387869358062744 seconds.

Now, since plain for loops are slow in Python, we can use a JIT compiler that handles numpy arrays, for instance numba.

import numba

time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
c_vals = df["C"].values
d_vals = df["D"].values

@numba.jit(nopython=True)
def numba_calc(c_vals, d_vals):
    e_vals = [d_vals[0]]
    last_e = e_vals[0]
    for i in range(len(c_vals)-1):
        if last_e < 0:
            last_e = d_vals[i+1]
        else:
            last_e = c_vals[i+1]
        e_vals.append(last_e)

    return e_vals

df["E"] = numba_calc(c_vals, d_vals)

print(f"Operation on np array with numba took {time.time() - time_start} seconds.")

Output:

Operation on np array with numba took 1.2623450756072998 seconds.

So especially for larger DataFrames, using numba will pay off, while just operating on the raw numpy arrays already gives a nice runtime improvement.
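Note that the numba timing above still includes the one-off JIT compilation triggered by the first call. As a rough sketch (assuming the warm-up slice has the same dtypes, so the compiled signature is reused), compilation can be excluded from the measurement with a warm-up call:

# warm up: trigger JIT compilation on a tiny input with the same dtypes
_ = numba_calc(c_vals[:2], d_vals[:2])

time_start = time.time()
df["E"] = numba_calc(c_vals, d_vals)  # timed without compilation overhead
print(f"Operation with pre-compiled numba took {time.time() - time_start} seconds.")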
