I am working with a DataFrame of almost 1M rows and want to compute a new column as a function of two existing ones. My first idea was to use .apply(axis=1) with a lambda, but it was extremely slow compared to the equivalent vectorized operation.
An example of the task:
import time

import numpy as np
import pandas as pd

# Two integer columns of 100,000 rows each
df = pd.DataFrame({
    "a": np.random.randint(0, 100, 100000),
    "b": np.random.randint(0, 100, 100000),
})

# Row-wise: the lambda is called once per row in Python
start1 = time.perf_counter()
df["sum1"] = df.apply(lambda row: row["a"] + row["b"], axis=1)
print("apply:", time.perf_counter() - start1)

# Vectorized: a single column-wise addition
start2 = time.perf_counter()
df["sum2"] = df["a"] + df["b"]
print("vectorized:", time.perf_counter() - start2)
Is this always the case, or are there circumstances where apply() is more efficient than a vectorized operation? And if I need custom row-wise logic that cannot be expressed as vectorized operations, what is the recommended alternative?
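To make the second question concrete, here is a minimal sketch of the kind of row-wise logic I mean, next to two candidates I am aware of. The rule itself (add when a > b, otherwise multiply) and the helper name rule are made up for illustration, and I am only assuming that np.where and a list comprehension over zip are the usual alternatives; I have not benchmarked them here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.integers(0, 100, 1000),
    "b": rng.integers(0, 100, 1000),
})

# Hypothetical rule, for illustration only: a + b when a > b, otherwise a * b
def rule(a, b):
    return a + b if a > b else a * b

# Candidate 1: row-wise apply, calling the rule once per row
via_apply = df.apply(lambda row: rule(row["a"], row["b"]), axis=1)

# Candidate 2: the same branch expressed column-wise with np.where
via_where = np.where(df["a"] > df["b"], df["a"] + df["b"], df["a"] * df["b"])

# Candidate 3: plain Python loop over the two columns, without building a row object
via_zip = [rule(a, b) for a, b in zip(df["a"].to_numpy(), df["b"].to_numpy())]
```

All three should produce the same values; what I am unsure about is which one is preferred when the branch logic gets too complicated for np.where.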