I am working with a DataFrame of almost 1M rows and want to compute a new column as a function of two existing ones. My first idea was to use .apply(axis=1) with a lambda function, but it was extremely slow compared to the equivalent vectorized operation.

An example of the task:

import pandas as pd
import numpy as np
import time

df = pd.DataFrame({
    "a": np.random.randint(0, 100, 100000),
    "b": np.random.randint(0, 100, 100000)})

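# Row-wise: the lambda is called once per row
start1 = time.perf_counter()
df["sum1"] = df.apply(lambda row: row["a"] + row["b"], axis=1)
print("apply:", time.perf_counter() - start1)

# Vectorized: a single operation on whole columns
start2 = time.perf_counter()
df["sum2"] = df["a"] + df["b"]
print("vectorized:", time.perf_counter() - start2)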

Is this always the case, or are there circumstances in which apply() is more efficient than a vectorized operation? And if I need custom row-wise logic that cannot be expressed as vectorized operations, what is the recommended alternative?
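To make the second part concrete, here is a minimal sketch of the kind of custom logic I have in mind, together with two alternatives I have seen suggested. The function custom_rule and the column names vec, loop and applied are made up for illustration, and I have not established which option is fastest in general:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.integers(0, 100, 1_000),
    "b": rng.integers(0, 100, 1_000),
})

def custom_rule(a, b):
    # Hypothetical row-wise rule, used only for illustration
    return a - b if a > b else a + b

# Option 1: express the branch with np.where so it stays vectorized
df["vec"] = np.where(df["a"] > df["b"], df["a"] - df["b"], df["a"] + df["b"])

# Option 2: plain Python loop over the two columns, without building
# a Series object for every row
df["loop"] = [custom_rule(a, b) for a, b in zip(df["a"], df["b"])]

# Baseline: the apply(axis=1) version from the question
df["applied"] = df.apply(lambda row: custom_rule(row["a"], row["b"]), axis=1)

# All three approaches should produce the same values
assert (df["vec"] == df["loop"]).all()
assert (df["vec"] == df["applied"]).all()
```

Option 1 only works when the rule can be rewritten as array expressions; option 2 accepts any Python function but still loops in Python.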

  • We have a bunch of existing questions on this topic. I would start with cs95's answer on "How can I iterate over rows in a Pandas DataFrame?" and go from there. If you don't find a satisfactory/understandable answer, you can edit to say what you found. BTW, check out How to Ask, which has tips like starting with your own research and how to write a good title. Commented Sep 12 at 16:45
  • To find more questions, try googling site:stackoverflow.com is apply always slower than vectorized in pandas Commented Sep 12 at 16:46
