pandas - vectorized code slower than for loop

Question

I have two functions that give the same result: one vectorized and one with a "for" loop. Surprisingly, the for loop is faster than the vectorized version. Any idea why?

def loop_for(df):
    gpd    = df.groupby([pd.TimeGrouper(freq="QS-JAN"), 'CD_PDP'])
    result = []
    for (quarter, unite), data in gpd:
        nb_MAT_RH   = data["MAT_RH"  ].nunique()
        nb_MAT_RHPI = data["MAT_RHPI"].nunique()
        result.append({"CD_PDP": unite, "MOIS_COMPTABLE": quarter, "nb_mat_rh" : nb_MAT_RH, "nb_MAT_RHPI" : nb_MAT_RHPI})

    return pd.DataFrame(result)


def vectorisation(df):
    b = df.groupby([pd.TimeGrouper(freq="QS-JAN"), 'CD_PDP']).apply(
        lambda x: pd.Series({"nb_mat_rh"  : x["MAT_RH"  ].nunique(),
                             "nb_MAT_RHPI": x["MAT_RHPI"].nunique()}))
    return b.reset_index()

When testing:

import timeit
print("loop")
print(timeit.timeit(stmt="loop_for(df)", number=2, setup="from __main__ import loop_for, df"))
print("vector")
print(timeit.timeit(stmt="vectorisation(df)", number=2, setup="from __main__ import vectorisation, df"))

it gives:

loop
6.83789801598
vector
7.13991713524

Using .apply(lambda ...) is not really vectorization; it is essentially the same as running a for loop over the data.
– Alex Riley, Commented Jul 16, 2016 at 20:26
Also, we have no idea what df you're passing to your functions. But more importantly... what @ajcr said.
– piRSquared, Commented Jul 16, 2016 at 21:04
Good to know :) Would vectorization be possible here? And if so, how?
– Romain Jouin, Commented Jul 16, 2016 at 21:54
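
To illustrate the point made in the comments, here is a minimal sketch showing that a function passed to .apply is called back in Python roughly once per group, much like the body of a for loop. The toy DataFrame and the helper name count_unique are assumptions made for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the question, values are made up.
df = pd.DataFrame({
    "CD_PDP": ["A", "A", "B", "B", "C"],
    "MAT_RH": [1, 1, 2, 3, 4],
})

calls = []

def count_unique(group):
    # Record every invocation to see how often pandas calls back into Python.
    calls.append(len(group))
    return group.nunique()

result = df.groupby("CD_PDP")["MAT_RH"].apply(count_unique)
n_groups = df["CD_PDP"].nunique()

# At least one Python-level call per group (some older pandas versions
# evaluate the first group one extra time).
print(len(calls), "Python-level calls for", n_groups, "groups")
print(result)
```

With many small groups, this per-group Python overhead is the same kind of cost the explicit loop pays, which would be consistent with the two timings in the question being so close.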

Alicia Garcia-Raboso · Accepted Answer · 2016-07-17 02:08:03Z

Doing .nunique() on a SeriesGroupBy object does take advantage of vectorization:

grouped = df.groupby([pd.TimeGrouper(freq="QS-JAN"), 'CD_PDP'])

b = grouped.agg({'MAT_RH': 'nunique', 'MAT_RHPI': 'nunique'})
b = b.rename(columns={'MAT_RH': 'nb_mat_rh', 'MAT_RHPI': 'nb_MAT_RHPI'})

But without even a sample of your original df it is impossible to run any benchmarks.
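
Since the original df was never posted, the following is only a sketch on synthetic data: the row count, value ranges, dates and the function name vectorised_agg are all assumptions. It uses pd.Grouper(freq="QS-JAN") in place of pd.TimeGrouper, which was removed in later pandas versions, checks that the loop and the .agg version produce the same counts, and prints timings you can compare on your own machine.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the asker's df (assumed shape and value ranges).
rng = np.random.default_rng(0)
n = 5000
dates = pd.Timestamp("2015-01-01") + pd.to_timedelta(rng.integers(0, 730, size=n), unit="D")
df = pd.DataFrame(
    {
        "CD_PDP": rng.integers(0, 20, size=n),
        "MAT_RH": rng.integers(0, 500, size=n),
        "MAT_RHPI": rng.integers(0, 500, size=n),
    },
    index=pd.DatetimeIndex(dates, name="MOIS_COMPTABLE"),
)

def keys():
    # Fresh grouping keys for each call: quarter starts plus the unit code.
    return [pd.Grouper(freq="QS-JAN"), "CD_PDP"]

def loop_for(df):
    # Explicit Python loop over the groups, as in the question.
    rows = []
    for (quarter, unite), data in df.groupby(keys()):
        rows.append({"MOIS_COMPTABLE": quarter, "CD_PDP": unite,
                     "nb_mat_rh": data["MAT_RH"].nunique(),
                     "nb_MAT_RHPI": data["MAT_RHPI"].nunique()})
    return pd.DataFrame(rows).set_index(["MOIS_COMPTABLE", "CD_PDP"]).sort_index()

def vectorised_agg(df):
    # Let pandas compute nunique per group without a Python-level callback.
    b = df.groupby(keys()).agg({"MAT_RH": "nunique", "MAT_RHPI": "nunique"})
    return b.rename(columns={"MAT_RH": "nb_mat_rh", "MAT_RHPI": "nb_MAT_RHPI"}).sort_index()

looped = loop_for(df)
vectorised = vectorised_agg(df)

# Both approaches should yield identical counts for every (quarter, unit) pair.
same = looped.shape == vectorised.shape and bool((looped.to_numpy() == vectorised.to_numpy()).all())
print("same result:", same)

print("loop:", timeit.timeit(lambda: loop_for(df), number=5))
print("agg :", timeit.timeit(lambda: vectorised_agg(df), number=5))
```

The timings will vary with the number of groups, the group sizes and the pandas version, so treat them as indicative only. Call .reset_index() on the result if you want the group keys back as ordinary columns, as in the question's vectorisation function.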

edited Jul 17, 2016 at 2:08

answered Jul 17, 2016 at 2:01


2 Comments

MaxU - stand with Ukraine Over a year ago

@romainjouin, please consider accepting an answer if it was helpful.

Alicia Garcia-Raboso Over a year ago

@romainjouin if this or any answer has solved your question please consider accepting it by clicking the check-mark. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do this.
