Python pandas: how to vectorize this function

Question

I have two DataFrames df and evol as follows (simplified for the example):

In[6]: df
Out[6]:
   data  year_final  year_init
0    12        2023       2012
1    34        2034       2015
2     9        2019       2013
...

In[7]: evol
Out[7]: 
      evolution
year           
2000   1.474946
2001   1.473874
2002   1.079157
...
2037   1.463840
2038   1.980807
2039   1.726468

I would like to perform the following operation in a vectorized way (the current for-loop implementation is far too slow when I have gigabytes of data):

for index, row in df.iterrows():
    for year in range(row['year_init'], row['year_final']):
        factor = evol.at[year, 'evolution']
        df.at[index, 'data'] += df.at[index, 'data'] * factor

The complexity comes from the fact that the range of years is not the same for each row. In the example above, the output would be:

        data  year_final  year_init
0     163673        2023       2012
1  594596046        2034       2015
2       1277        2019       2013

(Full evol DataFrame, for testing purposes:)

      evolution
year           
2000   1.474946
2001   1.473874
2002   1.079157
2003   1.876762
2004   1.541348
2005   1.581923
2006   1.869508
2007   1.289033
2008   1.924791
2009   1.527834
2010   1.762448
2011   1.554491
2012   1.927348
2013   1.058588
2014   1.729124
2015   1.025824
2016   1.117728
2017   1.261009
2018   1.705705
2019   1.178354
2020   1.158688
2021   1.904780
2022   1.332230
2023   1.807508
2024   1.779713
2025   1.558423
2026   1.234135
2027   1.574954
2028   1.170016
2029   1.767164
2030   1.995633
2031   1.222417
2032   1.165851
2033   1.136498
2034   1.745103
2035   1.018893
2036   1.813705
2037   1.463840
2038   1.980807
2039   1.726468

This is really complicated to vectorize with pandas alone, so I added the numpy tag. numba for speed.
– Bharath M Shetty, Commented Sep 18, 2017 at 14:38

chrisb · Accepted Answer · 2017-09-18 17:40:01Z

One vectorization approach using only pandas is to do a cartesian join between the two frames and then subset. It would start out like this:

df['dummy'] = 1
evol['dummy'] = 1
# reset_index() keeps 'year' as a column; merging on a column drops the index
combined = df.merge(evol.reset_index(), on='dummy')
# filter date ranges, multiply etc

This will likely be faster than what you are doing, but it is memory-inefficient (the joined frame has len(df) * len(evol) rows) and might blow up on your real data.
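To make the "filter date ranges, multiply etc" step concrete, here is a minimal sketch of how the join could be completed. The function name `grow_cross_join` and the helper columns `row` and `dummy` are hypothetical, it assumes `df` has a unique index and `evol` has an index named `year`, and it works in floats, so it does not reproduce the integer truncation of the original loop (the numbers will differ from the outputs shown in the question).

```python
import pandas as pd

# Evolution factors for 2000..2039, copied from the question.
EVOLUTION = [
    1.474946, 1.473874, 1.079157, 1.876762, 1.541348, 1.581923, 1.869508, 1.289033,
    1.924791, 1.527834, 1.762448, 1.554491, 1.927348, 1.058588, 1.729124, 1.025824,
    1.117728, 1.261009, 1.705705, 1.178354, 1.158688, 1.904780, 1.332230, 1.807508,
    1.779713, 1.558423, 1.234135, 1.574954, 1.170016, 1.767164, 1.995633, 1.222417,
    1.165851, 1.136498, 1.745103, 1.018893, 1.813705, 1.463840, 1.980807, 1.726468,
]

df = pd.DataFrame({
    "data": [12.0, 34.0, 9.0],  # floats: no per-step integer truncation
    "year_final": [2023, 2034, 2019],
    "year_init": [2012, 2015, 2013],
})
evol = pd.DataFrame({"evolution": EVOLUTION},
                    index=pd.Index(range(2000, 2040), name="year"))


def grow_cross_join(df, evol):
    """Apply data *= (1 + evolution) for every year in [year_init, year_final)."""
    left = df[["year_init", "year_final"]].copy()
    left["row"] = df.index          # remember which original row each pair belongs to
    left["dummy"] = 1
    right = evol.reset_index()      # turn the 'year' index into a column
    right["dummy"] = 1

    # Cartesian join: one line per (row, year) pair.
    combined = left.merge(right, on="dummy")

    # Keep only the years that fall inside each row's half-open range.
    mask = (combined["year"] >= combined["year_init"]) & \
           (combined["year"] < combined["year_final"])
    in_range = combined[mask]

    # data += data * factor is the same as data *= (1 + factor),
    # so the total growth per row is the product of (1 + factor).
    growth = (1.0 + in_range["evolution"]).groupby(in_range["row"]).prod()

    # Rows with an empty year range do not appear in the groupby result.
    growth = growth.reindex(df.index, fill_value=1.0)
    return df["data"].astype(float) * growth


result = grow_cross_join(df, evol)
```

The intermediate `combined` frame is where the memory cost shows up, so this is only practical when the product of the two frame lengths fits comfortably in RAM.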

If you can take on the numba dependency, something like this should be very fast; it is essentially a compiled version of what you are doing now. Something similar would be possible in Cython as well. Note that this requires the evol DataFrame to be sorted and contiguous by year; that requirement could be relaxed with some modification.

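import numba
import numpy as np

@numba.njit
def f(data, year_final, year_init, evol_year, evol_factor):
    data = data.copy()  # leave the caller's array untouched
    for i in range(len(data)):
        # position of the first year to apply (evol_year must be sorted)
        year_pos = np.searchsorted(evol_year, year_init[i])
        n_years = year_final[i] - year_init[i]
        for offset in range(n_years):
            # relies on evol_year being contiguous: one entry per year
            data[i] += data[i] * evol_factor[year_pos + offset]
    return data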

f(df['data'].values, df['year_final'].values, df['year_init'].values, evol.index.values, evol['evolution'].values)
Out[24]: array([   163673, 594596044,      1277], dtype=int64)
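Since the compiled function relies on the years being sorted and contiguous, it may be worth validating the inputs before calling it. The helper below is a hypothetical sketch (the name `check_evol_years` is not from the answer); it also checks that every row's year range falls inside the table, because numba does not bounds-check array indexing by default.

```python
import numpy as np


def check_evol_years(evol_year, year_init, year_final):
    """Raise ValueError if the evol table cannot serve the requested year ranges."""
    evol_year = np.asarray(evol_year)
    year_init = np.asarray(year_init)
    year_final = np.asarray(year_final)

    # Sorted and contiguous means every step between neighbours is exactly 1.
    if evol_year.size == 0 or not np.all(np.diff(evol_year) == 1):
        raise ValueError("evol years must be sorted and contiguous")

    # Only rows with a non-empty range [year_init, year_final) read from evol.
    needs_lookup = year_final > year_init
    if needs_lookup.any():
        first = year_init[needs_lookup].min()
        last = year_final[needs_lookup].max() - 1
        if first < evol_year[0] or last > evol_year[-1]:
            raise ValueError("a row's year range falls outside the evol table")
```

A call such as `check_evol_years(evol.index.values, df['year_init'].values, df['year_final'].values)` placed just before `f(...)` would turn silent out-of-range reads into an explicit error.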

Edit: Some timings with your test data

In [25]: %timeit f(df['data'].values, df['year_final'].values, df['year_init'].values, evol.index.values, evol['evolution'].values)
15.6 µs ± 338 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [26]: %%time
    ...: for index, row in df.iterrows():
    ...:     for year in range(row['year_init'], row['year_final']):
    ...:         factor = evol.at[year, 'evolution']
    ...:         df.at[index, 'data'] += df.at[index, 'data'] * factor
Wall time: 3 ms
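If the numba dependency is not an option, one possible NumPy-only alternative is a cumulative product: because `data += data * factor` equals `data *= (1 + factor)`, the growth over a range of years is a ratio of two cumulative products. This is a sketch under assumptions, not part of the timings above: the name `grow_cumprod` is hypothetical, the years must be sorted and contiguous as before, and the computation is done in floats, so it will not match the integer-truncated `Out[24]` values exactly.

```python
import numpy as np

# Evolution factors for 2000..2039, copied from the question.
evol_year = np.arange(2000, 2040)
evol_factor = np.array([
    1.474946, 1.473874, 1.079157, 1.876762, 1.541348, 1.581923, 1.869508, 1.289033,
    1.924791, 1.527834, 1.762448, 1.554491, 1.927348, 1.058588, 1.729124, 1.025824,
    1.117728, 1.261009, 1.705705, 1.178354, 1.158688, 1.904780, 1.332230, 1.807508,
    1.779713, 1.558423, 1.234135, 1.574954, 1.170016, 1.767164, 1.995633, 1.222417,
    1.165851, 1.136498, 1.745103, 1.018893, 1.813705, 1.463840, 1.980807, 1.726468,
])

data = np.array([12.0, 34.0, 9.0])
year_init = np.array([2012, 2015, 2013])
year_final = np.array([2023, 2034, 2019])


def grow_cumprod(data, year_init, year_final, evol_year, evol_factor):
    """Float version of the loop: data * prod(1 + factor) over [year_init, year_final)."""
    data = np.asarray(data, dtype=float)
    year_init = np.asarray(year_init)
    year_final = np.asarray(year_final)

    # cum[k] is the product of (1 + factor) over the first k years; cum[0] == 1.
    cum = np.concatenate(([1.0], np.cumprod(1.0 + np.asarray(evol_factor, dtype=float))))

    # Position of each row's first year, and one past its last year.
    start = np.searchsorted(evol_year, year_init)
    stop = start + (year_final - year_init)

    # The ratio cancels every factor before `start`, leaving start..stop-1.
    return data * cum[stop] / cum[start]


result = grow_cumprod(data, year_init, year_final, evol_year, evol_factor)
```

With a much longer factor table the running product could overflow a float; summing `np.log1p(evol_factor)` with `np.cumsum` and exponentiating the difference would be one way around that.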

edited Sep 18, 2017 at 17:40

answered Sep 18, 2017 at 17:20

chrisb

2 Comments

Bharath M Shetty Over a year ago

would you mind adding the timings, that would help a lot to find the difference.

Prikers Over a year ago

Indeed, it seems that for this case it is much better to use numba! Thanks for that, I was not able to find any efficient way to vectorize this otherwise...
