
I have a dataframe with ~3 million rows, which looks like:

   date        size  price
0  2018-08-01  100   220
1  2018-08-01  110   245
2  2018-08-01  125   250
3  2018-08-02  110   210
4  2018-08-02  120   230
5  2018-08-02  150   260
6  2018-08-03  115   200

Each row is the transaction of an item: we have the date of the transaction, and the size and price of the item.

Now I would like to add a column called avg_price, such that the avg_price of a transaction/row is the average price of the k transactions on the previous day whose sizes are closest to this one's (very similar to the idea of k-nearest neighbors).

For example, when k = 2, the avg_price of the last row above should be (210+230)/2 = 220, because the 2 closest transactions on the previous day have sizes 110 and 120, with corresponding prices 210 and 230.

The expected output should be:

   date        size  price avg_price
0  2018-08-01  100   220   NA
1  2018-08-01  110   245   NA
2  2018-08-01  125   250   NA
3  2018-08-02  110   210   (220+245)/2
4  2018-08-02  120   230   (245+250)/2
5  2018-08-02  150   260   (245+250)/2
6  2018-08-03  115   200   (210+230)/2

I wrote a for loop that iterates over the rows: for each row, it picks out all transactions on the previous day, sorts them by difference in size, and averages the prices of the first k items. As expected, this is extremely slow. Could anyone point out a more "vectorized" approach? Thanks.

Update: the number of transactions per day is not fixed; it is around ~300.
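For reference, the slow row-by-row version described above might look roughly like this (a sketch; the function name `knn_avg_price` is mine, and it assumes "previous day" means the previous calendar day, as in the sample data):

```python
import numpy as np
import pandas as pd

k = 2
df = pd.DataFrame({
    'date': pd.to_datetime(['2018-08-01'] * 3 + ['2018-08-02'] * 3 + ['2018-08-03']),
    'size': [100, 110, 125, 110, 120, 150, 115],
    'price': [220, 245, 250, 210, 230, 260, 200],
})

def knn_avg_price(df, k):
    """For each row, average the prices of the k previous-day
    transactions whose sizes are closest to this row's size."""
    avg = []
    for _, row in df.iterrows():
        prev = df[df['date'] == row['date'] - pd.Timedelta(days=1)]
        if len(prev) < k:
            avg.append(np.nan)
            continue
        # indices of the k previous-day rows with the smallest size gap
        nearest = (prev['size'] - row['size']).abs().nsmallest(k)
        avg.append(prev.loc[nearest.index, 'price'].mean())
    return avg

df['avg_price'] = knn_avg_price(df, k)
```

This is O(rows × rows-per-day) with Python-level overhead per row, which is why it crawls on 3 million rows.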

  • Why is the last row not (210+245)/2, given that there are two transactions with size 110? Commented Aug 6, 2018 at 0:28
  • @RafaelC the other 110 is not on the previous day (8/2); it is on 8/1. The closest sizes to 115 on 2018-08-02 are 110 and 120 (prices 210 and 230). Commented Aug 6, 2018 at 0:38
  • @RafaelC Because we only average over the previous day. Commented Aug 6, 2018 at 0:39
  • Do you have a fixed number of rows per day? For example, is it always 3 rows per date? Commented Aug 6, 2018 at 0:44
  • @RafaelC unfortunately no. Each day has ~300 rows. Commented Aug 6, 2018 at 0:47

2 Answers


I call the original dataframe dfa. First, create the data you will need in dfb for a later merge_asof:

k = 2  # should work for any k
dfb = dfa.copy()
dfb = dfb.sort_values(['date', 'size'])  # dfa needs this sorting too
# rolling mean of price over k consecutive sizes within each day
dfb['avg_price'] = dfb.groupby('date').price.rolling(k).mean().values
# rolling mean of size, used to look up the k nearest sizes in merge_asof
dfb['size'] = dfb.groupby('date')['size'].rolling(k).mean().values
# add one business day to shift all the dates forward
dfb['date'] = dfb['date'] + pd.tseries.offsets.BDay()
dfb = dfb.dropna().drop(columns='price')  # drop('price', 1) no longer works in pandas 2.0+
dfb['size'] = dfb['size'].astype(int)  # merge_asof needs matching dtypes
print(dfb)
print (dfb)

        date   size  avg_price
1 2018-08-02    105      232.5
2 2018-08-02    117      247.5
4 2018-08-03    115      220.0
5 2018-08-03    135      245.0

Now you can use merge_asof, matching by date and on the nearest size (the sort_values calls are required by the method):

dfa = (pd.merge_asof(dfa.sort_values('size'), dfb.sort_values('size'), 
                     on='size',by='date',direction='nearest')
         .sort_values(['date','size']).reset_index(drop=True))

and the result for dfa is:

        date  price  size  avg_price
0 2018-08-01    220   100        NaN
1 2018-08-01    245   110        NaN
2 2018-08-01    250   125        NaN
3 2018-08-02    210   110      232.5
4 2018-08-02    230   120      247.5
5 2018-08-02    260   150      247.5
6 2018-08-03    200   115      220.0
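Putting the steps above together into one runnable script, using the toy frame from the question (with `drop(columns=...)`, since the positional axis argument was removed in pandas 2.0):

```python
import pandas as pd

k = 2
dfa = pd.DataFrame({
    'date': pd.to_datetime(['2018-08-01'] * 3 + ['2018-08-02'] * 3 + ['2018-08-03']),
    'size': [100, 110, 125, 110, 120, 150, 115],
    'price': [220, 245, 250, 210, 230, 260, 200],
}).sort_values(['date', 'size'])

# build the lookup frame: k-rolling means of price and size per day,
# shifted one business day forward
dfb = dfa.copy()
dfb['avg_price'] = dfb.groupby('date')['price'].rolling(k).mean().values
dfb['size'] = dfb.groupby('date')['size'].rolling(k).mean().values
dfb['date'] = dfb['date'] + pd.tseries.offsets.BDay()
dfb = dfb.dropna().drop(columns='price')
dfb['size'] = dfb['size'].astype(int)

# match each transaction to the nearest rolling-mean size of the previous day
dfa = (pd.merge_asof(dfa.sort_values('size'), dfb.sort_values('size'),
                     on='size', by='date', direction='nearest')
         .sort_values(['date', 'size']).reset_index(drop=True))
print(dfa)
```

Note that this is an approximation of exact k-nearest-neighbors: it precomputes rolling windows of k consecutive sizes and snaps each row to the nearest window midpoint, which matches the expected output here but need not for every data set.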



I am not sure what your expected output is, but if you want the mean price of the closest sizes on dates that have more than one transaction, you could do something like this. If you are looking for something else, please provide an expected output:

df = pd.read_clipboard()

# find the per-day diff on the size column and backfill the NaN values
df['diff'] = df.groupby('date')['size'].diff().bfill()

# group by date and use the lambda function to find the min diff
df2 = df.groupby(['date']).apply(lambda x: x[x['diff'] == x['diff'].min()])

# find the mean of price
df2.groupby('date')['price'].mean()

date
2018-08-01    232.5
2018-08-02    220.0
Name: price, dtype: float64
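The same idea as a self-contained sketch (my substitutions: an inline frame instead of read_clipboard, .bfill() instead of the deprecated fillna(method='bfill'), and transform instead of the groupby apply, to keep a flat index):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2018-08-01'] * 3 + ['2018-08-02'] * 3 + ['2018-08-03'],
    'size': [100, 110, 125, 110, 120, 150, 115],
    'price': [220, 245, 250, 210, 230, 260, 200],
})

# per-day gap to the previous size; backfill so the first row of each
# day shares the gap of its neighbour
df['diff'] = df.groupby('date')['size'].diff().bfill()

# keep only the rows forming the smallest gap on each day, then average
closest = df[df['diff'] == df.groupby('date')['diff'].transform('min')]
result = closest.groupby('date')['price'].mean()
print(result)
```

Note that this answers a different question than the one asked: it averages within each day rather than looking back at the previous day, which is why 2018-08-03 (a single-transaction day) drops out entirely.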

