
I have a dataframe with ~3 million rows, which looks like:

   date        size  price
0  2018-08-01  100   220
1  2018-08-01  110   245
2  2018-08-01  125   250
3  2018-08-02  110   210
4  2018-08-02  120   230
5  2018-08-02  150   260
6  2018-08-03  115   200

Each row is the transaction of an item: we have the date of the transaction, and the size and price of the item.

Now I would like to add a column called avg_price, such that the avg_price of a transaction/row is the average price of the k transactions on the previous day whose sizes are closest to this one's (very similar to the idea of k-nearest neighbors).

For example, when k = 2, the avg_price of the last row above should be (210+230)/2 = 220, because the 2 closest transactions on the previous day have sizes 110 and 120, with corresponding prices 210 and 230.

The expected output should be:

   date        size  price avg_price
0  2018-08-01  100   220   NA
1  2018-08-01  110   245   NA
2  2018-08-01  125   250   NA
3  2018-08-02  110   210   (220+245)/2
4  2018-08-02  120   230   (245+250)/2
5  2018-08-02  150   260   (245+250)/2
6  2018-08-03  115   200   (210+230)/2

I wrote a for loop that iterates over the rows: for each row, it picks out all transactions on the previous day, sorts them by difference in size, and averages the prices of the first k items. As expected, this is extremely slow. Could anyone point out a more "vectorized" approach? Thanks.

Update: the number of transactions per day is not fixed; it is around ~300.
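For reference, the slow row-by-row version described above might look roughly like this (a sketch; the function name `knn_avg_price` is mine, and it assumes "previous day" means the previous calendar day, as in the sample data):

```python
import numpy as np
import pandas as pd

k = 2
df = pd.DataFrame({
    'date': pd.to_datetime(['2018-08-01'] * 3 + ['2018-08-02'] * 3 + ['2018-08-03']),
    'size': [100, 110, 125, 110, 120, 150, 115],
    'price': [220, 245, 250, 210, 230, 260, 200],
})

def knn_avg_price(df, k):
    """For each row, average the prices of the k previous-day
    transactions whose sizes are closest to this row's size."""
    avg = []
    for _, row in df.iterrows():
        prev = df[df['date'] == row['date'] - pd.Timedelta(days=1)]
        if len(prev) < k:
            avg.append(np.nan)
            continue
        # indices of the k previous-day rows with the smallest size gap
        nearest = (prev['size'] - row['size']).abs().nsmallest(k)
        avg.append(prev.loc[nearest.index, 'price'].mean())
    return avg

df['avg_price'] = knn_avg_price(df, k)
```

This is O(rows × rows-per-day) with Python-level overhead per row, which is why it crawls on 3 million rows.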

  • Why is the last row not (210+245)/2, given that there are two transactions with size 110? Commented Aug 6, 2018 at 0:28
  • @RafaelC the other 110 is not on the previous day (8/2); it is on 8/1. The closest sizes to 115 on 2018-08-02 are 110 and 120 (prices 210 and 230). Commented Aug 6, 2018 at 0:38
  • @RafaelC Because we only average over the previous day. Commented Aug 6, 2018 at 0:39
  • Do you have a fixed number of rows per day? For example, is it always 3 rows per date? Commented Aug 6, 2018 at 0:44
  • @RafaelC unfortunately no. Each day has ~300 rows. Commented Aug 6, 2018 at 0:47

2 Answers


I call the original dataframe dfa. First, create the data you will need in dfb for a later merge_asof:

k = 2  # should work for any k
dfb = dfa.copy()
dfb = dfb.sort_values(['date', 'size'])  # dfa needs this sorting too
# rolling mean of price over k consecutive sizes within each day
dfb['avg_price'] = dfb.groupby('date').price.rolling(k).mean().values
# rolling mean of size, used to look up the k nearest sizes in merge_asof
dfb['size'] = dfb.groupby('date')['size'].rolling(k).mean().values
# add one business day to shift all the dates forward
dfb['date'] = dfb['date'] + pd.tseries.offsets.BDay()
dfb = dfb.dropna().drop(columns='price')  # drop('price', 1) no longer works in pandas 2.0+
dfb['size'] = dfb['size'].astype(int)  # merge_asof needs matching dtypes
print(dfb)
print (dfb)

        date   size  avg_price
1 2018-08-02    105      232.5
2 2018-08-02    117      247.5
4 2018-08-03    115      220.0
5 2018-08-03    135      245.0

Now you can use merge_asof, matching by date and on the nearest size (the sort_values calls are required by the method):

dfa = (pd.merge_asof(dfa.sort_values('size'), dfb.sort_values('size'), 
                     on='size',by='date',direction='nearest')
         .sort_values(['date','size']).reset_index(drop=True))

and the result for dfa is:

        date  price  size  avg_price
0 2018-08-01    220   100        NaN
1 2018-08-01    245   110        NaN
2 2018-08-01    250   125        NaN
3 2018-08-02    210   110      232.5
4 2018-08-02    230   120      247.5
5 2018-08-02    260   150      247.5
6 2018-08-03    200   115      220.0
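Putting the steps above together into one runnable script, using the toy frame from the question (with `drop(columns=...)`, since the positional axis argument was removed in pandas 2.0):

```python
import pandas as pd

k = 2
dfa = pd.DataFrame({
    'date': pd.to_datetime(['2018-08-01'] * 3 + ['2018-08-02'] * 3 + ['2018-08-03']),
    'size': [100, 110, 125, 110, 120, 150, 115],
    'price': [220, 245, 250, 210, 230, 260, 200],
}).sort_values(['date', 'size'])

# build the lookup frame: k-rolling means of price and size per day,
# shifted one business day forward
dfb = dfa.copy()
dfb['avg_price'] = dfb.groupby('date')['price'].rolling(k).mean().values
dfb['size'] = dfb.groupby('date')['size'].rolling(k).mean().values
dfb['date'] = dfb['date'] + pd.tseries.offsets.BDay()
dfb = dfb.dropna().drop(columns='price')
dfb['size'] = dfb['size'].astype(int)

# match each transaction to the nearest rolling-mean size of the previous day
dfa = (pd.merge_asof(dfa.sort_values('size'), dfb.sort_values('size'),
                     on='size', by='date', direction='nearest')
         .sort_values(['date', 'size']).reset_index(drop=True))
print(dfa)
```

Note that this is an approximation of exact k-nearest-neighbors: it precomputes rolling windows of k consecutive sizes and snaps each row to the nearest window midpoint, which matches the expected output here but need not for every data set.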



I am not sure what your expected output is, but if you want the mean price of the closest sizes on dates that have more than one transaction, you could do something like this. If you are looking for something else, please provide an expected output:

df = pd.read_clipboard()

# find the per-day diff on the size column and backfill the NaN values
df['diff'] = df.groupby('date')['size'].diff().bfill()

# group by date and use the lambda function to find the min diff
df2 = df.groupby(['date']).apply(lambda x: x[x['diff'] == x['diff'].min()])

# find the mean of price
df2.groupby('date')['price'].mean()

date
2018-08-01    232.5
2018-08-02    220.0
Name: price, dtype: float64
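The same idea as a self-contained sketch (my substitutions: an inline frame instead of read_clipboard, .bfill() instead of the deprecated fillna(method='bfill'), and transform instead of the groupby apply, to keep a flat index):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2018-08-01'] * 3 + ['2018-08-02'] * 3 + ['2018-08-03'],
    'size': [100, 110, 125, 110, 120, 150, 115],
    'price': [220, 245, 250, 210, 230, 260, 200],
})

# per-day gap to the previous size; backfill so the first row of each
# day shares the gap of its neighbour
df['diff'] = df.groupby('date')['size'].diff().bfill()

# keep only the rows forming the smallest gap on each day, then average
closest = df[df['diff'] == df.groupby('date')['diff'].transform('min')]
result = closest.groupby('date')['price'].mean()
print(result)
```

Note that this answers a different question than the one asked: it averages within each day rather than looking back at the previous day, which is why 2018-08-03 (a single-transaction day) drops out entirely.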

