I have a dataframe with ~3 million rows, which looks like:
date size price
0 2018-08-01 100 220
1 2018-08-01 110 245
2 2018-08-01 125 250
3 2018-08-02 110 210
4 2018-08-02 120 230
5 2018-08-02 150 260
6 2018-08-03 115 200
Each row is a transaction of an item: we have the date of the transaction, and the size and price of the item.
Now I would like to add a column called avg_price, where the avg_price of a transaction/row is the average price of the k transactions from the previous day whose sizes are closest to this one's (very similar to the idea of k nearest neighbors).
For example, when k = 2, the avg_price of the last row above should be (210+230)/2=220 because the 2 closest transactions come with sizes 110 and 120, with corresponding prices 210 and 230.
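To make that example concrete, here is the computation for the last row written out in plain numpy (just an illustration for one row; the array names are made up for the sketch):

```python
import numpy as np

# Transactions on the previous day (2018-08-02): sizes and prices
prev_sizes = np.array([110, 120, 150])
prev_prices = np.array([210, 230, 260])

size = 115  # size of the current row (2018-08-03)
k = 2

# Indices of the k previous-day transactions with the closest sizes
idx = np.argsort(np.abs(prev_sizes - size))[:k]
avg_price = prev_prices[idx].mean()
print(avg_price)  # 220.0
```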
The expected output should be:
date size price avg_price
0 2018-08-01 100 220 NA
1 2018-08-01 110 245 NA
2 2018-08-01 125 250 NA
3 2018-08-02 110 210 (220+245)/2
4 2018-08-02 120 230 (245+250)/2
5 2018-08-02 150 260 (245+250)/2
6 2018-08-03 115 200 (210+230)/2
I wrote a for loop that iterates over the rows: for each row, it first picks out all transactions from the previous day, then sorts them by difference in size and averages the prices of the first k items. However, as expected, this is extremely slow. Could anyone point out a more "vectorized" approach? Thanks.
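For reference, my loop looks roughly like this (a sketch, assuming "last day" means the previous calendar day, as in the sample data):

```python
import pandas as pd

def add_avg_price(df, k=2):
    """Per-row loop: for each transaction, average the prices of the
    k previous-day transactions with the closest sizes."""
    df = df.copy()
    df['date'] = pd.to_datetime(df['date'])
    avg = []
    for _, row in df.iterrows():
        # All transactions from the previous calendar day
        prev = df[df['date'] == row['date'] - pd.Timedelta(days=1)]
        if prev.empty:
            avg.append(float('nan'))
            continue
        # Labels of the k rows with the smallest size difference
        closest = (prev['size'] - row['size']).abs().nsmallest(k).index
        avg.append(prev.loc[closest, 'price'].mean())
    df['avg_price'] = avg
    return df
```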
Update: the number of transactions per day is not fixed; it is around 300.