
Suppose that I have a dataframe which can be created using the code below:

import pandas as pd

df = pd.DataFrame(data = {'date':['2021-01-01', '2021-01-02', '2021-01-05','2021-01-02', '2021-01-03', '2021-01-05'],
                          'product':['A', 'A', 'A', 'B', 'B', 'B'],
                          'price':[10, 20, 30, 40, 50, 60]
                         }
                 )
df['date'] = pd.to_datetime(df['date'])

I want to create an empty dataframe, let's say main_df, which will contain all dates between df.date.min() and df.date.max() for each product. On days where the value is NaN I want to ffill, and then bfill for the remaining gaps. The resulting dataframe would be as below:

+------------+---------+-------+
|    date    | product | price |
+------------+---------+-------+
| 2021-01-01 | A       |    10 |
| 2021-01-02 | A       |    20 |
| 2021-01-03 | A       |    20 |
| 2021-01-04 | A       |    20 |
| 2021-01-05 | A       |    30 |
| 2021-01-01 | B       |    40 |
| 2021-01-02 | B       |    40 |
| 2021-01-03 | B       |    50 |
| 2021-01-04 | B       |    50 |
| 2021-01-05 | B       |    60 |
+------------+---------+-------+
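
For reference, a minimal sketch of one way to build the main_df scaffold described above: create the full product × date grid, left-merge the original data onto it, and fill per product. This assumes pandas is imported as pd and df is defined as in the question; all_dates and the final column order are choices made here, not requirements.

import pandas as pd

# All days between the global min and max dates.
all_dates = pd.date_range(df['date'].min(), df['date'].max(), freq='D')

# Scaffold: every (product, date) combination.
main_df = pd.MultiIndex.from_product(
    [df['product'].unique(), all_dates], names=['product', 'date']
).to_frame(index=False)

# Attach the known prices, then ffill/bfill within each product.
main_df = main_df.merge(df, on=['product', 'date'], how='left')
main_df['price'] = main_df.groupby('product')['price'].transform(
    lambda s: s.ffill().bfill()
)
main_df = main_df[['date', 'product', 'price']]
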
  • You would like to only keep dates that are between the max and min dates of each group, so how come they are still included in your desired output? Commented Nov 19, 2022 at 11:14
  • It is your assumption that I want to keep the min and max of each group. As stated in the OP, df.date.min() is the global min and, similarly, df.date.max() is the global max. Commented Nov 19, 2022 at 11:19
  • Does df.date.min() compute the min date for each group? Commented Nov 19, 2022 at 11:21
  • "Between" means the boundary values are inclusive, not exclusive. The resulting output is shown to make clear what is required, not to argue. Commented Nov 19, 2022 at 11:27
  • Let us continue this discussion in chat. Commented Nov 19, 2022 at 11:28

2 Answers


First

Make a pivot table, upsample to a daily frequency with asfreq('D'), and fill the NaNs with ffill and bfill:

df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()

output:

product     A       B
date        
2021-01-01  10.0    40.0
2021-01-02  20.0    40.0
2021-01-03  20.0    50.0
2021-01-04  20.0    50.0
2021-01-05  30.0    60.0



Second

Stack the result back into long format, rename the value column, and sort by product (full code included):

(df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()
 .stack().reset_index().rename(columns={0:'price'})
 .sort_values('product').reset_index(drop=True))

output:

    date        product price
0   2021-01-01  A       10.0
1   2021-01-02  A       20.0
2   2021-01-03  A       20.0
3   2021-01-04  A       20.0
4   2021-01-05  A       30.0
5   2021-01-01  B       40.0
6   2021-01-02  B       40.0
7   2021-01-03  B       50.0
8   2021-01-04  B       50.0
9   2021-01-05  B       60.0
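
A small caveat, not part of the original answer: asfreq('D') introduces NaNs before they are filled, so price comes back as float. If integer prices are wanted, the column can be cast once the fills are done; out below is just a hypothetical name for the frame built above:

out = (df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()
       .stack().reset_index().rename(columns={0: 'price'})
       .sort_values('product').reset_index(drop=True))
out['price'] = out['price'].astype(int)  # safe here: ffill/bfill left no NaNs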



Using resample

import pandas as pd

df = pd.DataFrame(data = {'date':['2021-01-01', '2021-01-02', '2021-01-05','2021-01-02', '2021-01-03', '2021-01-06'],
                          'product':['A', 'A', 'A', 'B', 'B', 'B'],
                          'price':[10, 20, 30, 40, 50, 60]
                         }
                 )
df['date'] = pd.to_datetime(df['date'])

df
# Out: 
#          date product  price
# 0  2021-01-01       A     10
# 1  2021-01-02       A     20
# 2  2021-01-05       A     30
# 3  2021-01-02       B     40
# 4  2021-01-03       B     50
# 5  2021-01-06       B     60




df.set_index("date").groupby("product")["price"].resample("d").ffill().reset_index()
# Out: 
#   product       date  price
# 0       A 2021-01-01     10
# 1       A 2021-01-02     20
# 2       A 2021-01-03     20
# 3       A 2021-01-04     20
# 4       A 2021-01-05     30
# 5       B 2021-01-02     40
# 6       B 2021-01-03     50
# 7       B 2021-01-04     50
# 8       B 2021-01-05     50
# 9       B 2021-01-06     60

See the rows that have been filled by ffill:

df.set_index("date").groupby("product")["price"].resample("d").mean()
# Out: 
# product  date      
# A        2021-01-01    10.0
#          2021-01-02    20.0
#          2021-01-03     NaN
#          2021-01-04     NaN
#          2021-01-05    30.0
# B        2021-01-02    40.0
#          2021-01-03    50.0
#          2021-01-04     NaN
#          2021-01-05     NaN
#          2021-01-06    60.0
# Name: price, dtype: float64

Note that by grouping by product before resampling and filling the empty slots, you can have different ranges (from min to max) for each product (I modified the data to showcase this).
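
If the single global range from the question is wanted instead (every product spanning df.date.min() through df.date.max()), one option is to reindex each group against a shared daily range before filling. A minimal sketch, assuming the df defined just above; full_range and out are names introduced here, not part of the original answer:

full_range = pd.date_range(df['date'].min(), df['date'].max(), freq='D')

out = (df.set_index('date')
         .groupby('product')['price']
         .apply(lambda s: s.reindex(full_range).ffill().bfill())  # per-product fill over the shared range
         .rename_axis(['product', 'date'])
         .rename('price')
         .reset_index())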

