
Suppose that I have a dataframe which can be created using the code below:

import pandas as pd

df = pd.DataFrame(data = {'date':['2021-01-01', '2021-01-02', '2021-01-05','2021-01-02', '2021-01-03', '2021-01-05'],
                          'product':['A', 'A', 'A', 'B', 'B', 'B'],
                          'price':[10, 20, 30, 40, 50, 60]
                         }
                 )
df['date'] = pd.to_datetime(df['date'])

I want to create an empty dataframe, let's say main_df, which will contain all dates between df.date.min() and df.date.max() for each product. On days where the value is NaN I want to ffill, and then bfill for the remaining gaps. The resulting dataframe would be as below:

+------------+---------+-------+
|    date    | product | price |
+------------+---------+-------+
| 2021-01-01 | A       |    10 |
| 2021-01-02 | A       |    20 |
| 2021-01-03 | A       |    20 |
| 2021-01-04 | A       |    20 |
| 2021-01-05 | A       |    30 |
| 2021-01-01 | B       |    40 |
| 2021-01-02 | B       |    40 |
| 2021-01-03 | B       |    50 |
| 2021-01-04 | B       |    50 |
| 2021-01-05 | B       |    60 |
+------------+---------+-------+
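
For reference, a minimal sketch of one way to build the main_df scaffold described above: create the full product × date grid, left-merge the original data onto it, and fill per product. This assumes pandas is imported as pd and df is defined as in the question; all_dates and the final column order are choices made here, not requirements.

import pandas as pd

# All days between the global min and max dates.
all_dates = pd.date_range(df['date'].min(), df['date'].max(), freq='D')

# Scaffold: every (product, date) combination.
main_df = pd.MultiIndex.from_product(
    [df['product'].unique(), all_dates], names=['product', 'date']
).to_frame(index=False)

# Attach the known prices, then ffill/bfill within each product.
main_df = main_df.merge(df, on=['product', 'date'], how='left')
main_df['price'] = main_df.groupby('product')['price'].transform(
    lambda s: s.ffill().bfill()
)
main_df = main_df[['date', 'product', 'price']]
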
  • You would like to only keep dates that are between the max and min dates of each group, so how come they are still included in your desired output? Commented Nov 19, 2022 at 11:14
  • It is your assumption that I want to keep the min and max of each group. As stated in the OP, df.date.min() is the global min and, similarly, df.date.max() is the global max. Commented Nov 19, 2022 at 11:19
  • Does df.date.min() compute the min date for each group? Commented Nov 19, 2022 at 11:21
  • "Between" means the boundary values are inclusive, not exclusive. The resulting output is shown to make clear what is required, not to argue. Commented Nov 19, 2022 at 11:27
  • Let us continue this discussion in chat. Commented Nov 19, 2022 at 11:28

2 Answers


First

Make a pivot table, upsample to a daily frequency with asfreq('D'), and fill the NaNs with ffill and bfill:

df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()

output:

product     A       B
date        
2021-01-01  10.0    40.0
2021-01-02  20.0    40.0
2021-01-03  20.0    50.0
2021-01-04  20.0    50.0
2021-01-05  30.0    60.0



Second

Stack the result back into long format, rename the value column, and sort by product (full code included):

(df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()
 .stack().reset_index().rename(columns={0:'price'})
 .sort_values('product').reset_index(drop=True))

output:

    date        product price
0   2021-01-01  A       10.0
1   2021-01-02  A       20.0
2   2021-01-03  A       20.0
3   2021-01-04  A       20.0
4   2021-01-05  A       30.0
5   2021-01-01  B       40.0
6   2021-01-02  B       40.0
7   2021-01-03  B       50.0
8   2021-01-04  B       50.0
9   2021-01-05  B       60.0
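
A small caveat, not part of the original answer: asfreq('D') introduces NaNs before they are filled, so price comes back as float. If integer prices are wanted, the column can be cast once the fills are done; out below is just a hypothetical name for the frame built above:

out = (df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()
       .stack().reset_index().rename(columns={0: 'price'})
       .sort_values('product').reset_index(drop=True))
out['price'] = out['price'].astype(int)  # safe here: ffill/bfill left no NaNs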



Using resample

import pandas as pd

df = pd.DataFrame(data = {'date':['2021-01-01', '2021-01-02', '2021-01-05','2021-01-02', '2021-01-03', '2021-01-06'],
                          'product':['A', 'A', 'A', 'B', 'B', 'B'],
                          'price':[10, 20, 30, 40, 50, 60]
                         }
                 )
df['date'] = pd.to_datetime(df['date'])

df
# Out: 
#          date product  price
# 0  2021-01-01       A     10
# 1  2021-01-02       A     20
# 2  2021-01-05       A     30
# 3  2021-01-02       B     40
# 4  2021-01-03       B     50
# 5  2021-01-06       B     60




df.set_index("date").groupby("product")["price"].resample("d").ffill().reset_index()
# Out: 
#   product       date  price
# 0       A 2021-01-01     10
# 1       A 2021-01-02     20
# 2       A 2021-01-03     20
# 3       A 2021-01-04     20
# 4       A 2021-01-05     30
# 5       B 2021-01-02     40
# 6       B 2021-01-03     50
# 7       B 2021-01-04     50
# 8       B 2021-01-05     50
# 9       B 2021-01-06     60

See the rows that have been filled by ffill:

df.set_index("date").groupby("product")["price"].resample("d").mean()
# Out: 
# product  date      
# A        2021-01-01    10.0
#          2021-01-02    20.0
#          2021-01-03     NaN
#          2021-01-04     NaN
#          2021-01-05    30.0
# B        2021-01-02    40.0
#          2021-01-03    50.0
#          2021-01-04     NaN
#          2021-01-05     NaN
#          2021-01-06    60.0
# Name: price, dtype: float64

Note that by grouping by product before resampling and filling the empty slots, you can have different ranges (from min to max) for each product (I modified the data to showcase this).
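
If the single global range from the question is wanted instead (every product spanning df.date.min() through df.date.max()), one option is to reindex each group against a shared daily range before filling. A minimal sketch, assuming the df defined just above; full_range and out are names introduced here, not part of the original answer:

full_range = pd.date_range(df['date'].min(), df['date'].max(), freq='D')

out = (df.set_index('date')
         .groupby('product')['price']
         .apply(lambda s: s.reindex(full_range).ffill().bfill())  # per-product fill over the shared range
         .rename_axis(['product', 'date'])
         .rename('price')
         .reset_index())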

