
I want to update and overwrite the values of one dataframe with the values in another, matching on the datetime index, where the datetime index contains repeated values. This code illustrates my problem. I have given df1 deliberately large values so they are easy to spot:

# import packages
import pandas as pd
import numpy as np

# create dataframes with random values (df1 uses a distinct range so its rows are easy to spot)
df = pd.DataFrame(np.random.randint(0, 30, size=(10, 3)), columns=['MeanT', 'MaxT', 'MinT'])
df1 = pd.DataFrame(np.random.randint(900, 1000, size=(10, 3)), columns=['MeanT', 'MaxT', 'MinT'])

df['Location'] = [2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
df1['Location'] = [2, 2, 3, 3, 4, 4, 5, 5, 6, 6]

# repeated datetime index: two consecutive days for each of the five locations
df.index = pd.to_datetime(["2020-05-18 12:00:00", "2020-05-19 12:00:00"] * 5)
df1.index = pd.to_datetime(["2020-05-19 12:00:00", "2020-05-20 12:00:00"] * 5)

Here are both dataframes: df has rows for the 18th and 19th, and df1 has rows for the 19th and 20th.

print(df)
                     MeanT  MaxT  MinT  Location
2020-05-18 12:00:00     28     0     9         2
2020-05-19 12:00:00     22     7    11         2
2020-05-18 12:00:00      2     7     7         3
2020-05-19 12:00:00     10    24    18         3
2020-05-18 12:00:00     10    12    25         4
2020-05-19 12:00:00     25     7    20         4
2020-05-18 12:00:00      1     8    11         5
2020-05-19 12:00:00     27    19    12         5
2020-05-18 12:00:00     25    10    26         6
2020-05-19 12:00:00     29    11    27         6

print(df1)
                     MeanT  MaxT  MinT  Location
2020-05-19 12:00:00    912   991   915         2
2020-05-20 12:00:00    936   917   965         2
2020-05-19 12:00:00    918   977   901         3
2020-05-20 12:00:00    974   971   927         3
2020-05-19 12:00:00    979   929   953         4
2020-05-20 12:00:00    988   955   939         4
2020-05-19 12:00:00    969   983   940         5
2020-05-20 12:00:00    902   904   916         5
2020-05-19 12:00:00    983   942   965         6
2020-05-20 12:00:00    928   994   933         6

I want to create a new dataframe which updates df with the values from df1, so the new df has values for the 18th from df, and the 19th and 20th from df1.

I have tried using combine_first like so:

df = df.set_index(df.groupby(level=0).cumcount(), append=True)
df1 = df1.set_index(df1.groupby(level=0).cumcount(), append=True)
 
df3 = df.combine_first(df1).sort_index(level=[1,0]).reset_index(level=1, drop=True)

which adds the rows for the 20th, but doesn't overwrite the data for the 19th with the values in df1. It produces this output:

print(df3)
                     MeanT   MaxT   MinT  Location
2020-05-18 12:00:00   28.0    0.0    9.0       2.0
2020-05-19 12:00:00   22.0    7.0   11.0       2.0
2020-05-20 12:00:00  936.0  917.0  965.0       2.0
2020-05-18 12:00:00    2.0    7.0    7.0       3.0
2020-05-19 12:00:00   10.0   24.0   18.0       3.0
2020-05-20 12:00:00  974.0  971.0  927.0       3.0
2020-05-18 12:00:00   10.0   12.0   25.0       4.0
2020-05-19 12:00:00   25.0    7.0   20.0       4.0
2020-05-20 12:00:00  988.0  955.0  939.0       4.0
2020-05-18 12:00:00    1.0    8.0   11.0       5.0
2020-05-19 12:00:00   27.0   19.0   12.0       5.0
2020-05-20 12:00:00  902.0  904.0  916.0       5.0
2020-05-18 12:00:00   25.0   10.0   26.0       6.0
2020-05-19 12:00:00   29.0   11.0   27.0       6.0
2020-05-20 12:00:00  928.0  994.0  933.0       6.0

So the values for the 18th and the 20th are correct, but the values for the 19th are still from df. I want the values from df to be overwritten with those in df1. Please help!

  • Oh, I think it is a simple matter of doing df3 = df1.combine_first(df).sort_index(level=[1,0]).reset_index(level=1, drop=True). I think I had my combine_first in the wrong order. I will test this and find out! Commented Oct 29, 2020 at 13:10
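
A quick check of that idea could look like the sketch below. It uses a cut-down, made-up version of the data (two locations, one value column) rather than the random frames above, and it assumes both frames list the locations in the same order within each timestamp, since the cumcount key only lines rows up by position.

```python
import pandas as pd

# Cut-down, made-up data: two locations, one value column.
times = pd.to_datetime(["2020-05-18 12:00", "2020-05-19 12:00"] * 2)
times1 = pd.to_datetime(["2020-05-19 12:00", "2020-05-20 12:00"] * 2)
df = pd.DataFrame({"MeanT": [28, 22, 2, 10], "Location": [2, 2, 3, 3]}, index=times)
df1 = pd.DataFrame({"MeanT": [912, 936, 918, 974], "Location": [2, 2, 3, 3]}, index=times1)

# Disambiguate repeated timestamps with a running counter, as in the question.
df = df.set_index(df.groupby(level=0).cumcount(), append=True)
df1 = df1.set_index(df1.groupby(level=0).cumcount(), append=True)

# df1 is the caller, so its values win wherever both frames have a row.
df3 = (df1.combine_first(df)
          .sort_index(level=[1, 0])
          .reset_index(level=1, drop=True))
print(df3)
```

If the two frames could list locations in different orders, the counter would pair up the wrong rows, so a key built from an actual column such as 'Location' is likely the safer choice.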

1 Answer

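You just need to call combine_first the other way round, with df1 as the caller: combine_first keeps the caller's non-null values and only fills the gaps from its argument. We can also use 'Location' as an index level instead of groupby.cumcount: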

df3 = (df1.set_index('Location', append=True)
          .combine_first(df.set_index('Location', append=True))  # df1's values win
          .reset_index(level='Location')
          .reindex(columns=df.columns)
          # a stable sort keeps the dates in order within each Location
          .sort_values('Location', kind='stable'))

print(df3)

                     MeanT   MaxT   MinT  Location
2020-05-18 12:00:00   28.0    0.0    9.0         2
2020-05-19 12:00:00  912.0  991.0  915.0         2
2020-05-20 12:00:00  936.0  917.0  965.0         2
2020-05-18 12:00:00    2.0    7.0    7.0         3
2020-05-19 12:00:00  918.0  977.0  901.0         3
2020-05-20 12:00:00  974.0  971.0  927.0         3
2020-05-18 12:00:00   10.0   12.0   25.0         4
2020-05-19 12:00:00  979.0  929.0  953.0         4
2020-05-20 12:00:00  988.0  955.0  939.0         4
2020-05-18 12:00:00    1.0    8.0   11.0         5
2020-05-19 12:00:00  969.0  983.0  940.0         5
2020-05-20 12:00:00  902.0  904.0  916.0         5
2020-05-18 12:00:00   25.0   10.0   26.0         6
2020-05-19 12:00:00  983.0  942.0  965.0         6
2020-05-20 12:00:00  928.0  994.0  933.0         6
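
To make the precedence rule concrete, here is a minimal sketch with made-up values. The frames `old` and `new` and the level name `time` are hypothetical and are not the question's data. It shows that whichever frame calls combine_first keeps its own values on overlapping keys, while the argument only fills in keys the caller lacks.

```python
import pandas as pd

# Hypothetical toy frames keyed by (timestamp, Location); values are made up.
idx_old = pd.MultiIndex.from_tuples(
    [(pd.Timestamp("2020-05-18 12:00"), 2), (pd.Timestamp("2020-05-19 12:00"), 2)],
    names=["time", "Location"],
)
idx_new = pd.MultiIndex.from_tuples(
    [(pd.Timestamp("2020-05-19 12:00"), 2), (pd.Timestamp("2020-05-20 12:00"), 2)],
    names=["time", "Location"],
)
old = pd.DataFrame({"MeanT": [28, 22]}, index=idx_old)
new = pd.DataFrame({"MeanT": [912, 936]}, index=idx_new)

# The caller wins on overlapping keys; the argument only fills missing entries.
old_wins = old.combine_first(new)
new_wins = new.combine_first(old)

overlap = (pd.Timestamp("2020-05-19 12:00"), 2)
print(old_wins.loc[overlap, "MeanT"])  # value taken from old
print(new_wins.loc[overlap, "MeanT"])  # value taken from new
```

Both results contain all three timestamps; only the value on the shared key differs, which is why the order of the two frames matters.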