I have a dataframe in which I am trying to fill in missing months, keeping the value of previous months.

| Score_Date |  Num |        Name        | Score |
|:----------:|:----:|:------------------:|:-----:|
| 2019-12-01 | 4544 | ABC ELECTRONICS CO |   50  |
| 2020-03-01 | 4544 | ABC ELECTRONICS CO |   75  |
| 2020-06-01 | 4544 | ABC ELECTRONICS CO |   90  |
| 2020-09-01 |  454 | ABC ELECTRONICS CO |   50  |

Ideally, the dataframe would look like:

| Score_Date |  Num |        Name        | Score |
|:----------:|:----:|:------------------:|:-----:|
| 2019-12-01 | 4544 | ABC ELECTRONICS CO |   50  |
| 2020-01-01 | 4544 | ABC ELECTRONICS CO |   50  |
| 2020-02-01 | 4544 | ABC ELECTRONICS CO |   50  |
| 2020-03-01 | 4544 | ABC ELECTRONICS CO |   75  |
| 2020-04-01 | 4544 | ABC ELECTRONICS CO |   75  |
| 2020-05-01 | 4544 | ABC ELECTRONICS CO |   75  |
| 2020-06-01 | 4544 | ABC ELECTRONICS CO |   90  |
| 2020-07-01 | 4544 | ABC ELECTRONICS CO |   90  |
| 2020-08-01 | 4544 | ABC ELECTRONICS CO |   90  |
| 2020-09-01 | 4544 | ABC ELECTRONICS CO |   50  |

That is, I want to fill in each missing month with the values from the month before it, using pandas ffill().

I found this post and tried to implement a solution:

def expand_dates(grp):
    start = grp.index.min()
    end = today
    index = pd.date_range(start, end, freq='M')
    return grp.reindex(index).ffill()
test_df = test_df.set_index('Score_Date')
test_df = test_df.groupby('Name')['Score'].apply(expand_dates)
print(pd.concat([test_df.head(), test_df.tail()]))

Yet I receive:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-26-d27510150dc0> in <module>
      4     index = pd.date_range(start, end, freq='M')
      5     return grp.reindex(index).ffill()
----> 6 test_df = test_df.set_index('Score_Date')
      7 test_df = test_df.groupby('Name')['Score'].apply(expand_dates)
      8 print(pd.concat([test_df.head(), test_df.tail()]))

c:\python367-64\lib\site-packages\pandas\core\frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
   4553 
   4554         if missing:
-> 4555             raise KeyError(f"None of {missing} are in the columns")
   4556 
   4557         if inplace:

KeyError: "None of ['Score_Date'] are in the columns"

Note that print(test_df.columns) returns Index(['Num', 'Name', 'Score'], dtype='object'), yet Score_Date does show up when I print(test_df).

CSV Data:

Score_Date,Num,Name,Score
2019-12-01,4544,ABC ELECTRONICS CO,50
2020-03-01,4544,ABC ELECTRONICS CO,75
2020-06-01,4544,ABC ELECTRONICS CO,90
2020-09-01,4544,ABC ELECTRONICS CO,50
  • The error is quite clear: 'Score_Date' doesn't exist in the dataframe's columns. Check for leading and trailing spaces with print(test_df.columns). Commented Dec 15, 2020 at 17:53
  • May I ask how you load your dataframe? In this case, for example, the problem was with the encoding argument in pd.read_csv. Commented Dec 15, 2020 at 18:11
  • Just pd.read_csv, nothing fancy there. @Ralubrusto Commented Dec 15, 2020 at 18:45
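
The symptom described in the question (Score_Date missing from test_df.columns yet visible in print(test_df)) is consistent with Score_Date already being the index, for example because the set_index line was executed twice in the same notebook session. That is an assumption about the asker's session, not something the post confirms. A minimal sketch that reproduces the error and guards against it, using the CSV data from the question:

```python
import io
import pandas as pd

csv_text = """Score_Date,Num,Name,Score
2019-12-01,4544,ABC ELECTRONICS CO,50
2020-03-01,4544,ABC ELECTRONICS CO,75
2020-06-01,4544,ABC ELECTRONICS CO,90
2020-09-01,4544,ABC ELECTRONICS CO,50
"""

test_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Score_Date"])
test_df = test_df.set_index("Score_Date")  # first call works

# Score_Date is now the index, so it is no longer listed in .columns,
# although it still appears when the frame is printed.
print(test_df.columns.tolist())
print(test_df.index.name)

# Calling set_index again (e.g. by re-running the cell) raises KeyError.
try:
    test_df.set_index("Score_Date")
    second_call_failed = False
except KeyError:
    second_call_failed = True

# Guard: only set the index while the column is still present.
if "Score_Date" in test_df.columns:
    test_df = test_df.set_index("Score_Date")
```

If the column really is missing rather than moved to the index, test_df.reset_index() or re-reading the file are the usual ways back to a clean state.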

2 Answers

Assuming the column Score_Date is datetime, e.g. imported with

df = pd.read_csv('yourdata.csv', parse_dates=['Score_Date'])

You can use df.reindex with method='ffill':

df.set_index('Score_Date', inplace=True)
df_test = (df.reindex(pd.date_range(df.index.min(), df.index.max(), freq='MS'), method='ffill')
             .rename_axis('Score_Date').reset_index())
print(df_test)

Out:

  Score_Date   Num                Name  Score
0 2019-12-01  4544  ABC ELECTRONICS CO     50
1 2020-01-01  4544  ABC ELECTRONICS CO     50
2 2020-02-01  4544  ABC ELECTRONICS CO     50
3 2020-03-01  4544  ABC ELECTRONICS CO     75
4 2020-04-01  4544  ABC ELECTRONICS CO     75
5 2020-05-01  4544  ABC ELECTRONICS CO     75
6 2020-06-01  4544  ABC ELECTRONICS CO     90
7 2020-07-01  4544  ABC ELECTRONICS CO     90
8 2020-08-01  4544  ABC ELECTRONICS CO     90
9 2020-09-01   454  ABC ELECTRONICS CO     50
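
The question groups by Name, which suggests the real data may hold several companies. The following is a sketch of one way to apply the same reindex per group, not part of the original answer; the second company (XYZ SUPPLY CO) and the helper name expand_months are made up for illustration.

```python
import io
import pandas as pd

# The XYZ SUPPLY CO rows are hypothetical illustration data.
csv_text = """Score_Date,Num,Name,Score
2019-12-01,4544,ABC ELECTRONICS CO,50
2020-03-01,4544,ABC ELECTRONICS CO,75
2020-01-01,9001,XYZ SUPPLY CO,10
2020-04-01,9001,XYZ SUPPLY CO,20
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Score_Date"])


def expand_months(grp):
    """Reindex one company's rows to month starts, forward-filling gaps."""
    grp = grp.set_index("Score_Date").sort_index()
    months = pd.date_range(grp.index.min(), grp.index.max(), freq="MS")
    return grp.reindex(months, method="ffill").rename_axis("Score_Date")


# Iterate over the groups explicitly and stitch the pieces back together.
pieces = [expand_months(grp) for _, grp in df.groupby("Name")]
filled = pd.concat(pieces).reset_index()
print(filled)
```

Iterating over the groups and concatenating keeps each company's date range separate, so one company's scores are never forward-filled into another's months.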

1 Comment

How can I assign that reindex line to a new dataframe? Or do it in place? If I make the next line print(test_df), it shows the original.
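
On the question in this comment: reindex returns a new DataFrame and has no inplace option, so the result has to be assigned to a name (a new one, or the original). A small sketch with a cut-down frame, where the data and variable names are chosen only for illustration:

```python
import pandas as pd

test_df = pd.DataFrame(
    {"Score": [50, 75]},
    index=pd.to_datetime(["2019-12-01", "2020-03-01"]),
)
months = pd.date_range(test_df.index.min(), test_df.index.max(), freq="MS")

# The result is discarded here, so test_df still has its original rows.
test_df.reindex(months, method="ffill")
rows_before = len(test_df)

# Assigning the result back (or to a new name) keeps the expanded frame.
test_df = test_df.reindex(months, method="ffill")
rows_after = len(test_df)

print(rows_before, rows_after)
```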

Let's try reading the data and parsing the dates in one step; then you can use asfreq:

df = pd.read_clipboard(sep=',', parse_dates=True, index_col=0)

df.asfreq('MS').ffill()

Output:

               Num                Name  Score
Score_Date                                   
2019-12-01  4544.0  ABC ELECTRONICS CO   50.0
2020-01-01  4544.0  ABC ELECTRONICS CO   50.0
2020-02-01  4544.0  ABC ELECTRONICS CO   50.0
2020-03-01  4544.0  ABC ELECTRONICS CO   75.0
2020-04-01  4544.0  ABC ELECTRONICS CO   75.0
2020-05-01  4544.0  ABC ELECTRONICS CO   75.0
2020-06-01  4544.0  ABC ELECTRONICS CO   90.0
2020-07-01  4544.0  ABC ELECTRONICS CO   90.0
2020-08-01  4544.0  ABC ELECTRONICS CO   90.0
2020-09-01  4544.0  ABC ELECTRONICS CO   50.0
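
The output above shows Num and Score as floats, because asfreq first inserts NaN rows and ffill then fills them. As a possible variation (a sketch, not the answerer's code), asfreq also accepts a method argument, which fills while reindexing and so should leave the integer columns as integers. The example reads the question's CSV from a string rather than the clipboard so that it is self-contained.

```python
import io
import pandas as pd

csv_text = """Score_Date,Num,Name,Score
2019-12-01,4544,ABC ELECTRONICS CO,50
2020-03-01,4544,ABC ELECTRONICS CO,75
2020-06-01,4544,ABC ELECTRONICS CO,90
2020-09-01,4544,ABC ELECTRONICS CO,50
"""

df = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["Score_Date"], index_col="Score_Date"
)

# Two steps: NaN rows are created first, which upcasts ints to floats.
via_nan = df.asfreq("MS").ffill()

# One step: gaps are filled during the reindex, so no NaN is introduced.
direct = df.asfreq("MS", method="ffill")

print(via_nan.dtypes)
print(direct.dtypes)
```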
