Imagine I have a dataframe that looks like:

ID      DATE         VALUE_1   Value_2 ...
1    31-01-2006        5         "USD"  
1    31-01-2007        5         "USD"
1    31-01-2008        10        "USD"
1    31-01-2011        11        "USD"
2    31-12-2006        5         "USD"
2    31-12-2007        5         "USD"
2    31-12-2008        5         "USD"
2    31-12-2009        5         "USD"

With X more columns.

As you can see, this is panel data with multiple entries on the same date for different IDs. What I want to do is fill in the missing dates for each ID. You can see that for ID "1" there is a gap of several years between the third and fourth entries (2008 to 2011).

I would like a dataframe that looks like the one below. Keep in mind that I am looking for a solution that stays efficient for dataframes with many value columns (30+) and many IDs (1000+). That is, there should NOT be any data filling for IDs that are already "complete", meaning that they already have the frequency specified by the data, in this case yearly. Note that even though the IDs have a yearly frequency, they don't always follow the calendar year.

ID      DATE         VALUE_1   Value_2 ...
1    31-01-2006        5         "USD"  
1    31-01-2007        5         "USD"
1    31-01-2008        10        "USD"
1    31-01-2009        NA          NA
1    31-01-2010        NA          NA
1    31-01-2011        11        "USD"
2    31-12-2006        5         "USD"
2    31-12-2007        5         "USD"
2    31-12-2008        5         "USD"
2    31-12-2009        5         "USD"
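
To make the "complete" condition concrete, here is a minimal sketch of how IDs with gaps could be detected before any filling is done. It rebuilds the sample frame above and assumes the dates are day-first strings; the names `year_step` and `ids_with_gaps` are illustrative only and not part of the original question.

```python
import pandas as pd

# Sample frame from the question (dates assumed to be day-first strings)
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "DATE": ["31-01-2006", "31-01-2007", "31-01-2008", "31-01-2011",
             "31-12-2006", "31-12-2007", "31-12-2008", "31-12-2009"],
    "VALUE_1": [5, 5, 10, 11, 5, 5, 5, 5],
    "Value_2": ["USD"] * 8,
})
df["DATE"] = pd.to_datetime(df["DATE"], format="%d-%m-%Y")
df = df.sort_values(["ID", "DATE"])

# Difference in years between consecutive rows of the same ID
# (NaN for the first row of each ID)
year_step = df["DATE"].dt.year.groupby(df["ID"]).diff()

# An ID has a gap if any consecutive pair is more than one year apart
ids_with_gaps = df.loc[year_step > 1, "ID"].unique().tolist()
print(ids_with_gaps)
```

Only the IDs in `ids_with_gaps` would need new rows; every other ID could be passed through untouched.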
  • Is the date range fixed or is it different for each ID? What if the first two rows were missing as well in your example? Would there be anything to fill in that case? Commented Feb 2, 2019 at 0:57
  • @a_guest The date range is different for each ID. There are never missing rows at the start for a specific ID. The start date may vary by ID, but dates before an ID's start date will not be in the table initially. Commented Feb 2, 2019 at 1:53

1 Answer

Here is a solution that resamples each ID to a yearly frequency anchored at the month and day of that ID's first row:

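import pandas as pd

def resample_custom_freq(data):
    """Resample one ID's rows to yearly frequency, keeping its own month/day anchor."""

    # Offsets from 1 January to the month and day of this ID's first row
    month = int(data['Month'].iloc[0]) - 1
    day = int(data['Day'].iloc[0]) - 1

    # Resample to year starts, then shift each date back to the original month and day
    data = data.drop(columns=['ID', 'Month', 'Day'], errors='ignore')
    data = data.resample('YS').last().reset_index()
    data['DATE'] += pd.DateOffset(months=month)
    data['DATE'] += pd.DateOffset(days=day)
    return data

# Dates in the question are day-first (dd-mm-yyyy)
df['DATE'] = pd.to_datetime(df['DATE'], format='%d-%m-%Y')
df['Month'] = df['DATE'].dt.month
df['Day'] = df['DATE'].dt.day
df.set_index('DATE', inplace=True, drop=True)
# The group key comes back as an index level; turn it into a column again
df_1 = df.groupby('ID').apply(resample_custom_freq).reset_index(level=0).reset_index(drop=True)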

df_1
Out[264]: 
   ID       DATE  VALUE_1 Value_2
0   1 2006-01-31      5.0   "USD"
1   1 2007-01-31      5.0   "USD"
2   1 2008-01-31     10.0   "USD"
3   1 2009-01-31      NaN     NaN
4   1 2010-01-31      NaN     NaN
5   1 2011-01-31     11.0   "USD"
6   2 2006-12-31      5.0   "USD"
7   2 2007-12-31      5.0   "USD"
8   2 2008-12-31      5.0   "USD"
9   2 2009-12-31      5.0   "USD"
4 Comments

The solution seems really nice for monthly data. Check my updated question with regard to yearly data.
The problem with your solution is that it redefines the data to be end of year. I don't want that. I want the yearly frequency but not for everything to be end of year.
Here is an updated answer that works if your year ends are in January and December.
Is there no way to make it dynamic, i.e. suitable for all possible year ends?
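
One possible way to handle arbitrary year ends, offered as a sketch rather than as the answerer's method: build a yearly date grid per ID from its own first date to its own last date, then left-merge the original rows onto that grid. The helper name `fill_yearly_gaps` is hypothetical. It assumes `DATE` is already a datetime column and that every existing date of an ID lies exactly on the yearly grid anchored at that ID's first date; rows off that grid would be dropped by the merge. A first date of 29 February is not handled specially.

```python
import pandas as pd

def fill_yearly_gaps(df, id_col="ID", date_col="DATE"):
    """Hypothetical helper: add NaN rows for missing years, per ID."""
    # First and last date of every ID
    bounds = df.groupby(id_col)[date_col].agg(["min", "max"])
    step = pd.DateOffset(years=1)

    # One yearly grid per ID, anchored at that ID's own first date
    grids = [
        pd.DataFrame({id_col: key, date_col: pd.date_range(first, last, freq=step)})
        for key, first, last in bounds.itertuples(name=None)
    ]
    grid = pd.concat(grids, ignore_index=True)

    # Left merge keeps every grid date; missing years get NaN in all value columns
    return grid.merge(df, on=[id_col, date_col], how="left")

# Sample frame from the question (dates assumed to be day-first strings)
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "DATE": ["31-01-2006", "31-01-2007", "31-01-2008", "31-01-2011",
             "31-12-2006", "31-12-2007", "31-12-2008", "31-12-2009"],
    "VALUE_1": [5, 5, 10, 11, 5, 5, 5, 5],
    "Value_2": ["USD"] * 8,
})
df["DATE"] = pd.to_datetime(df["DATE"], format="%d-%m-%Y")

filled = fill_yearly_gaps(df)
print(filled)
```

Because the grid comes from each ID's own first and last date, an ID that is already complete gets a grid identical to its existing dates and receives no new rows. The list comprehension loops over IDs in Python, which should be acceptable for a few thousand IDs, but that has not been benchmarked here.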