I have a dataframe with airline booking data for the past year for a particular origin and destination. There are hundreds of similar data-sets in the system.

Each data-set has gaps. In the current example, booking data is missing for about 85 days of the year.

There are two columns here - departure_date and bookings.

My next step is to add the missing dates to the departure_date column and set the corresponding values in the bookings column to NaN.

I am looking for the best way to do this.

Please find a part of the dataFrame below:

Index       departure_date              bookings
0           2017-11-02 00:00:00             43
1           2017-11-03 00:00:00             27
2           2017-11-05 00:00:00             27 ********
3           2017-11-06 00:00:00             22
4           2017-11-07 00:00:00             39
.
.
164         2018-05-22 00:00:00             17
165         2018-05-23 00:00:00             41
166         2018-05-24 00:00:00             73
167         2018-07-02 00:00:00             4  *********
168         2018-07-03 00:00:00             31
.
.
277         2018-10-31 00:00:00             50
278         2018-11-01 00:00:00             60

We can see that the data-set is for a one year period (Nov 2, 2017 to Nov 1, 2018). But we have data for 279 days only. For example, we don't have any data between 2018-05-25 and 2018-07-01. I would have to include these dates in the departure_date column and set the corresponding booking values to NaN.

For the second step, I plan to do some interpolation using something like

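# method='time' needs departure_date as a DatetimeIndex; assign back instead of using inplace on a column
dataFrame['bookings'] = dataFrame['bookings'].interpolate(method='time')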

Please suggest if there are better alternatives in Python.

  • I doubt an interpolation would be accurate... Commented Jan 4, 2019 at 10:39
  • That's true. It's only for testing purposes. For now, I need to know how to prepare the dataFrame by including the missing dates and NaN values in the bookings column. There seem to be many methods to estimate missing time series data. Commented Jan 4, 2019 at 10:40

1 Answer

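This resamples the series to one row per day and then fills the gaps. Note that resample needs departure_date as the index, and that the fill copies the previous day's value forward rather than leaving NaN.

# resample needs a DatetimeIndex; newer pandas versions use .ffill() in place of .pad()
dataFrame.set_index('departure_date')['bookings'].resample('D').ffill()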
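
Since the question asks for NaN in the missing rows rather than filled values, here is a minimal sketch of both variants side by side. The toy frame below is made up for illustration (only the column names come from the question), and it assumes departure_date is already a datetime column.

```python
import pandas as pd

# Toy data shaped like the question's frame: 2017-11-04 is missing.
df = pd.DataFrame({
    "departure_date": pd.to_datetime(
        ["2017-11-02", "2017-11-03", "2017-11-05", "2017-11-06"]
    ),
    "bookings": [43, 27, 27, 22],
})

# resample() needs a DatetimeIndex, so move the date column into the index.
daily = df.set_index("departure_date")["bookings"].resample("D")

with_nan = daily.asfreq()  # missing days are inserted with NaN
padded = daily.ffill()     # missing days copy the previous day's value
```

asfreq() only inserts the missing days, so it is probably the closer match if the estimation step is meant to come later.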

The resample documentation lists other fill methods, so you can pick the one that best fits your needs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
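
An alternative that does not go through resample is to reindex against an explicit daily calendar, which also makes it easy to count the gaps before the time-based interpolation mentioned in the question. This is only a sketch on made-up numbers; the names series, full_range, complete and filled are placeholders, not anything from the question.

```python
import pandas as pd

# Toy data: 2017-11-04, 2017-11-06 and 2017-11-07 are missing.
df = pd.DataFrame({
    "departure_date": pd.to_datetime(
        ["2017-11-02", "2017-11-03", "2017-11-05", "2017-11-08"]
    ),
    "bookings": [43, 27, 27, 30],
})

series = df.set_index("departure_date")["bookings"]

# Full daily calendar from the first to the last observed date.
full_range = pd.date_range(series.index.min(), series.index.max(), freq="D")

# reindex adds the missing dates; their bookings become NaN.
complete = series.reindex(full_range)
n_missing = int(complete.isna().sum())

# method="time" requires a DatetimeIndex and interpolates linearly in time.
filled = complete.interpolate(method="time")

# Back to the original two-column layout, if needed.
result = complete.rename_axis("departure_date").reset_index()
```

If the first or last day of the year could itself be missing, pass fixed start and end dates to pd.date_range instead of the observed minimum and maximum.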
