I need to determine whether there are gaps between date ranges, where each range is defined by a start and end date. I have two example dataframes:
import pandas as pd
a = pd.DataFrame({'start_date' : ['01-01-2014', '01-01-2015', '05-01-2016'],
                  'end_date' : ['01-01-2015', '01-01-2016', '05-01-2017']})
order = ['start_date', 'end_date']
a = a[order]
a['start_date'] = pd.to_datetime(a['start_date'], dayfirst=True)
a['end_date'] = pd.to_datetime(a['end_date'], dayfirst=True)
b = pd.DataFrame({'start_date' : ['01-01-2014', '01-01-2015', '05-01-2016',
                                  '05-01-2017', '01-01-2015'],
                  'end_date' : ['01-01-2015', '01-01-2016', '05-01-2017',
                                '05-01-2018', '05-01-2018']})
order = ['start_date', 'end_date']
b = b[order]
b['start_date'] = pd.to_datetime(b['start_date'], dayfirst=True)
b['end_date'] = pd.to_datetime(b['end_date'], dayfirst=True)
a
b
For dataframe a, the solution is simple enough: sort by start_date, shift end_date down by one row, and subtract it from start_date. If the difference is positive, there is a gap in the dates.
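To make the approach for dataframe a concrete, here is a minimal sketch of what I mean. The names `a_sorted`, `gap` and `found_gap` are only for illustration, and the frame is rebuilt so that the snippet runs on its own:

```python
import pandas as pd

# Rebuild dataframe a so the snippet is self-contained.
a = pd.DataFrame({
    'start_date': pd.to_datetime(['01-01-2014', '01-01-2015', '05-01-2016'], dayfirst=True),
    'end_date': pd.to_datetime(['01-01-2015', '01-01-2016', '05-01-2017'], dayfirst=True),
})

# Sort by start_date, then compare each start with the previous row's end.
a_sorted = a.sort_values('start_date').reset_index(drop=True)
gap = a_sorted['start_date'] - a_sorted['end_date'].shift(1)

# The first row has no predecessor, so its difference is NaT and never counts as a gap.
found_gap = bool((gap > pd.Timedelta(0)).any())
```

For a this flags the third row, because 5 January 2016 starts four days after the previous range ends on 1 January 2016.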
However, achieving this for dataframe b is less obvious, because one range (01-01-2015 to 05-01-2018) encompasses several of the other ranges. I am unsure of a generic way of doing this that won't incorrectly find a gap. This will be done on grouped data (about 40000 groups).
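One direction that might handle the overlapping case is to compare each start date with the latest end date seen so far (a running maximum) rather than with only the previous row's end date. The following is a rough sketch under that assumption, not a tested solution for the full data. The function name `has_gap`, the helper `make_frame`, and the `group` column are hypothetical, and ranges that merely touch (end equals next start) are treated as having no gap:

```python
import pandas as pd

def has_gap(df):
    """Return True if the date ranges in df leave some period uncovered."""
    s = df.sort_values('start_date')
    # Latest end date among all earlier rows, not just the previous one.
    covered_until = s['end_date'].cummax().shift(1)
    # A gap exists when a range starts after everything before it has ended.
    return bool((s['start_date'] > covered_until).any())

def make_frame(starts, ends):
    # Hypothetical helper that builds a frame like the examples in the question.
    return pd.DataFrame({
        'start_date': pd.to_datetime(starts, dayfirst=True),
        'end_date': pd.to_datetime(ends, dayfirst=True),
    })

a = make_frame(['01-01-2014', '01-01-2015', '05-01-2016'],
               ['01-01-2015', '01-01-2016', '05-01-2017'])
b = make_frame(['01-01-2014', '01-01-2015', '05-01-2016', '05-01-2017', '01-01-2015'],
               ['01-01-2015', '01-01-2016', '05-01-2017', '05-01-2018', '05-01-2018'])

# Grouped usage, assuming a hypothetical 'group' column identifies each set.
combined = pd.concat([a.assign(group='a'), b.assign(group='b')], ignore_index=True)
gaps_by_group = pd.Series({key: has_gap(g) for key, g in combined.groupby('group')})
```

With the example data this reports a gap for a and none for b, since the range ending on 05-01-2018 covers the four days that are missing in a. Looping over about 40000 groups in Python may be slow, so a vectorised variant could be worth trying if performance matters.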