I have to determine if there are gaps between date sets (determined by start and end date). I have two example dataframes:

import pandas as pd

a = pd.DataFrame({'start_date': ['01-01-2014', '01-01-2015', '05-01-2016'],
                  'end_date': ['01-01-2015', '01-01-2016', '05-01-2017']})

order = ['start_date', 'end_date']

a = a[order]

a.start_date = pd.to_datetime(a.start_date, dayfirst=True)
a.end_date = pd.to_datetime(a.end_date, dayfirst=True)


b = pd.DataFrame({'start_date': ['01-01-2014', '01-01-2015', '05-01-2016',
                                 '05-01-2017', '01-01-2015'],
                  'end_date': ['01-01-2015', '01-01-2016', '05-01-2017',
                               '05-01-2018', '05-01-2018']})

b = b[order]

b.start_date = pd.to_datetime(b.start_date, dayfirst=True)
b.end_date = pd.to_datetime(b.end_date, dayfirst=True)

a
b

For dataframe a, the solution is simple enough: order by start_date, shift end_date down by one, and subtract the shifted end_date from start_date. If the difference is positive, there is a gap in the dates.
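A minimal sketch of that shift-and-subtract check, assuming the same values as dataframe a. The names `a_sorted`, `delta` and `has_gap_before` are only illustrative:

```python
import pandas as pd

# Sample data mirroring dataframe a from the question.
a = pd.DataFrame({
    'start_date': pd.to_datetime(['01-01-2014', '01-01-2015', '05-01-2016'], dayfirst=True),
    'end_date': pd.to_datetime(['01-01-2015', '01-01-2016', '05-01-2017'], dayfirst=True),
})

a_sorted = a.sort_values('start_date')

# Compare each start_date with the end_date of the previous row.
delta = a_sorted['start_date'] - a_sorted['end_date'].shift()

# A positive difference means the previous range ended before this one began.
# The first row has no predecessor; its NaT difference compares as False.
has_gap_before = delta > pd.Timedelta(0)
```

Here only the third row is flagged, because it starts four days after the second row ends. This check looks only at the immediately preceding row, which is why it is not enough for dataframe b.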

However, achieving this for dataframe b is less obvious, as one range encompasses several of the others. I am unsure of a generic way of doing this that won't incorrectly find a gap. This will be done on grouped data (about 40000 groups).

2 Answers

This is the idea...

  • Assign a +1 for start dates and a -1 for end dates.
  • Take a cumulative sum where I order by all dates as one flat array.
  • When cumulative sum is zero... we hit a gap.
  • Sort by date first, and put start dates before end dates when they fall on the same date. This way, we don't add a negative one before adding a positive one when the end_date of one row equals the start_date of the next row.
  • I use numpy to do the sorting and to map the sorted results back to the original row order.
  • Return a boolean mask of where the gaps start.

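import numpy as np

def find_gaps(b):
    # Flatten to [start0, end0, start1, end1, ...]; columns must be ordered start, end
    d1 = b.values.ravel()
    # +1 for every start date, -1 for every end date
    d2 = np.tile([1, -1], len(d1) // 2)
    # Sort by date, with starts (+1) before ends (-1) on equal dates
    s = np.lexsort([-d2, d1])
    # u is the inverse permutation of s: it maps sorted positions back to original order
    u = np.empty_like(s)
    r = np.arange(d1.size)
    u[s] = r
    # Running count of open ranges in original order; keep only the end-date entries
    return d2[s].cumsum()[u][1::2] == 0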

demo

find_gaps(b)

array([False, False, False, False,  True], dtype=bool)

find_gaps(a)

array([False,  True,  True], dtype=bool)
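To make the numpy steps easier to follow, here is a plain-Python sketch of the same +1/-1 running count. It is not the code above, just an equivalent loop under the same assumptions; `gap_flags` and its event tuples are hypothetical names introduced for illustration:

```python
from datetime import date

def gap_flags(ranges):
    """Return one flag per (start, end) pair: True if no range is open after that end."""
    events = []
    for i, (start, end) in enumerate(ranges):
        # Sort key: date first, then starts (0) before ends (1) on the same date.
        events.append((start, 0, i, +1))
        events.append((end, 1, i, -1))
    events.sort()

    flags = [False] * len(ranges)
    open_ranges = 0
    for _day, is_end, i, step in events:
        open_ranges += step
        # The running count hits zero only when every range seen so far has closed.
        if is_end and open_ranges == 0:
            flags[i] = True
    return flags

# Same values as dataframe b in the question.
b_ranges = [
    (date(2014, 1, 1), date(2015, 1, 1)),
    (date(2015, 1, 1), date(2016, 1, 1)),
    (date(2016, 1, 5), date(2017, 1, 5)),
    (date(2017, 1, 5), date(2018, 1, 5)),
    (date(2015, 1, 1), date(2018, 1, 5)),
]
b_flags = gap_flags(b_ranges)
```

Note that the row with the latest end date is always flagged, since the count necessarily returns to zero there. That last True marks the end of the overall coverage and not a gap between ranges.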

IIUC you can do something like this:

In [198]: (b.sort_values('start_date')
     ...:   .stack()
     ...:   .shift().diff().dt.days
     ...:   .reset_index(name='days')
     ...:   .dropna()
     ...:   .query("level_1 == 'end_date' and days != 0"))
     ...:
Out[198]:
   level_0   level_1   days
5        4  end_date -365.0
7        2  end_date -731.0

The following code should show us indices where gaps were found:

In [199]: (b.sort_values('start_date')
     ...:   .stack()
     ...:   .shift().diff().dt.days
     ...:   .reset_index(name='days')
     ...:   .dropna()
     ...:   .query("level_1 == 'end_date' and days != 0")
     ...:   .loc[:, 'level_0'])
     ...:
Out[199]:
5    4
7    2
Name: level_0, dtype: int64

