
I have a pandas DataFrame that looks like this:

INPUT - here is runnable example code to create the input:

import pandas as pd

# Create DataFrame with example data
df_example = pd.DataFrame(columns=["START_D", "ID_1", "ID_2", "STOP_D"])
df_example["START_D"] = ['2014-06-16', '2014-06-01', '2016-05-01', '2014-05-28', '2014-05-20', '2015-09-01']
df_example['ID_1'] = [1, 2, 3, 2, 1, 1]
df_example['ID_2'] = ['a', 'a', 'b', 'b', 'a', 'a']
df_example["STOP_D"] = ['2014-07-28', '2014-07-01', '2016-06-01', '2014-08-01', '2014-07-29', '2015-10-01']

# Convert date columns to datetime
df_example["START_D"] = pd.to_datetime(df_example["START_D"])
df_example["STOP_D"] = pd.to_datetime(df_example["STOP_D"])
df_example

     START_D  ID_1 ID_2     STOP_D
0 2014-06-16     1    a 2014-07-28
1 2014-06-01     2    a 2014-07-01
2 2016-05-01     3    b 2016-06-01
3 2014-05-28     2    b 2014-08-01
4 2014-05-20     1    a 2014-07-29
5 2015-09-01     1    a 2015-10-01

and I am looking for a way to group by ID_1 and merge rows whose [START_D, STOP_D] intervals overlap. The merged START_D should be the smallest start and the merged STOP_D the greatest stop. Below you can see the desired output, which I currently get by looping over all rows (iterrows) and checking one element at a time.

OUTPUT - even though this approach works, I think it is slow for large DataFrames, and there must be a more pythonic, pandas-native way to do it. A sketch of the kind of loop I mean is shown after the output below.

>>> df_result
     START_D  ID_1     STOP_D
0 2014-05-20     1 2014-07-29
1 2014-05-28     2 2014-08-01
2 2016-05-01     3 2016-06-01
3 2015-09-01     1 2015-10-01
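
For reference, here is a minimal sketch of the kind of row-by-row loop I mean (a hypothetical helper, not my exact code; it relies on sorting so that overlapping rows with the same ID_1 become adjacent):

# Hypothetical sketch of the slow row-by-row merge
def merge_overlaps_slow(df):
    rows = []
    for _, row in df.sort_values(['ID_1', 'START_D']).iterrows():
        last = rows[-1] if rows else None
        # Extend the previous interval when the ID matches and the dates overlap
        if last is not None and row['ID_1'] == last['ID_1'] and row['START_D'] <= last['STOP_D']:
            last['STOP_D'] = max(last['STOP_D'], row['STOP_D'])
        else:
            rows.append({'START_D': row['START_D'], 'ID_1': row['ID_1'],
                         'STOP_D': row['STOP_D']})
    return pd.DataFrame(rows, columns=['START_D', 'ID_1', 'STOP_D'])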

Thanks!


2 Answers

  • sort_values
  • groupby('ID_1')
  • track STOP_D.cummax() and check whether START_D exceeds the prior cummax (if it does, a new group starts)
  • cumsum to generate groupings
  • agg to grab min START_D and max STOP_D

df_example = df_example.sort_values(['START_D', 'STOP_D'])

def collapse(df):
    s, e = 'START_D', 'STOP_D'
    # A new group starts whenever START_D exceeds the running max of all prior STOP_D
    grps = df[s].gt(df[e].cummax().shift()).cumsum()
    funcs = {s: 'min', e: 'max', 'ID_1': 'first'}
    return df.groupby(grps).agg(funcs)

df_example.groupby('ID_1').apply(collapse).reset_index(drop=True)

The result contains the same four merged rows as the desired df_result above (ordered by ID_1).




The difficulty in your problem is that the aggregation needs to produce a single entry per group. So if there are non-overlapping START_D/STOP_D intervals that share the same ID_1, no aggregation on ID_1 alone (even a custom one) will work. I recommend the following steps:

  1. Loop through each ID and check whether the desired overlap actually occurs. This can likely be vectorized with some clever coding (a sketch is shown after this list). Where a conflict is found, generate a new ID (in a new column such as ID_3) to resolve it; otherwise just copy ID_1 into ID_3.
  2. Do a groupby using ID_3 (or whatever you choose to call it):

    df_result = df_example.groupby('ID_3').agg({'START_D': 'min', 'STOP_D': 'max'})
    

The key to the performance boost is coming up with a vectorized check for start/stop conflicts. Good luck! Hope this helps!
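
Here is a minimal sketch of how that vectorized check could look, assuming df_example from the question; ID_3 and the intermediate names are hypothetical:

# Sketch of step 1: build a hypothetical ID_3 helper column
df_s = df_example.sort_values(['ID_1', 'START_D', 'STOP_D']).copy()

# Within each ID_1, a new sub-interval begins whenever START_D exceeds
# the running maximum of all previous STOP_D values
prev_max_stop = df_s.groupby('ID_1')['STOP_D'].transform(lambda s: s.cummax().shift())
df_s['ID_3'] = df_s['START_D'].gt(prev_max_stop).groupby(df_s['ID_1']).cumsum()

# Step 2: group on (ID_1, ID_3) and aggregate
df_result = (df_s.groupby(['ID_1', 'ID_3'], as_index=False)
                 .agg({'START_D': 'min', 'STOP_D': 'max'})
                 .drop(columns='ID_3'))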

