
I have a dataframe that looks like this:

import pandas as pd

data = [['A', '2022-09-01', '2022-09-05', 10], ['A', '2022-09-05', '2022-09-15', 1], ['A', '2022-09-15', '2022-09-18', 12], ['B', '2022-09-01', '2022-09-03', 4], ['B', '2022-09-03', '2022-09-07', 7], ['B', '2022-09-07', '2022-09-12', 9]]
df = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])

  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-05  2022-09-15      1
2     A  2022-09-15  2022-09-18     12
3     B  2022-09-01  2022-09-03      4
4     B  2022-09-03  2022-09-07      7
5     B  2022-09-07  2022-09-12      9

I have a certain_date, for example 2022-09-10. Within each group, I would like to split the row whose range from start_date to end_date contains certain_date. The end_date of that row should become certain_date, and a new row should be added directly below it with certain_date as its start_date and the original end_date as its end_date. Both rows keep the same value. Here is the expected output:

certain_date = '2022-09-10'

data = [['A', '2022-09-01', '2022-09-05', 10], ['A', '2022-09-05', '2022-09-10', 1], ['A', '2022-09-10', '2022-09-15', 1], ['A', '2022-09-15', '2022-09-18', 12], ['B', '2022-09-01', '2022-09-03', 4], ['B', '2022-09-03', '2022-09-07', 7], ['B', '2022-09-07', '2022-09-10', 9], ['B', '2022-09-10', '2022-09-12', 9]]
df_desired = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])

  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-05  2022-09-10      1
2     A  2022-09-10  2022-09-15      1
3     A  2022-09-15  2022-09-18     12
4     B  2022-09-01  2022-09-03      4
5     B  2022-09-03  2022-09-07      7
6     B  2022-09-07  2022-09-10      9
7     B  2022-09-10  2022-09-12      9

For GROUP A, certain_date falls within the date range of the second row, so that row is split as described above. The same happens to the third row of GROUP B. Is there a way to do this with pandas?


2 Answers

2

You can build a boolean mask for the rows whose range contains the date, then concat the non-matching rows with two copies of the matching rows: one with the end changed and one with the start changed.

certain_date = '2022-09-10'
# is date after start?
m1 = df['start_date'].lt(certain_date)
# is date before stop?
m2 = df['end_date'].gt(certain_date)
# do both conditions hold? (this could all be done in one line)
m = m1&m2

out = pd.concat([df[~m], # data without matches
                 # changed end
                 df[m].assign(end_date=certain_date),
                 # changed start
                 df[m].assign(start_date=certain_date)]).sort_index()
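
The mask above compares strings, which only works because every date is in zero-padded ISO format (YYYY-MM-DD), where alphabetical order matches chronological order. If the format is not guaranteed, a minimal sketch of the same mask on real datetimes could look like this (the two-row frame is a cut-down assumption for illustration, not the full data from the question):

```python
import pandas as pd

# Cut-down example frame (assumption: only the date columns matter here)
df = pd.DataFrame({
    'start_date': ['2022-09-05', '2022-09-15'],
    'end_date': ['2022-09-15', '2022-09-18'],
})

# Parse the strings once and compare real timestamps instead of text
start = pd.to_datetime(df['start_date'])
end = pd.to_datetime(df['end_date'])

certain_date = pd.Timestamp('2022-09-10')
m = start.lt(certain_date) & end.gt(certain_date)
print(m.tolist())  # [True, False]

# Both comparisons are strict, so a date that equals a boundary splits nothing
edge = pd.Timestamp('2022-09-15')
m_edge = start.lt(edge) & end.gt(edge)
print(m_edge.tolist())  # [False, False]
```

The strict comparisons are what prevent zero-length rows when certain_date coincides with an existing start_date or end_date.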

Alternative: build the mask in one line and use a shorter concat:

m = df['start_date'].lt(certain_date) & df['end_date'].gt(certain_date)

out = (pd  # all rows, with the end changed for the matches
  .concat([df.assign(end_date=df['end_date'].mask(m, certain_date)),
           # extra rows for the matches, with the start changed
           df[m].assign(start_date=certain_date)]).sort_index()
     )

Output (the split rows keep their original index label, so 1 and 5 appear twice):

  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-05  2022-09-10      1
1     A  2022-09-10  2022-09-15      1
2     A  2022-09-15  2022-09-18     12
3     B  2022-09-01  2022-09-03      4
4     B  2022-09-03  2022-09-07      7
5     B  2022-09-07  2022-09-10      9
5     B  2022-09-10  2022-09-12      9
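
If you need the clean 0..7 index from the expected output, one option is to wrap the approach in a small function and reset the index at the end. The sketch below is only one way to do it: split_at is a hypothetical helper name, and kind='mergesort' is my addition so that rows sharing an index label are guaranteed to keep their concat order (the default sort algorithm is not documented as stable).

```python
import pandas as pd

data = [['A', '2022-09-01', '2022-09-05', 10],
        ['A', '2022-09-05', '2022-09-15', 1],
        ['A', '2022-09-15', '2022-09-18', 12],
        ['B', '2022-09-01', '2022-09-03', 4],
        ['B', '2022-09-03', '2022-09-07', 7],
        ['B', '2022-09-07', '2022-09-12', 9]]
df = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])


def split_at(frame, date):
    """Hypothetical helper: split each row whose range strictly contains `date`.

    Assumes the date columns hold zero-padded ISO strings (YYYY-MM-DD).
    """
    m = frame['start_date'].lt(date) & frame['end_date'].gt(date)
    # All rows, with the end moved to `date` for the matches
    first = frame.assign(end_date=frame['end_date'].mask(m, date))
    # Extra rows for the matches, starting at `date`
    second = frame[m].assign(start_date=date)
    return (pd.concat([first, second])
              .sort_index(kind='mergesort')   # stable: first half stays on top
              .reset_index(drop=True))        # 0..n-1 instead of duplicate labels


out = split_at(df, '2022-09-10')
print(out)
```

The input frame is left untouched, because assign returns a copy.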

1

You can put the start_date, certain_date and end_date of each matching row into lists, then explode the result:

df[['start_date', 'end_date']] = df.apply(lambda row: [[row['start_date'], certain_date],
                                                       [certain_date, row['end_date']]]
                                          if row['start_date'] < certain_date < row['end_date']
                                          else [row['start_date'], row['end_date']],
                                          axis=1, result_type='expand')
out = df.explode(['start_date', 'end_date'], ignore_index=True)
print(out)

  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-05  2022-09-10      1
2     A  2022-09-10  2022-09-15      1
3     A  2022-09-15  2022-09-18     12
4     B  2022-09-01  2022-09-03      4
5     B  2022-09-03  2022-09-07      7
6     B  2022-09-07  2022-09-10      9
7     B  2022-09-10  2022-09-12      9

This is what print(df) shows just before the explode step (note that the date columns of df have been overwritten):

  GROUP                start_date                  end_date  value
0     A                2022-09-01                2022-09-05     10
1     A  [2022-09-05, 2022-09-10]  [2022-09-10, 2022-09-15]      1
2     A                2022-09-15                2022-09-18     12
3     B                2022-09-01                2022-09-03      4
4     B                2022-09-03                2022-09-07      7
5     B  [2022-09-07, 2022-09-10]  [2022-09-10, 2022-09-12]      9
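
If you would rather not overwrite df, a variation on the same idea is to build the list columns separately and pass them through assign. This is a sketch under a few assumptions: the dates are ISO strings, the names starts and ends are mine, and exploding several columns at once needs pandas 1.3 or later as far as I know.

```python
import pandas as pd

data = [['A', '2022-09-01', '2022-09-05', 10],
        ['A', '2022-09-05', '2022-09-15', 1],
        ['A', '2022-09-15', '2022-09-18', 12],
        ['B', '2022-09-01', '2022-09-03', 4],
        ['B', '2022-09-03', '2022-09-07', 7],
        ['B', '2022-09-07', '2022-09-12', 9]]
df = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])
certain_date = '2022-09-10'

# Rows whose range strictly contains the date
m = df['start_date'].lt(certain_date) & df['end_date'].gt(certain_date)

# One list per row: two entries for a matching row, one entry otherwise.
# Both columns get the same number of entries per row, which explode requires.
starts = pd.Series(
    [[s, certain_date] if hit else [s] for s, hit in zip(df['start_date'], m)],
    index=df.index)
ends = pd.Series(
    [[certain_date, e] if hit else [e] for e, hit in zip(df['end_date'], m)],
    index=df.index)

# assign returns a new frame, so df itself keeps its original columns
out = (df.assign(start_date=starts, end_date=ends)
         .explode(['start_date', 'end_date'], ignore_index=True))
print(out)
```

Compared with the row-wise apply, this also avoids calling a Python function per row with axis=1, which tends to be slow on large frames.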
