UPDATE #2:
Here is a way to do what the question asks, as clarified by multiple OP comments, which is to:
- for each existing row, create enough rows in the result to accommodate consecutive days with periods of 8 hours (or, for the final day, up to 8 hours) such that the aggregate duration of this group of result rows matches the
durationInMinutes column value in the input.
- set
durationInMinutes for each result row to match the above logic.
- ensure that the
startTime of each result row is 09:00 on the appropriate date.
- ensure that the
endTime of each result row is a period of durationInMinutes later than startTime to match the above logic.
df['days'] = df.durationInMinutes // (8 * 60) + (df.durationInMinutes % (8 * 60) > 0)
df['durationInMinutes'] = df.apply(lambda x: [8 * 60] * (x.days - 1) +
[x.durationInMinutes % (8 * 60) + (x.durationInMinutes % (8 * 60) == 0) * 8 * 60], axis=1)
df['daysToAdd'] = df.days.apply(lambda x: range(x))
df = df.explode(['durationInMinutes', 'daysToAdd'])
df.startTime = pd.to_datetime(df.startTime, dayfirst=True)
df.startTime = pd.to_datetime(pd.DataFrame({
'year':df.startTime.dt.year, 'month':df.startTime.dt.month,
'day':df.startTime.dt.day + df.daysToAdd,
'hour':[9]*len(df.index)}))
df.endTime = (df.startTime.astype('int64') +
df.durationInMinutes * 60*1_000_000_000).astype('datetime64[ns]')
df = df.drop(columns=['days', 'daysToAdd']).reset_index(drop=True)
Input:
startTime endTime durationInMinutes
0 01-06-2010 09:00:00 02-06-2010 17:00:00 475
1 13-02-2016 08:30:00 14-02-2016 17:00:00 510
2 18-03-2018 09:30:00 19-03-2018 17:00:00 1440
3 23-05-2011 09:00:00 24-05-2011 17:00:00 1440
Output:
startTime endTime durationInMinutes
0 2010-06-01 09:00:00 2010-06-01 16:55:00 475
1 2016-02-13 09:00:00 2016-02-13 17:00:00 480
2 2016-02-14 09:00:00 2016-02-14 09:30:00 30
3 2018-03-18 09:00:00 2018-03-18 17:00:00 480
4 2018-03-19 09:00:00 2018-03-19 17:00:00 480
5 2018-03-20 09:00:00 2018-03-20 17:00:00 480
6 2011-05-23 09:00:00 2011-05-23 17:00:00 480
7 2011-05-24 09:00:00 2011-05-24 17:00:00 480
8 2011-05-25 09:00:00 2011-05-25 17:00:00 480
UPDATED:
Here is a way to do what you have asked in your question and clarified in your comments:
df['days'] = df.durationInMinutes // (8 * 60) +
(df.durationInMinutes % (8 * 60) > 0).astype(int)
df['durationInMinutes'] = df.apply(lambda x: [8 * 60] * (x.days - 1) +
[x.durationInMinutes % (8 * 60) +
(x.durationInMinutes % (8 * 60) == 0) * 8 * 60], axis=1)
df['daysToAdd'] = df.days.apply(lambda x: range(x))
df = df.explode(['durationInMinutes', 'daysToAdd'])
df.startTime = (pd.to_datetime(df.startTime, dayfirst=True).astype('int64') +
df.daysToAdd * 24*60*60*1_000_000_000).astype('datetime64[ns]')
df.endTime = (df.startTime.astype('int64') +
df.durationInMinutes * 60*1_000_000_000).astype('datetime64[ns]')
df = df.drop(columns=['days', 'daysToAdd']).reset_index(drop=True)
Explanation:
- add column
days with number of result rows corresponding to each input row.
- update
durationInMinutes column to contain a list for each row of the minutes in each result row originating from the input row.
- add column
daysToAdd with a list for each row of the number of days to add to the startTime for each result row originating from the input row.
- use
explode() to create result rows with one value each from the lists in durationInMinutes and daysToAdd.
- add the nanosecond equivalent of
daysToAdd to startTime.
- update
endTime to be startTime plus the nanosecond equivalent of durationInMinutes.
- use
drop() to eliminate unneeded columns, and use reset_index() to get an integer range index starting at 0 that advances by 1 for each row.
Input:
startTime endTime durationInMinutes
0 01-06-2010 09:00:00 02-06-2010 17:00:00 1445
1 13-02-2016 09:00:00 14-02-2016 17:00:00 1435
2 18-03-2018 09:00:00 19-03-2018 17:00:00 1440
3 23-05-2011 09:00:00 24-05-2011 17:00:00 1440
Output:
startTime endTime durationInMinutes
0 2010-06-01 09:00:00 2010-06-01 17:00:00 480
1 2010-06-02 09:00:00 2010-06-02 17:00:00 480
2 2010-06-03 09:00:00 2010-06-03 17:00:00 480
3 2010-06-04 09:00:00 2010-06-04 09:05:00 5
4 2016-02-13 09:00:00 2016-02-13 17:00:00 480
5 2016-02-14 09:00:00 2016-02-14 17:00:00 480
6 2016-02-15 09:00:00 2016-02-15 16:55:00 475
7 2018-03-18 09:00:00 2018-03-18 17:00:00 480
8 2018-03-19 09:00:00 2018-03-19 17:00:00 480
9 2018-03-20 09:00:00 2018-03-20 17:00:00 480
10 2011-05-23 09:00:00 2011-05-23 17:00:00 480
11 2011-05-24 09:00:00 2011-05-24 17:00:00 480
12 2011-05-25 09:00:00 2011-05-25 17:00:00 480
ORIGINAL ANSWER:
Here's a way to do what your question asks:
df = pd.concat([df.assign(durationInMinutes=df.durationInMinutes/3,
orig_row=i).reset_index() for i in range(3)])
for col in ['startTime', 'endTime']:
df[col] = (pd.to_datetime(df[col], dayfirst=True).astype('int64') +
df.orig_row * 24*60*60*1_000_000_000).astype('datetime64[ns]')
df = df.sort_values('index').drop(columns=['index', 'orig_row'])
Explanation:
- overwrite the
durationInMinutes column by dividing its contents by 3.
- create 3 copies of
df, each with a new column orig_row containing an integer corresponding to which number the copy is (0, 1 or 2).
- for each of
startTime and endTime, convert the string value to a datetime in nanoseconds and add the nanosecond equivalent of the number of days in orig_row to it (24 hours * 60 minutes * 60 seconds * 1bn nanoseconds).
- sort by the original index (so consecutive days are grouped together) and drop the intermediate columns.
Input:
startTime endTime durationInMinutes
0 01-06-2010 09:00:00 02-06-2010 17:00:00 1440
1 13-02-2016 09:00:00 14-02-2016 17:00:00 1440
2 18-03-2018 09:00:00 19-03-2018 17:00:00 1440
3 23-05-2011 09:00:00 24-05-2011 17:00:00 1440
Output:
startTime endTime durationInMinutes
0 2010-06-01 09:00:00 2010-06-02 17:00:00 480.0
0 2010-06-02 09:00:00 2010-06-03 17:00:00 480.0
0 2010-06-03 09:00:00 2010-06-04 17:00:00 480.0
1 2016-02-13 09:00:00 2016-02-14 17:00:00 480.0
1 2016-02-14 09:00:00 2016-02-15 17:00:00 480.0
1 2016-02-15 09:00:00 2016-02-16 17:00:00 480.0
2 2018-03-18 09:00:00 2018-03-19 17:00:00 480.0
2 2018-03-19 09:00:00 2018-03-20 17:00:00 480.0
2 2018-03-20 09:00:00 2018-03-21 17:00:00 480.0
3 2011-05-23 09:00:00 2011-05-24 17:00:00 480.0
3 2011-05-24 09:00:00 2011-05-25 17:00:00 480.0
3 2011-05-25 09:00:00 2011-05-26 17:00:00 480.0
TROUBLESHOOTING:
IN a comment, OP mentions getting ValueError: columns must have matching element counts in df.explode(). A print(df) just before the explode() line in the UPDATED solution in my environment gives:
startTime endTime durationInMinutes days daysToAdd
0 01-06-2010 09:00:00 02-06-2010 17:00:00 [480, 480, 480, 5] 4 (0, 1, 2, 3)
1 13-02-2016 09:00:00 14-02-2016 17:00:00 [480, 480, 475] 3 (0, 1, 2)
2 18-03-2018 09:00:00 19-03-2018 17:00:00 [480, 480, 480] 3 (0, 1, 2)
3 23-05-2011 09:00:00 24-05-2011 17:00:00 [480, 480, 480] 3 (0, 1, 2)