added 1418 characters in body
Maarten Fabré

Alternative without merge_asof

Since merge_asof apparently doesn't work as well with duplicate data, here is a variant with a loop. If there are a lot of weekends this might be slower, but I reckon it will still be faster than the original code.

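def mark_runin(time, week_endpoints, run_in, direction='backward'):
    # Start with nothing marked; each endpoint ORs its run-in window into the mask.
    mask = np.zeros_like(time, dtype=bool)
    for point in week_endpoints:
        # The window extends run_in after the point ('forward') or before it ('backward').
        interval = (point, point + run_in) if direction == 'forward' else (point - run_in, point)
        # Series.between includes both endpoints by default.
        mask |= time.between(*interval).values
    return mask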
mark_runin(time, weekend_start, run_in)
array([False,  True,  True,  True,  True,  True,  True, False, False, False, False, False], dtype=bool)
def drop_irregular_gaps2(data, gap_max, run_in, time_label='time'):
    times = data[time_label]
    weekend_start, week_start = find_weekend(times, gap_max)
    before_weekend = mark_runin(times, weekend_start, run_in, direction='backward')
    after_weekend = mark_runin(times, week_start, run_in, direction='forward')
    to_drop = before_weekend | after_weekend
    return data[~to_drop]
drop_irregular_gaps2(data, gap_max, run_in)
      time        values
0     2018-01-01  0.417022004702574
9     2018-01-17  0.538816734003357
10    2018-01-18  0.4191945144032948
11    2018-01-19  0.6852195003967595

added 761 characters in body

The gap can be found by using DataFrame.shift. This function assumes the time is in the index. If this is not the case, you might need to adapt this a bit.
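As a rough illustration of the shift idea, the sketch below subtracts each time from the one before it and flags the rows on either side of a gap larger than gap_max. The sample data and the names after_gap, before_gap, week_start and weekend_start are assumptions made for this example; it is not the find_weekend helper used elsewhere in this answer.

```python
import pandas as pd

# Hypothetical sample data with the time in the index, as assumed in the text.
index = pd.Index([0, 1, 2, 3, 7, 8, 9, 13, 15, 16, 17, 18], name="time")
data = pd.DataFrame({"values": range(len(index))}, index=index)
gap_max = 3

times = data.index.to_series()
# Distance to the previous row; the first row has no predecessor, so it is NaN.
gap = times - times.shift(1)

# Rows that directly follow an oversized gap (comparison with NaN is False).
after_gap = gap > gap_max
# Rows that directly precede an oversized gap: the same flags moved up one row.
before_gap = after_gap.shift(-1, fill_value=False)

week_start = times[after_gap].tolist()
weekend_start = times[before_gap].tolist()
print(weekend_start, week_start)
```

If the time lives in a column instead of the index, replacing data.index.to_series() with that column should be the only change needed.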



Datetime data

The algorithm should be agnostic about whether the time_label data is numeric or datetime. I verified that it also works with this dummy data:

import numpy as np
import pandas as pd

data_start = pd.Timestamp('20180101')
time = data_start + pd.to_timedelta([0, 1, 2, 3, 7, 8, 9, 13, 15, 16, 17, 18], unit='day')
gap_max, run_in = pd.to_timedelta(3, unit='day'), pd.to_timedelta(2, unit='day')
values = np.random.random(size=len(time))  # one random value per timestamp
data = pd.DataFrame({'time': time, 'values': values})
drop_irregular_gaps(data, gap_max, run_in)
      time        values
0     2018-01-01  0.417022004702574
9     2018-01-17  0.538816734003357
10    2018-01-18  0.4191945144032948
11    2018-01-19  0.6852195003967595
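
To make the numeric/datetime claim checkable in one place, here is a self-contained sketch that runs the loop-based variant on both kinds of time column. The find_weekend below is a hypothetical stand-in written for this example (the real helper is not shown here), and mark_runin is lightly adapted to allocate its mask with np.zeros.

```python
import numpy as np
import pandas as pd


def find_weekend(times, gap_max):
    """Hypothetical stand-in: return the points just before and just after each gap > gap_max."""
    after_gap = (times.diff() > gap_max).to_numpy()   # NaN/NaT in the first row compares False
    before_gap = np.append(after_gap[1:], False)      # same flags moved up one row
    return times[before_gap].tolist(), times[after_gap].tolist()


def mark_runin(time, week_endpoints, run_in, direction="backward"):
    mask = np.zeros(len(time), dtype=bool)
    for point in week_endpoints:
        interval = (point, point + run_in) if direction == "forward" else (point - run_in, point)
        mask |= time.between(*interval).to_numpy()
    return mask


def drop_irregular_gaps2(data, gap_max, run_in, time_label="time"):
    times = data[time_label]
    weekend_start, week_start = find_weekend(times, gap_max)
    to_drop = (mark_runin(times, weekend_start, run_in, "backward")
               | mark_runin(times, week_start, run_in, "forward"))
    return data[~to_drop]


offsets = [0, 1, 2, 3, 7, 8, 9, 13, 15, 16, 17, 18]
numeric = pd.DataFrame({"time": offsets, "values": np.arange(len(offsets))})
# Same offsets, expressed as days after 2018-01-01.
dated = numeric.assign(time=pd.Timestamp("2018-01-01") + pd.to_timedelta(offsets, unit="D"))

kept_numeric = drop_irregular_gaps2(numeric, gap_max=3, run_in=2)
kept_dated = drop_irregular_gaps2(dated, pd.Timedelta(days=3), pd.Timedelta(days=2))
print(kept_numeric.index.tolist(), kept_dated.index.tolist())
```

Both calls keep the same rows because the code only relies on subtraction, addition and comparison, which behave the same way for integers and for Timestamp/Timedelta pairs.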
