1

I wish to subtract some number of months from a datetime column. Each row has a different number of months to subtract. For example,

df = pd.DataFrame({
    'timestamp': pd.date_range('2017-10-01', '2018-10-01', freq='m'),
    'delta_in_months': [1, 4, 2, 5, 1, 3, 1, 5, 2, 4, 1, 3]
})

The outcome should look as so (the day rounding irrelevant, it could be 01 or 28/29/30/31, it was easier to type in 01),

    timestamp   delta_in_months  new_timestamp
0   2017-10-31  1                2017-09-01
1   2017-11-30  4                2017-07-01
2   2017-12-31  2                2017-10-01
3   2018-01-31  5                2017-08-01
4   2018-02-28  1                2018-01-01
5   2018-03-31  3                2017-12-01
6   2018-04-30  1                2018-03-01
7   2018-05-31  5                2017-12-01
8   2018-06-30  2                2018-04-01
9   2018-07-31  4                2018-03-01
10  2018-08-31  1                2018-07-01
11  2018-09-30  3                2018-06-01

Bear in mind that this will be for a much larger dataframe.


I have tried,

months_delta = df.delta_in_months.apply(pd.tseries.offsets.MonthOffset)
df['new_timestamp'] = df.timestamp - months_delta

but this gave very unexpected results, with each row entry being a DatetimeIndex.

2 Answers 2

5

If you are landing here because you were searching for a vectorized, fast and correct solution to the problem of adding a variable number of months to a Series of Timestamps, then read on.

In some problems, we do indeed want to add actual months (the way pd.offsets.DateOffset(months=x) works), i.e.: 2021-01-31 + 1 month --> 2021-02-28, and not just "30 days". But trying to use pd.offsets.DateOffset directly raises a warning (PerformanceWarning: Adding/subtracting object-dtype array to DatetimeArray not vectorized ). For example:

  • dates + df['months'].apply(lambda m: pd.offsets.DateOffset(months=m))
  • dates + months * pd.offsets.DateOffset(months=1), which additionally is wrong in some cases (e.g. 2015-07-29 + 59 months should be 2020-06-29, not 2020-06-28).

Instead, we can do a bit of arithmetic ourselves and get a vectorized solution:

# note: not timezone-aware
def vadd_months(dates, months):
    ddt = dates.dt
    m = ddt.month - 1 + months
    mb = pd.to_datetime(pd.DataFrame({
        'year': ddt.year + m // 12,
        'month': (m % 12) + 1,
        'day': 1})) + (dates - dates.dt.normalize())
    me = mb + pd.offsets.MonthEnd()
    r = mb + (ddt.day - 1) * pd.Timedelta(days=1)
    r = np.minimum(r, me)
    return r

Usage for the OP example

df['new_timestamp'] = vadd_months(df['timestamp'], df['delta_in_months'])

Speed

n = int(100_000)
df = pd.DataFrame({
    'timestamp': pd.Series(pd.to_datetime(np.random.randint(
            pd.Timestamp('2000').value,
            pd.Timestamp('2020').value,
            n
        ))).dt.floor('1s'),
    'months': np.random.randint(0, 120, n),
})

%%time
newts = vadd_months(df['timestamp'], df['months'])
# CPU times: user 52.3 ms, sys: 4.01 ms, total: 56.3 ms

Verification

Check with the (non-vectorized) direct use of pd.offsets.DateOffset:

import warnings

%%time
with warnings.catch_warnings():
    warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
    check = df['timestamp'] + df['months'].apply(lambda m: pd.offsets.DateOffset(months=m))
# CPU times: user 2.41 s, sys: 43.9 ms, total: 2.45 s

>>> newts.equals(check)
True

Note that vadd_months is 40x faster than the non-vectorized version, and there are no warnings to catch.

Sign up to request clarification or add additional context in comments.

Comments

0

you could try

df['new_timestamp'] = df.timestamp - pd.to_timedelta(df.delta_in_months,unit='M')
df['new_timestamp'] = df['new_timestamp'].dt.date

One important thing is to remember that 1 'M' = 30 'D'.

7 Comments

Nice! I didn't think to try this because 'M' was not listed in the pd.to_timedelta docs :'(
I have now raised this on pandas github as an issue.
yes. I sew the 'D' so I tried the same with 'M'. One important thing is to remember that 1 'M' = 30 'D'.
Units 'M' and 'Y' are no longer supported, as they do not represent unambiguous timedelta values durations.
it is often unacceptable to approximate an offset of 1 month by "30 days".
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.