I have a dataframe with dates that comes from a CSV file. I need to add a column with the actual difference in days between the dates in my column and the date 6/1/2021. I used

Act_Days.append((pd.to_datetime(df.date[t]) - pd.to_datetime(df.settle_date)) / np.timedelta64(1, 'D'))

This code works, but it takes a long time to run, as the dataset has about 30K rows and I assume it calculates row by row. Is there any way to increase the speed? I heard that working with NumPy arrays is much faster than working with pandas Series; however, when I convert my dates column to a NumPy array, Python won't subtract the 6/1/2021 date. It shows an error:

dates=output.date.to_numpy()
np.datetime64(dates)-np.timedelta64('2021-6-1', 'D')
--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-05fdef3e68dd> in <module>
  1 dates=output.date.to_numpy()
----> 2 np.datetime64(dates)-np.timedelta64('2021-6-1', 'D')

ValueError: Could not convert object to NumPy datetime
  • "2021-6-1" is not a timedelta. Were you trying to subtract two dates? In that case, they both need to be datetime64 values, and the result is a timedelta64. Commented Jun 4, 2021 at 16:52
  • can you add more code, e.g. showing why you use the indexing [t]? A general note; since pandas uses numpy arrays under the hood, I doubt you'll gain much from changing data types - you might benefit more from refactoring the code. Commented Jun 4, 2021 at 17:17
  • t is the index of each date in the date list. The dates come from a CSV file. I converted the pandas date column to a list and applied:
        Act_Days = []
        for t in range(len(output)):
            Act_Days.append((pd.to_datetime(df.date[t]) - pd.to_datetime(df.settle_date)) / np.timedelta64(1, 'D'))
    Commented Jun 4, 2021 at 18:14
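
The first comment's point can be sketched in a few lines: both operands of the subtraction must be datetime64 values, and the result is a timedelta64. This is only a minimal sketch under assumptions: the sample values and the names `dates`, `reference`, and `act_days` are made up for illustration, and 6/1/2021 is read as 1 June 2021.

```python
import numpy as np

# Hypothetical sample values standing in for output.date.to_numpy();
# the strings are converted to day-resolution datetime64 values
dates = np.array(["2021-06-04", "2021-05-30", "2021-06-01"], dtype="datetime64[D]")

# The reference must be a datetime64, not a timedelta64
# (assumption: 6/1/2021 means 1 June 2021)
reference = np.datetime64("2021-06-01", "D")

# datetime64 - datetime64 yields timedelta64 for the whole array at once;
# dividing by a one-day timedelta turns it into plain floats
act_days = (dates - reference) / np.timedelta64(1, "D")
print(act_days)
```

The subtraction runs over the whole array in one step, so no explicit Python loop over the rows is needed.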

1 Answer

Given your approach, I would do it like this (not stating that this is the best/optimal solution though):

import numpy as np
import pandas as pd

# Create sample dataset with roughly 30k values
sample_dates = list(np.arange('1990-01', '2020-12', dtype='datetime64[D]'))
sample_dates = sample_dates + sample_dates + sample_dates

# Create sample dataframe
data = pd.DataFrame({
    'Dates': sample_dates
})

# Add the new column
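# Note: this reads 6/1/2021 as 6 January 2021; use "2021-06-01" if the date is month-first (1 June)
reference_date = np.datetime64("2021-01-06", 'D')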
data["Act_Days"] = data['Dates'].map(lambda date_value: int(str((np.datetime64(date_value, 'D') - reference_date)).split(' ')[0]))

# Check results
data.head()

It does the subtraction on NumPy datetime64 values and applies it to each element with pandas' Series.map(). Note that map() with a Python lambda still calls the function once per row. Results look like this:

[Image: output of data.head()]

Just to clarify, the string and integer parsing extracts the day count from the timedelta's string form (for example "5 days"), since NumPy timedelta64 objects are not indexable.
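
For comparison, the same column could probably also be computed without a per-row function, using pandas' own datetime arithmetic. This is a sketch and not part of the approach above. The column name `date`, the sample values, the month-first `format` string, and the 1 June 2021 reading of the reference date are all assumptions.

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV column (month/day/year strings)
df = pd.DataFrame({"date": ["6/4/2021", "5/30/2021", "6/1/2021"]})

# Assumption: 6/1/2021 means 1 June 2021
reference = pd.Timestamp("2021-06-01")

# Parse the whole column once, then subtract the reference from every row at once
parsed = pd.to_datetime(df["date"], format="%m/%d/%Y")

# .dt.days extracts the whole-day count from each Timedelta as an integer
df["Act_Days"] = (parsed - reference).dt.days
print(df)
```

Parsing the column once with an explicit format and subtracting a single Timestamp avoids calling to_datetime for every row, which is likely where most of the time in the original loop went.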

2 Comments

Your code works 7x faster. Thank you so much, this is exactly what I was looking for.
@Camilla very glad that it worked fine for you :) In that case, would you please mark it as the selected answer? Thanks!
