Convert dates to numpy datetime

Question

I have a dataframe with dates that comes from a CSV file. I need to add a column with the actual difference in days between the dates in my column and the date '6/1/2021'. I used

Act_Days.append((pd.to_datetime(df.date[t])- 
pd.to_datetime(df.settle_date))/np.timedelta64(1, 'D'))

This code works, but it takes a long time to run, since the dataset has about 30K rows and I assume it calculates row by row. Is there any way to increase the speed? I heard that working with NumPy arrays is much faster than working with pandas Series. However, when I convert my dates column to a NumPy array, Python won't subtract the 6/1/2021 date. It shows an error:

dates=output.date.to_numpy()
np.datetime64(dates)-np.timedelta64('2021-6-1', 'D')
--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-05fdef3e68dd> in <module>
  1 dates=output.date.to_numpy()
----> 2 np.datetime64(dates)-np.timedelta64('2021-6-1', 'D')

ValueError: Could not convert object to NumPy datetime

"2021-6-1" is not a timedelta. Were you trying to subtract two dates? In that case, they both need to be datetime64 values, and the result is a timedelta64.
– Tim Roberts, Commented Jun 4, 2021 at 16:52
Can you add more code, e.g. showing why you use the indexing [t]? A general note: since pandas uses NumPy arrays under the hood, I doubt you'll gain much from changing data types. You might benefit more from refactoring the code.
– FObersteiner, Commented Jun 4, 2021 at 17:17
t is the index of the date in the date list. The dates come from a CSV file. I converted the pandas date column to a list and applied: Act_Days = []; for t in range(len(output)): Act_Days.append((pd.to_datetime(df.date[t]) - pd.to_datetime(df.settle_date)) / np.timedelta64(1, 'D'))
– Camilla, Commented Jun 4, 2021 at 18:14
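
The point made in the comments above (both operands must be datetime64 values, and the result is a timedelta64) can be sketched roughly as follows. This is only an illustration under assumptions: the three sample strings and the month/day/year format are made up, since the question does not show the actual CSV contents, and 6/1/2021 is read as June 1, 2021.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the CSV column (strings, month/day/year assumed)
output = pd.DataFrame({"date": ["6/4/2021", "6/1/2021", "5/25/2021"]})

# Parse the whole column once, then take the underlying NumPy array at day resolution
dates = (
    pd.to_datetime(output["date"], format="%m/%d/%Y")
    .to_numpy()
    .astype("datetime64[D]")
)

# The reference is a date (datetime64), not a timedelta64
reference = np.datetime64("2021-06-01", "D")

# datetime64 - datetime64 gives timedelta64; dividing by one day gives floats
act_days = (dates - reference) / np.timedelta64(1, "D")
print(act_days)
```

The original error likely comes from passing a whole array of strings to np.datetime64, which builds a single scalar, and from wrapping the reference date in np.timedelta64 instead of np.datetime64.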

Sebas Arango · Accepted Answer · 2021-06-04 17:47:56Z

3

Given your approach, I would do it like this (not stating that this is the best/optimal solution though):

import numpy as np
import pandas as pd

# Create sample dataset with roughly 30k values
sample_dates = list(np.arange('1990-01', '2020-12', dtype='datetime64[D]'))
sample_dates = sample_dates + sample_dates + sample_dates

# Create sample dataframe
data = pd.DataFrame({
    'Dates': sample_dates
})

# Add the new column
# Reference date from the question (6/1/2021, written there as '2021-6-1')
reference_date = np.datetime64("2021-06-01", 'D')
data["Act_Days"] = data['Dates'].map(lambda date_value: int(str((np.datetime64(date_value, 'D') - reference_date)).split(' ')[0]))

# Check results
data.head()

It combines NumPy datetime operations with pandas' map() method for row iteration.

Just to clarify, the string and integer parsing is done because NumPy timedelta64 objects are not indexable.
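
As a possible alternative to the map() call, the subtraction can also be written on the whole column at once, which avoids both the per-row lambda and the string parsing. This is a sketch under assumptions: the column name 'Dates' is reused from the sample dataframe, the four-day date range is made up, and the reference date is taken to be June 1, 2021. No timings were measured for it here.

```python
import pandas as pd

# Small made-up sample; the column name 'Dates' mirrors the sample dataframe
data = pd.DataFrame({"Dates": pd.date_range("2021-05-30", periods=4, freq="D")})

# Assumed reference date (6/1/2021 read as June 1, 2021)
reference_date = pd.Timestamp("2021-06-01")

# Subtracting a Timestamp from a datetime column gives a timedelta column;
# .dt.days extracts the whole-day count as integers
data["Act_Days"] = (data["Dates"] - reference_date).dt.days
print(data)
```

If the column holds strings read from a CSV, it would first need to be converted with pd.to_datetime once for the whole column.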

2 Comments

Camilla Over a year ago

Your code works 7x faster. Thank you so much, this is exactly what I was looking for.

Sebas Arango Over a year ago

@Camilla very glad that it worked fine for you :) In that case, would you please mark it as the selected answer? Thanks!
