
I have a large, chronologically ordered array of datetime.date objects. Many of the dates in this array are the same, but some dates are missing (it's a time series of real data, so it's messy).

I want to count how many data points there are for each date. Currently I do it like this:

import datetime as dt
import numpy as np

# Example data: four points per day across December 2012
t = np.array([dt.date(2012, 12, 1) + dt.timedelta(n) for n in np.arange(0, 31, 0.25)])

Ndays = (t[-1] - t[0]).days

# For each day, scan the whole array and count matching dates
data_per_day = np.array([sum(t == t[0] + dt.timedelta(d)) for d in range(Ndays)])

However, I find this to be very slow (more than 10 minutes for approximately 400,000 data points). Is there a faster way of doing this?

  • Maybe the call to timedelta is slowing you down. Consider comparing d against tLen = t - t[0], computed once beforehand, instead? How big is Ndays when you have 400k dates? Commented May 30, 2013 at 12:21
  • The performance of the different proposed solutions differs greatly depending on how many days you have. For the 400,000 data points, what is the value of Ndays? Commented May 30, 2013 at 13:04
  • Ndays is of order 2000. The solution by @root below sped things up by several orders of magnitude. Commented May 30, 2013 at 15:06
  • JesseC, out of curiosity, did you compare with my method? Commented May 31, 2013 at 12:55
  • Okay, I just tried, and @root's method is 14 times slower than my method if you include the time to convert data types. If you don't count that bit, then his method is about 4 times faster than mine. (This test is done on 200000 dates spanning 2000 days.) Commented May 31, 2013 at 14:36

3 Answers


Use np.datetime64. On @Hans Then's data I get 241 ms.

In [1]: import numpy as np

In [2]: import datetime as dt

In [3]: t = np.array([dt.date(2012,12,1) + dt.timedelta(n)
                        for n in np.arange(0,31,0.00001)])

In [4]: t = t.astype(np.datetime64)

In [5]: daterange = np.arange(t[0], t[-1], dtype='datetime64[D]')

In [6]: np.bincount(daterange.searchsorted(t))
Out[6]: 
array([100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000,
       100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000,
       100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000,
       100000, 100000, 100000, 100000, 100000, 100000, 100000])

In [7]: %timeit np.bincount(daterange.searchsorted(t))
1 loops, best of 3: 241 ms per loop
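
On newer NumPy releases the conversion details differ slightly (note the TypeError discussed in the comments below on older versions), so here is a self-contained sketch of the same searchsorted/bincount approach; the explicit 'datetime64[D]' unit and the one-day extension of the range are my additions, not part of the original answer:

import datetime as dt
import numpy as np

# Same test data: 100,000 points per day across December 2012
t = np.array([dt.date(2012, 12, 1) + dt.timedelta(n)
              for n in np.arange(0, 31, 0.00001)])

# Cast to day-resolution datetime64 once, up front
t = t.astype('datetime64[D]')

# One bucket per calendar day; np.arange excludes its stop value, so add
# one day so entries on the final date get a bucket of their own
daterange = np.arange(t[0], t[-1] + np.timedelta64(1, 'D'),
                      dtype='datetime64[D]')

# searchsorted maps each point to its day bucket; bincount tallies them
data_per_day = np.bincount(daterange.searchsorted(t))
print(data_per_day)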

7 Comments

  • This is pretty nice. I think your bottleneck for real data will be the call to searchsorted.
  • @Geoff -- Without knowing the characteristics of the real data, it is almost impossible to tell...
  • Amazing, thanks, this sped things up mightily. The only slight issue is that daterange = np.arange(t[0], t[-1], dtype='datetime64[D]') gives me the following error: TypeError: ufunc 'true_divide' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule 'safe' (but I found a workaround).
  • @JesseC -- What does np.__version__ say? Perhaps you are using an older version of numpy... An upgrade to the latest should fix that.
  • Mine says 1.6.2 and gives that error. What version is needed, root?

This runs in a couple of seconds for 3,100,000 entries.

import datetime as dt
import numpy as np
from collections import Counter

t = np.array([dt.date(2012,12,1) + dt.timedelta(n) for n in np.arange(0,31,0.00001)])

# Counter tallies how many times each distinct date appears
c = Counter(t)
print(c)
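
One caveat worth adding (my note, not part of the original answer): Counter gives an unordered date-to-count mapping. If you need a chronological per-day array like the OP's data_per_day, with zeros for missing days, you can index the counter by day offset:

# Build an ordered array from the counter; missing days count as zero
Ndays = (t[-1] - t[0]).days + 1
data_per_day = np.array([c[t[0] + dt.timedelta(d)] for d in range(Ndays)])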

2 Comments

  • And how fast is it with numpy? Because for me, the method given by the OP also takes only 2-3 seconds with these data.
  • I didn't check. I just wanted to show how cool the Counter class is. Using numpy's datetime64 is the better solution, so I will upvote that.

Here's a solution based on detecting the distance between consecutive unique dates:

# Get the boundary indexes where the date changes in t
indexes = np.hstack(([-1], np.nonzero(np.diff(t))[0], [len(t) - 1]))
# The gap between consecutive boundaries is the number of points that day
lengths = np.diff(indexes)

# Pull out the actual dates for the new days
dates = t[indexes[:-1] + 1]
# Convert them to day offsets (this assumes all dates fall in one month)
as_int = np.vectorize(lambda d: d.day)(dates) - 1

# Scatter the per-day counts into an array, leaving zeros for missing days
data_per_day = np.zeros(Ndays + 1, dtype=int)
data_per_day[as_int] = lengths
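
Note that the d.day trick above only produces valid offsets while all dates fall within a single month. A small variation (my adjustment, not part of the original answer) that works across month boundaries is to compute each date's offset from the first date directly:

# Day offset from the first date; valid across month boundaries
as_int = np.array([(d - t[0]).days for d in dates])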

