
I have a large, chronologically ordered array of datetime.date objects. Many of the dates in this array are the same, but some dates are missing (it's a time series of real data, so it's messy).

I want to count how many data points there are for each date. Currently I do it like this:

import datetime as dt
import numpy as np

# Example data: four points per day across December 2012
t = np.array([dt.date(2012, 12, 1) + dt.timedelta(n) for n in np.arange(0, 31, 0.25)])

Ndays = (t[-1] - t[0]).days

# For each day, scan the whole array and count matching dates
data_per_day = np.array([sum(t == t[0] + dt.timedelta(d)) for d in range(Ndays)])

However, I find this to be very slow (more than 10 minutes for approximately 400,000 data points). Is there a faster way of doing this?

  • Maybe the call to timedelta is slowing you down. Consider comparing d against tLen = t - t[0], computed once beforehand, instead? How big is Ndays when you have 400k dates? Commented May 30, 2013 at 12:21
  • The performance of the different proposed solutions differs greatly depending on how many days you have. For the 400,000 data points, what is the value of Ndays? Commented May 30, 2013 at 13:04
  • Ndays is of order 2000. The solution by @root below sped things up by several orders of magnitude. Commented May 30, 2013 at 15:06
  • JesseC, out of curiosity, did you compare with my method? Commented May 31, 2013 at 12:55
  • Okay, I just tried, and @root's method is 14 times slower than my method if you include the time to convert data types. If you don't count that bit, then his method is about 4 times faster than mine. (This test is done on 200000 dates spanning 2000 days.) Commented May 31, 2013 at 14:36

3 Answers


Use np.datetime64. On @Hans Then's data I get 241 ms.

In [1]: import numpy as np

In [2]: import datetime as dt

In [3]: t = np.array([dt.date(2012,12,1) + dt.timedelta(n)
                        for n in np.arange(0,31,0.00001)])

In [4]: t = t.astype(np.datetime64)

In [5]: daterange = np.arange(t[0], t[-1], dtype='datetime64[D]')

In [6]: np.bincount(daterange.searchsorted(t))
Out[6]: 
array([100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000,
       100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000,
       100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000,
       100000, 100000, 100000, 100000, 100000, 100000, 100000])

In [7]: %timeit np.bincount(daterange.searchsorted(t))
1 loops, best of 3: 241 ms per loop
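
On newer NumPy releases the conversion details differ slightly (note the TypeError discussed in the comments below on older versions), so here is a self-contained sketch of the same searchsorted/bincount approach; the explicit 'datetime64[D]' unit and the one-day extension of the range are my additions, not part of the original answer:

import datetime as dt
import numpy as np

# Same test data: 100,000 points per day across December 2012
t = np.array([dt.date(2012, 12, 1) + dt.timedelta(n)
              for n in np.arange(0, 31, 0.00001)])

# Cast to day-resolution datetime64 once, up front
t = t.astype('datetime64[D]')

# One bucket per calendar day; np.arange excludes its stop value, so add
# one day so entries on the final date get a bucket of their own
daterange = np.arange(t[0], t[-1] + np.timedelta64(1, 'D'),
                      dtype='datetime64[D]')

# searchsorted maps each point to its day bucket; bincount tallies them
data_per_day = np.bincount(daterange.searchsorted(t))
print(data_per_day)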

7 Comments

  • This is pretty nice. I think your bottleneck for real data will be the call to searchsorted.
  • @Geoff -- Without knowing the characteristics of the real data, it is almost impossible to tell...
  • Amazing, thanks, this sped things up mightily. The only slight issue is that daterange = np.arange(t[0], t[-1], dtype='datetime64[D]') gives me the following error: TypeError: ufunc 'true_divide' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule 'safe' (but I found a workaround).
  • @JesseC -- What does np.__version__ say? Perhaps you are using an older version of numpy... An upgrade to the latest should fix that.
  • Mine says 1.6.2 and gives that error. What version is needed, root?

This runs in a couple of seconds for 3,100,000 entries.

import datetime as dt
import numpy as np
from collections import Counter

t = np.array([dt.date(2012,12,1) + dt.timedelta(n) for n in np.arange(0,31,0.00001)])

# Counter tallies how many times each distinct date appears
c = Counter(t)
print(c)
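
One caveat worth adding (my note, not part of the original answer): Counter gives an unordered date-to-count mapping. If you need a chronological per-day array like the OP's data_per_day, with zeros for missing days, you can index the counter by day offset:

# Build an ordered array from the counter; missing days count as zero
Ndays = (t[-1] - t[0]).days + 1
data_per_day = np.array([c[t[0] + dt.timedelta(d)] for d in range(Ndays)])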

2 Comments

  • And how fast is it with numpy? Because for me, the method given by the OP also takes only 2-3 seconds with these data.
  • I didn't check. I just wanted to show how cool the Counter class is. Using numpy's datetime64 is the better solution, so I will upvote that.

Here's a solution based on detecting the distance between consecutive unique dates:

# Get the boundary indexes where the date changes in t
indexes = np.hstack(([-1], np.nonzero(np.diff(t))[0], [len(t) - 1]))
# The gap between consecutive boundaries is the number of points that day
lengths = np.diff(indexes)

# Pull out the actual dates for the new days
dates = t[indexes[:-1] + 1]
# Convert them to day offsets (this assumes all dates fall in one month)
as_int = np.vectorize(lambda d: d.day)(dates) - 1

# Scatter the per-day counts into an array, leaving zeros for missing days
data_per_day = np.zeros(Ndays + 1, dtype=int)
data_per_day[as_int] = lengths
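
Note that the d.day trick above only produces valid offsets while all dates fall within a single month. A small variation (my adjustment, not part of the original answer) that works across month boundaries is to compute each date's offset from the first date directly:

# Day offset from the first date; valid across month boundaries
as_int = np.array([(d - t[0]).days for d in dates])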

