I have a dataframe with 200,000 rows. Each record has a timestamp, and I need to group the rows by date. So I do:

In [67]: df['result_date'][0]
Out[67]: Timestamp('2017-09-01 09:12:00')

In [68]: %timeit df['result_day'] = df['result_date'].apply(lambda x: str(x.date()))
2.26 s ± 73.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [69]: df['result_day'][0]
Out[69]: '2017-09-01'

or

In [70]: %timeit df['result_day'] = df['result_date'].apply(lambda x: x.date())
2.05 s ± 213 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [71]: df['result_day'][0]
Out[71]: datetime.date(2017, 9, 1)

Either way, it takes ~2 seconds. Can I do it faster?

UPD:

In [75]: df.shape
Out[75]: (228217, 18)

In [77]: %timeit df['result_date'].dt.date
1.44 s ± 42.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

2 Answers


Using the example from jezrael: you almost never want to actually use .date, since it creates Python objects. .normalize() sets the time component to 00:00:00, effectively making the values dates while keeping them in the high-performance datetime64[ns] format.

In [32]: rng = pd.date_range('2000-04-03', periods=200000, freq='2H')
    ...: df = pd.DataFrame({'result_date': rng})  
    ...: 

In [33]: %timeit df['result_date'].dt.date
482 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [34]: %timeit df['result_date'].dt.normalize()
16.3 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
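The speed difference above comes down to dtype: a short sketch (small synthetic series, same idea as the benchmark) showing that .dt.date falls back to an object-dtype Series of Python date objects, while .dt.normalize() stays in datetime64[ns]:

```python
import pandas as pd

# Small sample series of timestamps at 2-hour intervals
s = pd.Series(pd.date_range('2000-04-03', periods=5, freq='2h'))

# .dt.date produces Python datetime.date objects in an object-dtype Series
print(s.dt.date.dtype)         # object

# .dt.normalize() keeps the vectorized datetime64[ns] dtype,
# with the time component set to midnight
norm = s.dt.normalize()
print(norm.dtype)              # datetime64[ns]
print((norm.dt.hour == 0).all())  # True
```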

Grouping

In [39]: %timeit df.groupby(df['result_date'].dt.date).size()
506 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [40]: %timeit df.groupby(df['result_date'].dt.normalize()).size()
24.2 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Or idiomatically

In [38]: %timeit df.resample('D', on='result_date').size()
5.47 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
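One behavioral difference worth knowing before swapping groupby for resample: resample produces a regular daily index, so days with no rows appear with a count of 0, while groupby only emits days that actually occur. A small sketch (hypothetical three-row frame) illustrating this:

```python
import pandas as pd

df = pd.DataFrame({'result_date': pd.to_datetime(
    ['2017-09-01 09:12', '2017-09-01 18:00', '2017-09-03 07:30'])})

# groupby(normalize) only emits days that actually occur in the data...
by_group = df.groupby(df['result_date'].dt.normalize()).size()

# ...while resample fills the gap (2017-09-02) with a zero count
by_resample = df.resample('D', on='result_date').size()

print(len(by_group))     # 2
print(len(by_resample))  # 3
```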

2 Comments

What about floor? It is faster than normalize.
yep that's reasonable, but the canonical method is .normalize()

Use dt.date, which also handles NaT nicely:

df['result_day'] = df['result_date'].dt.date

rng = pd.date_range('2000-04-03', periods=200000, freq='2H')
df = pd.DataFrame({'result_date': rng})  

In [216]: %timeit df['result_date'].dt.date
1 loop, best of 3: 474 ms per loop

In [217]: %timeit df['result_date'].apply(lambda x: str(x.date()))
1 loop, best of 3: 740 ms per loop

In [218]: %timeit df['result_date'].apply(lambda x: x.date())
1 loop, best of 3: 559 ms per loop

EDIT:

I think floor is faster than normalize:

# home notebook, so timings differ from the above

In [3]: %timeit df['result_date'].dt.date
1 loop, best of 3: 854 ms per loop

In [4]: %timeit df['result_date'].dt.normalize()
10 loops, best of 3: 27.8 ms per loop

In [5]: %timeit df['result_date'].dt.floor('D')
100 loops, best of 3: 13.1 ms per loop

In [6]: %timeit df.groupby(df['result_date'].dt.date).size()
1 loop, best of 3: 883 ms per loop

In [7]: %timeit df.groupby(df['result_date'].dt.normalize()).size()
10 loops, best of 3: 40.2 ms per loop

In [8]: %timeit df.groupby(df['result_date'].dt.floor('D')).size()
10 loops, best of 3: 25.9 ms per loop
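For naive (timezone-unaware) timestamps, floor('D') and normalize() give identical results, since rounding down to day granularity is the same as zeroing the time component; a quick sketch confirming the equivalence:

```python
import pandas as pd

# Timestamps with non-midnight times, no timezone attached
s = pd.Series(pd.date_range('2000-04-03 01:30', periods=5, freq='7h'))

# For naive timestamps the two approaches are interchangeable
print(s.dt.floor('D').equals(s.dt.normalize()))  # True
```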

EDIT1:

A numpy alternative is faster still, but as Jeff pointed out:

it’s faster but you lose things like time zones and any higher level methods.

In [9]: %timeit df['result_date'].values.astype('datetime64[D]')
100 loops, best of 3: 2.39 ms per loop
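To make the trade-off concrete: casting the underlying ndarray to datetime64[D] drops pandas metadata (you get a plain numpy array, no .dt accessor, no timezone handling). A sketch of the cast and a round-trip back into pandas:

```python
import pandas as pd
import numpy as np

s = pd.Series(pd.date_range('2000-04-03', periods=5, freq='2h'))

# Cast the underlying ndarray down to day precision
days = s.values.astype('datetime64[D]')
print(days.dtype)  # datetime64[D] -- a plain numpy array now

# Round-tripping back into pandas restores datetime64[ns],
# with the time truncated to midnight
back = pd.Series(days.astype('datetime64[ns]'))
print((back.dt.hour == 0).all())  # True
```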

3 Comments

In my case it is faster, but not as fast as yours: In [77]: %timeit df['result_date'].dt.date → 1.44 s ± 42.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I am not sure why you show numpy at all; sure, it's faster, but you lose things like time zones and any higher-level methods. This is just confusing to most users.
That is true, so I added it to the answer.
