Performance issues with pandas and filtering on datetime column

Question

I've a pandas dataframe with a datetime64 object on one of the columns.

    time    volume  complete    closeBid    closeAsk    openBid openAsk highBid highAsk lowBid  lowAsk  closeMid
0   2016-08-07 21:00:00+00:00   9   True    0.84734 0.84842 0.84706 0.84814 0.84734 0.84842 0.84706 0.84814 0.84788
1   2016-08-07 21:05:00+00:00   10  True    0.84735 0.84841 0.84752 0.84832 0.84752 0.84846 0.84712 0.8482  0.84788
2   2016-08-07 21:10:00+00:00   10  True    0.84742 0.84817 0.84739 0.84828 0.84757 0.84831 0.84735 0.84817 0.847795
3   2016-08-07 21:15:00+00:00   18  True    0.84732 0.84811 0.84737 0.84813 0.84737 0.84813 0.84721 0.8479  0.847715
4   2016-08-07 21:20:00+00:00   4   True    0.84755 0.84822 0.84739 0.84812 0.84755 0.84822 0.84739 0.84812 0.847885
5   2016-08-07 21:25:00+00:00   4   True    0.84769 0.84843 0.84758 0.84827 0.84769 0.84843 0.84758 0.84827 0.84806
6   2016-08-07 21:30:00+00:00   5   True    0.84764 0.84851 0.84768 0.84852 0.8478  0.84857 0.84764 0.84851 0.848075
7   2016-08-07 21:35:00+00:00   4   True    0.84755 0.84825 0.84762 0.84844 0.84765 0.84844 0.84755 0.84824 0.8479
8   2016-08-07 21:40:00+00:00   1   True    0.84759 0.84812 0.84759 0.84812 0.84759 0.84812 0.84759 0.84812 0.847855
9   2016-08-07 21:45:00+00:00   3   True    0.84727 0.84817 0.84743 0.8482  0.84743 0.84822 0.84727 0.84817 0.84772

My application follows the (simplified) structure below:

class Runner():
    def execute_tick(self, clock_tick, previous_tick):
        candles = self.broker.get_new_candles(clock_tick, previous_tick)
        if candles:
            run_calculations(candles)

class Broker():
    def get_new_candles(clock_tick, previous_tick)
        start = previous_tick - timedelta(minutes=1)
        end = clock_tick - timedelta(minutes=3)
        return df[(df.time > start) & (df.time <= end)]

I noticed when profiling the app, that calling the df[(df.time > start) & (df.time <= end)] causes the highest performance issues and I was wondering if there is a way to speed up these calls?

EDIT: I'm adding some more info about the use-case here (also, source is available at: https://github.com/jmelett/pyFxTrader)

The application will accept a list of instruments (e.g. EUR_USD, USD_JPY, GBP_CHF) and then pre-fetch ticks/candles for each one of them and their timeframes (e.g. 5 minutes, 30 minutes, 1 hour etc.). The initialised data is basically a dict of Instruments, each containing another dict with candle data for M5, M30, H1 timeframes.
Each "timeframe" is a pandas dataframe like shown at the top
A clock simulator is then used to query the individual candles for the specific time (e.g. at 15:30:00, give me the last x "5-minute-candles") for EUR_USD
This piece of data is then used to "simulate" specific market conditions (e.g. average price over last 1 hour increased by 10%, buy market position)

Have you tried using a date range or Timestamps, e.g. df['time']= [pd.Timestamp(idx) for idx in df['time']] — Alexander
– Alexander, Commented Aug 11, 2016 at 17:06
Yep, I tried Timestamps as well but still calling roughly 7'000 times the method takes roughly ~600s — Joseph jun. Melettukunnel
– Joseph jun. Melettukunnel, Commented Aug 11, 2016 at 17:23
Could you give us a bigger context to it? Are you running it in a loop, with maybe different start and end and the df.time staying the same? — Divakar
– Divakar, Commented Aug 11, 2016 at 17:31
Exactly. I working on an event-driven backtesting tool with one object [A] having all the dataframes stored and another object [B] which calls specific dataframes on every "tick" (e.g. 15:00:01, 15:00:02 etc.) from [A]. Does this make sense? Else I can put some example code on GitHub — Joseph jun. Melettukunnel
– Joseph jun. Melettukunnel, Commented Aug 11, 2016 at 17:35
Well I was getting at vectorizing things. But, since you have mentioned dataframes, so that might not be feasible or even worth it. — Divakar
– Divakar, Commented Aug 11, 2016 at 17:38

piRSquared · Accepted Answer · 2016-08-19 18:02:10Z

3

+50

If efficiency is your goal, I'd use numpy for just about everything

I rewrote get_new_candles as get_new_candles2

def get_new_candles2(clock_tick, previous_tick):
    start = previous_tick - timedelta(minutes=1)
    end = clock_tick - timedelta(minutes=3)
    ge_start = df.time.values >= start.to_datetime64()
    le_end = df.time.values <= end.to_datetime64()
    return pd.DataFrame(df.values[ge_start & le_end], df.index[mask], df.columns)

Setup of data

from StringIO import StringIO
import pandas as pd

text = """time,volume,complete,closeBid,closeAsk,openBid,openAsk,highBid,highAsk,lowBid,lowAsk,closeMid
2016-08-07 21:00:00+00:00,9,True,0.84734,0.84842,0.84706,0.84814,0.84734,0.84842,0.84706,0.84814,0.84788
2016-08-07 21:05:00+00:00,10,True,0.84735,0.84841,0.84752,0.84832,0.84752,0.84846,0.84712,0.8482,0.84788
2016-08-07 21:10:00+00:00,10,True,0.84742,0.84817,0.84739,0.84828,0.84757,0.84831,0.84735,0.84817,0.847795
2016-08-07 21:15:00+00:00,18,True,0.84732,0.84811,0.84737,0.84813,0.84737,0.84813,0.84721,0.8479,0.847715
2016-08-07 21:20:00+00:00,4,True,0.84755,0.84822,0.84739,0.84812,0.84755,0.84822,0.84739,0.84812,0.847885
2016-08-07 21:25:00+00:00,4,True,0.84769,0.84843,0.84758,0.84827,0.84769,0.84843,0.84758,0.84827,0.84806
2016-08-07 21:30:00+00:00,5,True,0.84764,0.84851,0.84768,0.84852,0.8478,0.84857,0.84764,0.84851,0.848075
2016-08-07 21:35:00+00:00,4,True,0.84755,0.84825,0.84762,0.84844,0.84765,0.84844,0.84755,0.84824,0.8479
2016-08-07 21:40:00+00:00,1,True,0.84759,0.84812,0.84759,0.84812,0.84759,0.84812,0.84759,0.84812,0.847855
2016-08-07 21:45:00+00:00,3,True,0.84727,0.84817,0.84743,0.8482,0.84743,0.84822,0.84727,0.84817,0.84772
"""

df = pd.read_csv(StringIO(text), parse_dates=[0])

Test input variables

previous_tick = pd.to_datetime('2016-08-07 21:10:00')
clock_tick = pd.to_datetime('2016-08-07 21:45:00')

get_new_candles2(clock_tick, previous_tick)

Timing

answered Aug 19, 2016 at 18:02

piRSquared

296k68 gold badges509 silver badges654 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

jezrael Over a year ago

Very nice and 3 times faster.

CptSnuggles Over a year ago

I don't understand where to get the variable "mask" from. You have not declared it in get_new_candles2().

ptrj · Accepted Answer · 2016-08-13 16:14:42Z

0

It's a bit of speculation because I can't test it but two ideas come to my mind.

Using a lookup series to determine the start and end indices of a data frame to be returned:

s = pd.Series(np.arange(len(df)), index=df.time)
start = s.asof(start)
end = s.asof(end)
ret = df.iloc[start + 1 : end]

Setting df.time column as an index and taking a slice. (It may or may not be a good solution for other reasons since this column contains duplicates.)
```
df = df.set_index('time')
ret = df.loc[start : end]
```
You may need to add some small Timedelta to start.

In both cases you can do the main step (construct the series or set the index) only once per each data frame.

answered Aug 13, 2016 at 16:14

ptrj

5,23221 silver badges31 bronze badges

Comments

den.run.ai · Accepted Answer · 2016-08-15 23:19:36Z

0

Pandas query can use numexpr as engine to speed up evaluation:

df.query('time > @start & time <= @end')

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html

answered Aug 15, 2016 at 23:19

den.run.ai

5,97318 gold badges88 silver badges146 bronze badges

4 Comments

Joseph jun. Melettukunnel Over a year ago

Thanks for the suggestion! I just tried it out with 7000 rows and unfortuntely it slowed down the performance by ~120%.

den.run.ai Over a year ago

which version of python, pandas, numpy, and numexpr are you using?

den.run.ai Over a year ago

also can you try explicitly passing engine='numexpr' keyword argument to query?

den.run.ai Over a year ago

in this section it is explained that for <20k rows the plain python version is faster than numexpr: pandas.pydata.org/pandas-docs/stable/…

ℕʘʘḆḽḘ · Accepted Answer · 2016-08-16 12:02:35Z

0

it seems to me that using the ix locator is faster

df.sort_values(by='time',inplace=True)
df.ix[(df.time > start) & (df.time <= end),:]

answered Aug 16, 2016 at 12:02

ℕʘʘḆḽḘ

19.5k39 gold badges148 silver badges259 bronze badges

Comments

Community · Accepted Answer · 2017-05-23 10:29:18Z

0

I've learned that those datetime objects can become very memory hungry and require more computational effort, Especially if they are set to be the index (DatetimeIndex objects?)

I think your best bet is to just cast the df.time, start and end into UNIX timestamps (as ints, no longer the datetime dtypes), and do a simple integer comparison.

The UNIX timestamp will look like: 1471554233 (the time of this posting). More on that here: https://en.wikipedia.org/wiki/Unix_time

Some considerations when doing this (eg keep in mind timezones): Convert datetime to Unix timestamp and convert it back in python

edited May 23, 2017 at 10:29

CommunityBot

11 silver badge

answered Aug 18, 2016 at 21:06

Mansweet

1517 bronze badges

Comments

Gustavo Bezerra · Accepted Answer · 2016-08-19 10:57:37Z

I guess you are already running things in a relatively efficient way.

When working with time series, it's usually best practice use the column using your timestamps as the DataFrame index. Using a RangeIndex as your index isn't of much use. However, I ran a couple of tests on a (2650069, 2) DataFrame containing 6 months of trade tick data from a given stock at a given exchange and it turns out your approach (creating a boolean array and using it to slice the DataFrame) seems to be 10x faster than regular DatetimeIndex slicing (which I thought was faster).

The data I tested looks like:

                                Price  Volume
time                                         
2016-02-10 11:16:15.951403000  6197.0   200.0
2016-02-10 11:16:16.241380000  6197.0   100.0
2016-02-10 11:16:16.521871000  6197.0   900.0
2016-02-10 11:16:16.541253000  6197.0   100.0
2016-02-10 11:16:16.592049000  6196.0   200.0

Setting start/end:

start = df.index[len(df)/4]
end = df.index[len(df)/4*3]

Test 1:

%%time
_ = df[start:end]  # Same for df.ix[start:end]

CPU times: user 413 ms, sys: 20 ms, total: 433 ms
Wall time: 430 ms

On the other hand, using your approach:

df = df.reset_index()
df.columns = ['time', 'Price', 'Volume']

Test 2:

%%time
u = (df['time'] > start) & (df['time'] <= end)

CPU times: user 21.2 ms, sys: 368 µs, total: 21.6 ms
Wall time: 20.4 ms

Test 3:

%%time
_ = df[u]

CPU times: user 10.4 ms, sys: 27.6 ms, total: 38.1 ms
Wall time: 36.8 ms

Test 4:

%%time
_ = df[(df['time'] > start) & (df['time'] <= end)]

CPU times: user 21.6 ms, sys: 24.3 ms, total: 45.9 ms
Wall time: 44.5 ms

Note: Each code block corresponds to an Jupyter notebook cell and its output. I am using the %%time magic because %%timeit usually yields some caching problems which make the code seems faster than it actually is. Also, the kernel has been restarted after each run.

I am not totally sure why that is the case (I thought slicing using a DatetimeIndex would make things faster), but I guess it probably has something to do with how things work under the hood with numpy (most likely the datetime slicing operation generates a boolean array which is then used internally by numpy to actually do the slicing - but don't quote me on that).

Collectives™ on Stack Overflow

Performance issues with pandas and filtering on datetime column

6 Answers 6

Setup of data

Test input variables

Timing

2 Comments

Comments

4 Comments

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

Setup of data

Test input variables

Timing

2 Comments

Comments

4 Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related