pandas select rows with given timestamp interval

Question

I have a large dataframe of the form

timestamp | col1 | col2 ...

I want to select rows spaced out by an interval of at least x minutes, where x can be 5, 10, 30, etc. The problem is that the timestamps aren't equally spaced, so I can't do a simple "take every nth row" trick.

Example:

timestamp | col1 | col2

'2019-01-15 17:52:29.955000', x, b
'2019-01-15 17:58:29.531000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 03:27:46.324000', x, b
'2019-01-16 03:33:09.984000', x, b
'2019-01-16 07:22:08.170000', x, b
'2019-01-16 07:28:27.406000', x, b
'2019-01-16 07:34:35.194000', x, b
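To make the example reproducible, the sample rows above can be loaded into a DataFrame roughly as follows. This is only a sketch: the string values "x" and "b" stand in for the real column contents, and the integer index 0..7 is the pandas default rather than anything stated in the question.

```python
import pandas as pd

# Timestamps copied from the example; "x" and "b" are placeholder values.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-15 17:52:29.955000",
        "2019-01-15 17:58:29.531000",
        "2019-01-16 03:21:48.255000",
        "2019-01-16 03:27:46.324000",
        "2019-01-16 03:33:09.984000",
        "2019-01-16 07:22:08.170000",
        "2019-01-16 07:28:27.406000",
        "2019-01-16 07:34:35.194000",
    ]),
    "col1": "x",
    "col2": "b",
})

print(df.dtypes)
```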

if interval = 10:

result:

'2019-01-15 17:52:29.955000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 03:33:09.984000', x, b
'2019-01-16 07:22:08.170000', x, b
'2019-01-16 07:34:35.194000', x, b

if interval = 30:

result:

'2019-01-15 17:52:29.955000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 07:22:08.170000', x, b

I could do a brute-force n^2 approach, but I'm sure there's a pandas way to do this that I'm missing.

Thank you! :)

EDIT: Just to clarify, this is not a duplicate of "Calculate time difference between Pandas Dataframe indices". I need to subset a dataframe based on a given interval.

I think a loop is necessary here. Dropping one row then impacts the decision for all later rows. For instance, if you had rows 1, 2, 3, 4, 12, 20, 27, you don't know to keep 12 until you've dropped 2, 3 and 4 for being too close to 1 (if the diff is >10).
– ALollz, Commented Jul 10, 2019 at 17:57
@ALollz Not really. If you instead generated an index of rows to keep, you could subset them immediately, no?
– Wboy, Commented Jul 10, 2019 at 18:02
@steven How is this a duplicate? It's asking a completely different thing. Please do not flag unnecessarily.
– Wboy, Commented Jul 10, 2019 at 18:12
Looks like you need to make do with a simple for loop. It's O(n), not O(n**2).
– Quang Hoang, Commented Jul 10, 2019 at 18:13
You just need to calculate the difference between rows and then select based on it, right?
– steven, Commented Jul 10, 2019 at 18:14
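The disagreement in these comments can be checked directly. The sketch below applies the "difference between consecutive rows" idea to the timestamps from the question, with a 10-minute interval. The variable names are illustrative only. It keeps rows 0, 2 and 5, whereas the expected result in the question also contains rows 4 and 7, because those are more than 10 minutes after the last *kept* row even though they are less than 10 minutes after their immediate predecessor.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    "2019-01-15 17:52:29.955000",
    "2019-01-15 17:58:29.531000",
    "2019-01-16 03:21:48.255000",
    "2019-01-16 03:27:46.324000",
    "2019-01-16 03:33:09.984000",
    "2019-01-16 07:22:08.170000",
    "2019-01-16 07:28:27.406000",
    "2019-01-16 07:34:35.194000",
]))
interval = pd.Timedelta(minutes=10)

# Naive idea: keep a row when the gap to the previous row is at least `interval`.
gaps = ts.diff()
naive_keep = gaps.isna() | (gaps >= interval)  # the first row has no predecessor
naive_rows = ts.index[naive_keep].tolist()

print(naive_rows)  # [0, 2, 5]
```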

Quang Hoang · Accepted Answer · 2019-07-10 18:27:07Z

5

As noted in the comments, it looks like you need a for loop. That is not too bad, because it is a single O(n) pass over the rows:

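import pandas as pd

def sampling(df, thresh):
    # Assumes df is sorted by its 'timestamp' column (datetime64 dtype).
    thresh = pd.to_timedelta(thresh)
    # Gap to the previous row; the first row has no predecessor, so use 0.
    time_diff = df.timestamp.diff().fillna(pd.Timedelta(seconds=0))
    ret = [df.index[0]]  # always keep the first row, whatever its label
    running_total = pd.to_timedelta(0)  # time elapsed since the last kept row
    for i in df.index:
        running_total += time_diff.loc[i]
        if running_total >= thresh:
            ret.append(i)
            running_total = pd.to_timedelta(0)

    return df.loc[ret].copy()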

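Then sampling(df, '10T') gives the following ('10min' is an equivalent spelling, and the one to prefer on pandas 2.2 or later, where the 'T' alias is deprecated):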

                timestamp col1 col2
0 2019-01-15 17:52:29.955    x    b
2 2019-01-16 03:21:48.255    x    b
4 2019-01-16 03:33:09.984    x    b
5 2019-01-16 07:22:08.170    x    b
7 2019-01-16 07:34:35.194    x    b

and sampling(df, '30T') gives:

                timestamp col1 col2
0 2019-01-15 17:52:29.955    x    b
2 2019-01-16 03:21:48.255    x    b
5 2019-01-16 07:22:08.170    x    b
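Since the running total is reset every time a row is kept, it always equals the time elapsed since the last kept row. The same selection can therefore be written by comparing each timestamp with the last kept one. The sketch below does that by position, so it does not depend on the index labels. The function name `select_spaced` and the `column` parameter are made up for illustration, and the input is assumed to be sorted by timestamp.

```python
import pandas as pd

def select_spaced(df, interval, column="timestamp"):
    """Keep the first row, then every row at least `interval` after the last kept row.

    Hypothetical helper; assumes `df` is already sorted by `column`.
    """
    interval = pd.Timedelta(interval)
    stamps = df[column].tolist()  # list of pd.Timestamp objects
    if not stamps:
        return df.copy()
    keep = [0]          # positions (not labels) of the rows to keep
    last = stamps[0]    # timestamp of the most recently kept row
    for pos in range(1, len(stamps)):
        if stamps[pos] - last >= interval:
            keep.append(pos)
            last = stamps[pos]
    return df.iloc[keep].copy()

# Usage with the question's timestamps and a non-default index.
df = pd.DataFrame(
    {"timestamp": pd.to_datetime([
        "2019-01-15 17:52:29.955", "2019-01-15 17:58:29.531",
        "2019-01-16 03:21:48.255", "2019-01-16 03:27:46.324",
        "2019-01-16 03:33:09.984", "2019-01-16 07:22:08.170",
        "2019-01-16 07:28:27.406", "2019-01-16 07:34:35.194",
    ])},
    index=list("abcdefgh"),
)
print(select_spaced(df, "10min").index.tolist())  # ['a', 'c', 'e', 'f', 'h']
print(select_spaced(df, "30min").index.tolist())  # ['a', 'c', 'f']
```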

edited Jul 10, 2019 at 18:27

answered Jul 10, 2019 at 18:08

2 Comments

Wboy Over a year ago

Thank you! :) '10T' gave me an error but '10M' worked fine, thanks a lot!

gies0r Over a year ago

Hm.. Was looking for a non-for-loop solution. But looks good.
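A fully loop-free version is not obvious, because each decision depends on the previously kept row. One possible middle ground, sketched here as an assumption rather than as part of the accepted answer, is to jump straight to the next row to keep with `numpy.searchsorted`, so the Python loop runs once per kept row instead of once per input row. The name `select_spaced_jump` is hypothetical, and the sketch assumes a sorted, timezone-naive datetime64 column.

```python
import numpy as np
import pandas as pd

def select_spaced_jump(df, interval, column="timestamp"):
    """Hypothetical helper: same selection as the loop, one iteration per kept row."""
    stamps = df[column].to_numpy()                  # sorted datetime64 values (assumed)
    step = pd.Timedelta(interval).to_timedelta64()  # interval as numpy timedelta64
    keep = []
    pos = 0
    while pos < len(stamps):
        keep.append(pos)
        # First position whose timestamp is >= the kept timestamp + interval.
        nxt = int(np.searchsorted(stamps, stamps[pos] + step, side="left"))
        pos = max(nxt, pos + 1)  # always advance, even for a zero interval
    return df.iloc[keep].copy()

# Usage with the timestamps from the question.
df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2019-01-15 17:52:29.955", "2019-01-15 17:58:29.531",
    "2019-01-16 03:21:48.255", "2019-01-16 03:27:46.324",
    "2019-01-16 03:33:09.984", "2019-01-16 07:22:08.170",
    "2019-01-16 07:28:27.406", "2019-01-16 07:34:35.194",
])})
print(select_spaced_jump(df, "10min").index.tolist())  # [0, 2, 4, 5, 7]
print(select_spaced_jump(df, "30min").index.tolist())  # [0, 2, 5]
```

Whether this is faster in practice depends on how many rows are kept relative to the total; when almost every row is kept, it does more work per row than the plain loop.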
