
I have a large dataframe of the form

timestamp | col1 | col2 ...

I want to select rows spaced out by an interval of at least x minutes, where x can be 5, 10, 30, etc. The problem is that the timestamps aren't equally spaced, so I can't do a simple "take every nth row" trick.

Example:

timestamp | col1 | col2

'2019-01-15 17:52:29.955000', x, b
'2019-01-15 17:58:29.531000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 03:27:46.324000', x, b
'2019-01-16 03:33:09.984000', x, b
'2019-01-16 07:22:08.170000', x, b
'2019-01-16 07:28:27.406000', x, b
'2019-01-16 07:34:35.194000', x, b

if interval = 10:

result:

'2019-01-15 17:52:29.955000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 03:33:09.984000', x, b
'2019-01-16 07:22:08.170000', x, b
'2019-01-16 07:34:35.194000', x, b

if interval = 30:

result:

'2019-01-15 17:52:29.955000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 07:22:08.170000', x, b

I could do a brute-force n^2 approach, but I'm sure there's a pandas way for this that I'm missing.

Thank you! :)

EDIT: Just to clarify, this is not a duplicate of "Calculate time difference between Pandas Dataframe indices". I need to subset a dataframe based on a given interval.

  • I think a loop is necessary here. Dropping one row then impacts the decision for all other rows. For instance, if you had rows 1, 2, 3, 4, 12, 20, 27, you don't know to keep 12 until you've dropped 2, 3 and 4 for being too close to 1 (if the diff is >10). Commented Jul 10, 2019 at 17:57
  • @ALollz Not really. If you instead generated an index of rows to keep, you could subset them immediately, no? Commented Jul 10, 2019 at 18:02
  • @steven How is this a duplicate? It's asking a completely different thing. Please do not flag unnecessarily. Commented Jul 10, 2019 at 18:12
  • Looks like you need to make do with a simple for loop. It's O(n), not O(n**2). Commented Jul 10, 2019 at 18:13
  • You just need to calculate the difference between rows and then select based on it, right? Commented Jul 10, 2019 at 18:14
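
To make the disagreement in these comments concrete, here is a minimal sketch using the rows 1, 2, 3, 4, 12, 20, 27 from the first comment. The function names `naive_diff_filter` and `greedy_filter` are hypothetical, and a gap of "at least 10" is assumed, to match the question's wording. Filtering on the difference between consecutive rows only compares each row with its immediate neighbour, while the requirement is a distance from the last row that was kept.

```python
import pandas as pd


def naive_diff_filter(values, gap):
    """Keep the first row plus every row that is >= gap away from the previous ROW."""
    s = pd.Series(values)
    # The first diff is NaN; fill it with `gap` so the first row is always kept.
    mask = s.diff().fillna(gap) >= gap
    return s[mask].tolist()


def greedy_filter(values, gap):
    """Keep a row only if it is >= gap away from the last KEPT row."""
    kept = [values[0]]
    for v in values[1:]:
        if v - kept[-1] >= gap:
            kept.append(v)
    return kept


rows = [1, 2, 3, 4, 12, 20, 27]
print(naive_diff_filter(rows, 10))  # [1]
print(greedy_filter(rows, 10))      # [1, 12, 27]
```

No two consecutive rows are 10 apart, so the diff-only filter keeps just the first row, whereas comparing against the last kept row yields 1, 12 and 27.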

1 Answer


As noted in the comments, it looks like you need a for loop. That is not too bad, because it is a single O(n) pass over the rows:

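import pandas as pd

def sampling(df, thresh):
    # Accept a string or a Timedelta as the minimum spacing.
    thresh = pd.to_timedelta(thresh)
    # Gap between each row and the previous one; the first row has no predecessor.
    time_diff = df.timestamp.diff().fillna(pd.Timedelta(seconds=0))
    # Always keep the first row (by label, so the index need not start at 0).
    ret = [df.index[0]]
    # Time elapsed since the last kept row.
    running_total = pd.to_timedelta(0)
    for i in df.index:
        running_total += time_diff.loc[i]
        if running_total >= thresh:
            ret.append(i)
            running_total = pd.to_timedelta(0)

    return df.loc[ret].copy()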

Then sampling(df, '10T') gives ('10min' is an equivalent spelling of the same interval):

                timestamp col1 col2
0 2019-01-15 17:52:29.955    x    b
2 2019-01-16 03:21:48.255    x    b
4 2019-01-16 03:33:09.984    x    b
5 2019-01-16 07:22:08.170    x    b
7 2019-01-16 07:34:35.194    x    b

and sampling(df, '30T') gives:

                timestamp col1 col2
0 2019-01-15 17:52:29.955    x    b
2 2019-01-16 03:21:48.255    x    b
5 2019-01-16 07:22:08.170    x    b
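
For readers who want fewer Python-level iterations, the following is a sketch of the same greedy rule that jumps from one kept row to the next with `numpy.searchsorted`. It is not part of the original answer: the function name `sampling_searchsorted` and the `col` parameter are hypothetical, and it assumes the timestamp column is already parsed as datetimes and sorted ascending. It still loops, but only once per kept row rather than once per row.

```python
import numpy as np
import pandas as pd


def sampling_searchsorted(df, thresh, col="timestamp"):
    """Greedy selection: keep a row, then jump to the first row >= thresh later.

    Assumes df[col] has a datetime64 dtype and is sorted in ascending order.
    """
    thresh = pd.to_timedelta(thresh).to_timedelta64()
    ts = df[col].to_numpy()
    keep = []
    pos = 0
    while pos < len(ts):
        keep.append(pos)
        # First position whose timestamp is at least `thresh` after the kept row.
        nxt = int(np.searchsorted(ts, ts[pos] + thresh, side="left"))
        # Always advance by at least one row so a zero threshold cannot loop forever.
        pos = max(pos + 1, nxt)
    return df.iloc[keep].copy()


# Sample data from the question.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-15 17:52:29.955000",
        "2019-01-15 17:58:29.531000",
        "2019-01-16 03:21:48.255000",
        "2019-01-16 03:27:46.324000",
        "2019-01-16 03:33:09.984000",
        "2019-01-16 07:22:08.170000",
        "2019-01-16 07:28:27.406000",
        "2019-01-16 07:34:35.194000",
    ]),
    "col1": "x",
    "col2": "b",
})

print(sampling_searchsorted(df, "10min").index.tolist())  # [0, 2, 4, 5, 7]
print(sampling_searchsorted(df, "30min").index.tolist())  # [0, 2, 5]
```

Each jump is a binary search, so the cost is roughly O(k log n) for k kept rows. When almost every row is kept, this is no better than the plain loop.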

2 Comments

Thank you! :) '10T' gave me an error but '10M' worked fine, thanks a lot!
Hm.. Was looking for a non-for-loop solution. But looks good.
